
How are LLMs creative?

Author(s): Sushil Khadka

Originally published on Towards AI.

If you’ve used any generative AI models such as GPT, Llama, etc., there’s a good chance you’ve encountered the term ‘temperature’.

Photo by Khashayar Kouchpeydeh on Unsplash

For starters, ‘temperature’ is a parameter that controls the creativity of generated content.

But wait, does that sound super nerdy? When I first heard this, I thought there must be some cool physics going on. And why not? That’s what most people associate temperature with.

In this post, I explain temperature in generative AI models, especially LLMs, and show you mathematically how it works and powers creativity in these models.

If you’re unfamiliar with the term ‘token,’ you can think of it as a ‘word’ for simplicity.

Large Language Models (LLMs) belong to the family of autoregressive models. What do I mean by autoregressive models? In layman's terms, these are statistical models that use past values to predict future values. In the case of LLMs, the past values are the tokens you input, and the future values are the generated tokens.

It should be noted that LLMs can’t generate a whole sentence at once; they’re autoregressive models designed to predict just the next token. This token is then added to the input to produce another token, and the chain continues until the end token <EOT> is generated, which signals the model to stop.

To generate the next token, an LLM outputs a list of probabilities for all possible tokens in its vocabulary.

If the vocabulary size of an LLM is 100, it means it can generate only one of those 100 tokens at a time. It returns a probability score for each of these 100 tokens, signifying how likely that token is to come next in the sequence. In reality, vocabulary sizes are much larger; for instance, Mistral 7B has a vocabulary of 32K tokens.

In a nutshell, an LLM takes in an input sequence of tokens, processes it, and outputs a list of probabilities for each token in its vocabulary. Usually, the token with the highest probability is returned as the next token in the generation.
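To make that loop concrete, here is a toy sketch of autoregressive generation (a minimal illustration: the tiny vocabulary and the model_logits stand-in are invented for this example; a real LLM replaces them with a trained network over tens of thousands of tokens):

import numpy as np

vocab = ["Hello", "world", "!", "<EOT>"]  # toy vocabulary, invented for illustration

def model_logits(tokens):
    # Stand-in for the neural network: returns one raw score (logit) per vocab entry.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=len(vocab))

def softmax(logits):
    # This normalization is the softmax layer discussed in the next section.
    exps = np.exp(logits)
    return exps / exps.sum()

tokens = ["Hello"]
while tokens[-1] != "<EOT>" and len(tokens) < 10:
    probs = softmax(model_logits(tokens))      # one probability per vocabulary token
    next_token = vocab[int(np.argmax(probs))]  # pick the most likely token (greedy)
    tokens.append(next_token)                  # feed it back in and repeat

print(tokens)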

Softmax

The task of calculating this list of probabilities is carried out by a softmax layer. Every LLM has this ‘magical’ layer as its final layer: it takes in a logit vector (logits are the unnormalized, raw scores associated with each token) and outputs a proper probability distribution.

Let’s momentarily set softmax aside and create our own function to produce probabilities. Our function should meet specific criteria, and if we succeed, why would we even need Softmax?

1. Take an input vector and produce an equally sized output vector.
2. Ensure that each element in the output vector is non-negative (since probabilities can’t be negative).
3. Reflect that larger input values correspond to larger output values.
4. Ensure that all the elements in the output vector add up to 1 (since it’s a probability distribution).

To satisfy these criteria, we’ll perform an element-wise transformation. The transformation should be non-negative, meaning we need a function that always outputs positive values. Additionally, it should be monotonically increasing, which ensures that larger input values are always transformed into larger output values.

The eˣ function satisfies all the above conditions (1 to 3).

Our strategy will be to apply the eˣ function to each element in the input vector, resulting in a vector of the same length whose entries are all positive.

But we missed the most important requirement. Did you figure it out?

The values in the output vector can be larger than 1 and don’t add up to 1. As we can see from the eˣ curve, at x = 0, y = 1, and as x increases beyond 0, y exceeds 1. This doesn’t work for our case because we need probabilities, which require each value to be at most 1 and all of them to sum to 1.

To solve this issue, we can divide all the elements in the output vector by the sum of all the elements in the output vector. This ensures each value will be smaller than one and all will sum up to 1. This step is known as normalization.

And with that, we’ve arrived at Softmax. Yes, this is exactly the Softmax function; we derived it here to build intuition rather than just state it.
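Written out explicitly, the function we have just reconstructed is the standard Softmax (the same definition as in the Wikipedia article listed in the references):

\[
\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}, \qquad i = 1, \dots, n
\]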

Note: The transforming function should also be differentiable so that the loss can be back-propagated during training. This is yet another reason to choose e in Softmax; the derivative of the exponential function is about as easy as derivatives get 🙂

Now that we understand how the probabilities are generated in an LLM using the Softmax function, let’s explore how we can introduce some creativity into the model.

Creativity in AI models, really?

To introduce some creativity, we need to “flatten” the probability distribution generated by the softmax. What do I mean by flatten?

Let’s try to understand with an example:

Input to the LLM:

“Complete the conversation,

A: Hey, How you doin’?

B:”

The LLM is now tasked with predicting the first token B says.

For simplicity, let’s consider a vocabulary of only six words.

The model processes this input and produces a logit vector, which is then converted to probabilities by Softmax. Say the logit vector, which is the input to the softmax layer, is [0.1, 0, 0.5, 1, 4, 0.6] for the tokens [‘Ni Hao’, ‘Konnichiwa’, ‘Hola’, ‘Namaste’, ‘Hello’, ‘Ciao’]. The output of Softmax will then be roughly [0.02, 0.02, 0.03, 0.04, 0.87, 0.03].
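A quick way to check these numbers (a minimal NumPy sketch; the logits and token order are the ones from the example above):

import numpy as np

logits = np.array([0.1, 0.0, 0.5, 1.0, 4.0, 0.6])
probs = np.exp(logits) / np.exp(logits).sum()  # element-wise e^x, then normalize
print(probs.round(2))                          # [0.02 0.02 0.03 0.04 0.87 0.03]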

Well, the chance of sampling ‘Hello’ (the fifth word in the list) as the next token is about 87%. But there’s no fun when one probability score dominates, right? It’s almost certain that ‘Hello’ will be selected as the output, and whenever the same situation occurs, the model will almost always choose the token “Hello”. This is very predictable, exactly the opposite of being creative, because there’s no room for randomness and no scope for trying out different yet valid tokens.

What if we have a softmax output of roughly [0.13, 0.12, 0.14, 0.15, 0.28, 0.14] for the same tokens [‘Ni Hao’, ‘Konnichiwa’, ‘Hola’, ‘Namaste’, ‘Hello’, ‘Ciao’]?

There’s still a high chance that “Hello” will be sampled, but the other tokens also have a decent chance compared to last time.

In this case, the model has some scope to deviate from generating the usual “Hello” and try out “Namaste”, “Hola”, etc.; it’s basically incorporating a little randomness. This can have a ripple effect, and the whole conversation might go in the most unexpected direction, like B answering in Japanese and A trying to figure out which language B is speaking, who knows. Doesn’t this sound creative to you? For me, there’s no better definition 🙂
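Concretely, ‘incorporating a little randomness’ just means sampling the next token according to these probabilities instead of always taking the most likely one. A minimal sketch using the two distributions from the example above (the rounded values are renormalized so they sum exactly to 1):

import numpy as np

rng = np.random.default_rng(0)
tokens = ["Ni Hao", "Konnichiwa", "Hola", "Namaste", "Hello", "Ciao"]
peaked = np.array([0.02, 0.02, 0.03, 0.04, 0.87, 0.03])  # the dominant-"Hello" case
flat = np.array([0.13, 0.12, 0.14, 0.15, 0.28, 0.14])    # the flattened case

for name, probs in [("peaked", peaked), ("flat", flat)]:
    probs = probs / probs.sum()  # renormalize the rounded values
    draws = [tokens[rng.choice(len(tokens), p=probs)] for _ in range(10)]
    print(name, draws)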

The goal of the temperature parameter (‘T’) is to control this deviation, the randomness in the generated content.

Let’s try to understand how we can modify the softmax function so that the output vector does not have one largely dominant probability score (i.e., flattening the probability distribution).

The Softmax function uses eˣ to transform each element of an input vector. This gives us the idea that we need to look into the eˣ function.

Let’s take a 2-dimensional input vector [1, 2] as the simplest example and apply Softmax to it.

When we do the element-wise eˣ transformation, we get [2.72, 7.39].

And the final softmax result would be [2.72/(2.72 + 7.39), 7.39/(2.72 + 7.39)]

= [0.27, 0.73]

Here, the second item dominates by a large margin (0.73 vs 0.27).

However, our goal is to bring the two values in the softmax output closer, ain’t it?

Looking at the eˣ curve, we can speculate that this difference is largely due to the nature of the eˣ curve itself.

With that said, our objective now shifts to tweaking the eˣ function so that the difference between its values at x = 1 and x = 2 is not this high. If |e¹ - e²| is small, the difference in probabilities will also be small, which is essentially what flattening the probabilities means. In other words, we want to reduce the steepness of the eˣ curve.

Well, we can see that the difference between the curve’s values at x = 1 and x = 2 is smaller for the blue curve than for the orange curve.

Orange curve: the eˣ function

Blue curve: the e^(x/2) function; we call the denominator the temperature ‘T’, which equals 2 in this case.

If ‘T’ is increased, the curve becomes less steep, meaning that the probabilities produced are closer together. Below is the link to an interactive Desmos graph of Softmax with temperature; make sure you play around with it to build intuition about controlling the steepness of the eˣ curve.
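In formula form, the only change from the vanilla Softmax is that every logit is divided by T before exponentiating:

\[
\operatorname{softmax}_T(x)_i = \frac{e^{x_i/T}}{\sum_{j=1}^{n} e^{x_j/T}}
\]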

Interactive Desmos graph: temperature_softmax (www.desmos.com)

You can skip this section, as it is calculus-heavy; it serves as the proof of why e^(x/2) has a smaller derivative than eˣ over the range we care about.

We know that the derivative of the eˣ curve is d(eˣ)/dx = eˣ.

To make |e¹ - e²| smaller, we have to lower the derivative of the eˣ curve, i.e., d(eˣ)/dx.

If you know some calculus, d(e^(x/t))/dx = (1/t)·e^(x/t).

We took t = 2, so d(e^(x/2))/dx = 0.5·e^(x/2).

This shows that the function e^(x/2) has a smaller gradient than the original function eˣ at the points we care about. Refer to the blue curve for the tweaked function, e^(x/2).

The Softmax process using the blue curve (T = 2) would be [e^(1/2)/(e^(1/2) + e^(2/2)), e^(2/2)/(e^(1/2) + e^(2/2))] ≈ [0.38, 0.62].

Notice that we had [0.27, 0.73] with vanilla Softmax, but with the temperature T = 2 we get [0.38, 0.62]. Quite an improvement, huh?

We can further flatten the probability distribution by increasing the parameter ‘T’. If T = 1, it essentially becomes the vanilla Softmax; when T is set below 1 (T < 1), the output becomes more predictable. And if you keep increasing the temperature, after a certain point the generated content starts to look like gibberish.

import numpy as np

def softmax(xs):
    # Vanilla Softmax: exponentiate element-wise, then normalize.
    return np.exp(xs) / np.sum(np.exp(xs))

def softmax_t(xs, t):
    # Softmax with temperature: divide the logits by t before exponentiating.
    return np.exp(xs / t) / np.sum(np.exp(xs / t))

xs = np.array([1, 2])
print(softmax(xs))
print(softmax_t(xs, 2))
print(softmax_t(xs, 5))

OUTPUT:
[0.26894142 0.73105858]  # T = 1
[0.37754067 0.62245933]  # T = 2
[0.450166 0.549834]      # T = 5

I asked the Mistral 7B model to write about Nepal. For the first text, the temperature ‘T’ is set to 0.5, and for the second one, I set the temperature ‘T’ to 2.

See the difference

T = 0.5

Nepal is a beautiful and diverse country located in South Asia, nestled between China and India. It is known for its stunning mountain landscapes, including the world’s tallest peak, Mount Everest.

T = 2

In the heart of Nepal, where the temperature often drops to a chilly 4 degrees Celsius in the winter, you can find some of the most breathtaking snow-capped mountains in the world.

Don’t you think the second text has a more creative touch?
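If you want to reproduce this kind of comparison yourself, here is a minimal sketch (assuming the Hugging Face transformers library; the checkpoint name, prompt, and generation settings are illustrative, not the exact ones used for the samples above):

from transformers import pipeline

# Illustrative checkpoint; any instruction-tuned Mistral 7B variant behaves similarly.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

prompt = "Write a short paragraph about Nepal."
low_t = generator(prompt, do_sample=True, temperature=0.5, max_new_tokens=80)
high_t = generator(prompt, do_sample=True, temperature=2.0, max_new_tokens=80)

print(low_t[0]["generated_text"])   # more predictable, "safe" text
print(high_t[0]["generated_text"])  # more varied, occasionally surprising text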

In a nutshell, temperature in physics is a measure of the randomness of molecules, which gives us an idea of the thermal energy contained in a body.

Temperature in AI is also a measure of randomness, but of the generated content; a higher temperature results in a more creative generation.

Follow me for more such content. I write on AI, including Mathematics, Computer Vision, NLP, Data Science, and Probability & Stats.

References:

3Blue1Brown’s YouTube video on Euler’s number.

https://en.wikipedia.org/wiki/Softmax_function


Published via Towards AI
