
How are LLMs creative?

Originally published on Towards AI.

If you’ve used any generative AI models such as GPT, Llama, etc., there’s a good chance you’ve encountered the term ‘temperature’.

For starters, ‘temperature’ is a parameter that controls the creativity of generated content.

But wait, does that sound super nerdy? When I first heard this, I thought there must be some cool physics going on. And why not? That’s what most people associate temperature with.

In this post, I explain temperature in generative AI models, especially LLMs, and show you mathematically how it works and powers creativity in these models.

If you’re unfamiliar with the term ‘token,’ you can think of it as a ‘word’ for simplicity.

Large Language Models (LLMs) belong to a family of autoregressive models. What do I mean by autoregressive models? In layman's terms, these are statistical models that use past values to predict future values. In the case of LLMs, the past values are the tokens seen so far (your input plus anything already generated) and the future values are the tokens yet to be generated.

Note that LLMs can’t generate a whole sentence at once, because they’re autoregressive models designed to predict just the next token. This token is then appended to the input to produce another token, and the chain continues until the end token <EOT> is generated, which signals the model to stop generating.
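To make the loop concrete, here is a minimal sketch of autoregressive generation. Everything in it is a toy assumption: `VOCAB`, `toy_next_token_probs`, and `generate` are made-up names, and the "model" is a hard-coded rule rather than a real LLM. It only illustrates the predict, append, repeat cycle that ends at <EOT>.

```python
import numpy as np

# Hypothetical toy vocabulary; "<EOT>" marks the end of generation.
VOCAB = ["<EOT>", "I", "am", "fine", "thanks"]

def toy_next_token_probs(tokens):
    """Stand-in for an LLM: returns one probability per vocabulary entry.

    The toy rule favours the word that follows the last token in VOCAB
    order, and favours <EOT> after the final word.
    """
    last = VOCAB.index(tokens[-1]) if tokens[-1] in VOCAB else 0
    nxt = (last + 1) % len(VOCAB)
    probs = np.full(len(VOCAB), 0.05)
    probs[nxt] = 1.0 - 0.05 * (len(VOCAB) - 1)
    return probs

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        next_token = VOCAB[int(np.argmax(probs))]  # pick the most likely token
        if next_token == "<EOT>":
            break  # the end token stops generation
        tokens.append(next_token)  # feed the new token back in as input
    return tokens

print(generate(["How", "are", "you", "?"]))
```

A real model would replace `toy_next_token_probs` with a forward pass through the network; the surrounding loop stays the same in spirit.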

To generate the next token, an LLM outputs a list of probabilities for all possible tokens in its vocabulary.

If the vocabulary size of an LLM is 100, it can generate only one of those 100 tokens at a time. It returns a probability score for each of these 100 tokens, signifying how likely that token is to come next in the sequence. In reality, vocabulary sizes are much larger; for instance, Llama 2 has a vocabulary of 32K tokens, and GPT-4’s tokenizer has roughly 100K.

In a nutshell, an LLM takes in an input sequence of tokens, processes it, and outputs a list of probabilities, one for each token in its vocabulary. In the simplest setup (greedy decoding), the token with the highest probability is returned as the next token; more generally, the next token is sampled from this distribution.

Softmax

The task of calculating this list of probabilities is carried out by a softmax layer. LLMs have this ‘magical’ layer as their final layer; it takes in a logit vector (logits are the unnormalized, raw scores associated with each token) and outputs a proper probability distribution.

Let’s momentarily set softmax aside and create our own function to produce probabilities. If we succeed, why use Softmax at all? Our function should meet the following criteria:

1. Take an input vector and produce an equally sized output vector.
2. Ensure that each element in the output vector is non-negative (since probabilities can’t be negative).
3. Reflect that larger input values correspond to larger output values.
4. Ensure that all the elements in the output vector add up to 1 (since it’s a probability distribution).

To satisfy these criteria, we’ll perform an element-wise transformation. The transformation should never produce a negative value, meaning we need a function whose output is always positive. Additionally, it should be monotonically increasing, which ensures that larger input values are always transformed into larger output values.

The exponential function eˣ satisfies conditions 1 to 3.

Our strategy will apply the eˣ function to each element in the input vector, resulting in a vector of the same length but with all positive numbers.

But we missed the most important requirement. Did you figure it out?

The values in the output vector can be larger than 1. On the eˣ curve, y = 1 at x = 0, and y exceeds 1 as soon as x increases beyond 0. This doesn’t work for our case because we need probabilities, which require each value to be at most 1 and all of them to add up to 1.

To solve this issue, we can divide all the elements in the output vector by the sum of all the elements in the output vector. This ensures each value will be smaller than one and all will sum up to 1. This step is known as normalization.
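The two steps above (transform, then normalize) can be checked numerically. This is a small sketch with an arbitrary example vector of my own choosing; the variable names are illustrative only.

```python
import numpy as np

logits = np.array([-2.0, 0.0, 1.0, 3.0])  # arbitrary example scores, sorted ascending

# Step 1: element-wise transformation with e^x.
transformed = np.exp(logits)
same_size = transformed.shape == logits.shape              # criterion 1
all_positive = bool(np.all(transformed > 0))               # criterion 2
order_preserved = bool(np.all(np.diff(transformed) > 0))   # criterion 3
print(same_size, all_positive, order_preserved)
print(transformed.sum())  # well above 1, so criterion 4 is not met yet

# Step 2: normalization, dividing by the sum of the transformed values.
probs = transformed / transformed.sum()
print(probs, probs.sum())  # every value is below 1 and the total is 1
```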

And with that, we’ve arrived at Softmax, haha. Yes, this is exactly the Softmax function; we derived it ourselves to understand it better, rather than just stating it.
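Written as a formula, the exponentiate-then-normalize recipe looks like this. The symbols are my notation rather than the article's: x is the logit vector, n is the vocabulary size, and the subscript i picks out one token.

```latex
\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}},
\qquad i = 1, \dots, n
```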

Note: The transforming function should be differentiable so that the loss can be backpropagated during training. This is yet another reason to choose ‘e’ in Softmax. In fact, I believe calculating the derivative of the exponential function is the easiest thing to do 🙂

Now that we understand how the probabilities are generated in an LLM using the Softmax function, let’s explore how we can introduce some creativity into the model.

Creativity in AI models, really?

To introduce some creativity we need to “flatten” the probability distribution generated by the softmax. What do I mean by flatten?

Let’s try to understand with an example.

Input to the LLM:

“Complete the conversation,

A: Hey, How you doin’?

B:”

The LLM is now tasked with predicting the first token B says.

For simplicity, let’s consider a vocab of only 6 words.

The model processes this input and produces a logit vector, which is converted to probabilities by Softmax. Say the logit vector, which is the input to the softmax layer, is [0.1, 0, 0.5, 1, 4, 0.6] for the tokens [‘Ni Hao’, ‘Konnichiwa’, ‘Hola’, ‘Namaste’, ‘Hello’, ‘Ciao’]. The output of Softmax will then be [0.02, 0.02, 0.03, 0.04, 0.87, 0.03].

Well, the chance of sampling ‘Hello’ (the fifth word in the list) as the next token is 87%. But there’s no fun when one of the probability scores dominates, right? I mean, it’s almost certain that ‘Hello’ will be selected as the output, and whenever the same situation occurs the model is very likely to choose the token “Hello” again. This is very predictable, exactly the opposite of being creative, because there’s little room for randomness and little scope for trying out different yet valid tokens.
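If you want to reproduce this example yourself, a short sketch like the following should give values close to the ones quoted above. The token list and logits come from the example; the helper name `softmax` and the print formatting are my own choices.

```python
import numpy as np

tokens = ["Ni Hao", "Konnichiwa", "Hola", "Namaste", "Hello", "Ciao"]
logits = np.array([0.1, 0.0, 0.5, 1.0, 4.0, 0.6])

def softmax(xs):
    exps = np.exp(xs)          # element-wise e^x
    return exps / exps.sum()   # normalize so the values sum to 1

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:<12}{p:.2f}")

most_likely = tokens[int(np.argmax(probs))]
print("Most likely token:", most_likely)
```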

What if we had a softmax output of roughly [0.13, 0.12, 0.14, 0.15, 0.28, 0.14] for the same tokens [‘Ni Hao’, ‘Konnichiwa’, ‘Hola’, ‘Namaste’, ‘Hello’, ‘Ciao’]?

There’s still a high chance that “Hello” will be sampled but other tokens also have a good chance compared to the last time.

In this case, the model has some scope to deviate from generating the usual “Hello” and try out “Namaste”, “Hola”, etc.; it’s basically incorporating a little randomness. This will have a ripple effect, and the whole conversation might go in the most unexpected way, like B talking in Japanese and A trying to figure out the language spoken by B, who knows. Doesn’t this sound creative to you? For me, there’s no better definition 🙂

The goal of the temperature parameter (‘T’) is to control this deviation, the randomness in the generated content.

Let’s try to understand how we can modify the softmax function so that the output vector does not have one largely dominant probability score (i.e., flattening the probability distribution).

The Softmax function uses eˣ to transform each element of the input vector. This suggests that the place to look is the eˣ function itself.

Let’s take a 2-dimensional input vector [1,2] as the simplest example to understand, and apply Softmax to it.

When we apply the element-wise transformation, we get [2.72, 7.39].

And the final softmax result would be [2.72/(2.72+7.39), 7.39/(2.72+7.39)]

= [0.27, 0.73]

Here, the second item dominates by a large margin of about 46 percentage points.

However, our goal is to bring the two values in softmax output closer, ain’t it?

Looking at the curve, we can speculate that this difference is largely due to the nature of the eˣ curve.

With that said, our objective now shifts to tweaking the function so that the difference between the transformed values of the two inputs, labelled (1) and (2), is not this high. If ||(1) – (2)|| is small, the difference in probabilities will be small too, which is what we mean by flattening the probabilities. In other words, we want to reduce the steepness of the curve.

Well, we can see that the difference ||(1) – (2)|| is smaller in the blue curve than in the orange curve.

Orange curve: the eˣ function

Blue curve: the e^(1/2 x) function. We name the denominator the temperature ‘T’, which is equal to 2 in this case.

If ‘T’ is increased, the curve becomes less steep, meaning that the probabilities produced are closer together. Below is the link to the interactive Desmos graph for Softmax with Temperature; make sure you play around with it to get a better intuition for how ‘T’ controls the steepness of the function.

temperature_softmax (interactive graph on www.desmos.com)
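If you'd rather see the effect in numbers than on a graph, this small sketch computes the gap between the transformed values of the inputs 1 and 2 for a few temperatures. The helper name `gap` and the chosen temperatures are mine, purely for illustration.

```python
import numpy as np

def gap(t, x1=1.0, x2=2.0):
    """Difference between the transformed values e^(x2/t) and e^(x1/t)."""
    return float(np.exp(x2 / t) - np.exp(x1 / t))

for t in (1, 2, 5):
    print(f"T = {t}: gap = {gap(t):.2f}")
```

The gap shrinks as T grows, which is the "less steep curve" from the plot expressed as a number.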

You can skip this section, as it is calculus heavy; it serves as the proof of why e^(1/2 x) has a smaller derivative than eˣ over the range we care about.

We know that the derivative of the curve is d(eˣ)/dx = eˣ.

To make the difference ||(1) – (2)|| smaller, we have to lower the derivative of the curve, i.e. d(eˣ)/dx.

If you know some calculus, d(e^(1/t x))/dx = (1/t) e^(1/t x).

Taking t = 2, d(e^(0.5 x))/dx = 0.5 e^(0.5 x).

Since 0.5 e^(0.5 x) < eˣ whenever x > –2 ln 2 ≈ –1.39, the function e^(0.5 x) has a smaller gradient than the original function eˣ at every such point, which includes all non-negative inputs. Refer to the blue curve for the new, tweaked e^(1/2 x) function.

The Softmax process using the blue curve (T = 2) would be

[e^(1/2)/(e^(1/2) + e^(2/2)), e^(2/2)/(e^(1/2) + e^(2/2))] = [1.65/(1.65 + 2.72), 2.72/(1.65 + 2.72)] = [0.38, 0.62]

Notice that we had [0.27, 0.73] with vanilla Softmax, but with the temperature T = 2 we got [0.38, 0.62]. Quite some progress, huh?

We can further flatten the probability distribution by increasing the parameter ‘T’. If T = 1, it is just the vanilla Softmax. When T is set below 1 (T < 1), the distribution becomes sharper and the output more predictable. And if you keep increasing the temperature, after a certain point the generated content starts to look like gibberish.
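For readers who like formulas, here is Softmax with temperature written out, along with the ratio between any two output probabilities. The notation (x for the logit vector, p for the output, subscripts i, j, k for tokens) is mine, not the article's.

```latex
p_i = \frac{e^{x_i / T}}{\sum_{j} e^{x_j / T}},
\qquad
\frac{p_i}{p_k} = e^{(x_i - x_k)/T}
```

The ratio depends only on the gap between the two logits divided by T. As T grows, the exponent shrinks toward 0 and every ratio approaches 1, so the distribution flattens toward uniform. As T shrinks toward 0, the ratios blow up and nearly all the probability lands on the largest logit. For the vector [1, 2] with T = 2, the ratio is e^(0.5) ≈ 1.65, which matches 0.62/0.38.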

```python
import numpy as np

def softmax(xs):
    # Vanilla Softmax: exponentiate, then normalize.
    return np.exp(xs) / sum(np.exp(xs))

def softmax_t(xs, t):
    # Softmax with temperature: divide the inputs by t first.
    return np.exp(xs / t) / sum(np.exp(xs / t))

xs = np.array([1, 2])
print(softmax(xs))
print(softmax_t(xs, 2))
print(softmax_t(xs, 5))
```

OUTPUT:

```
[0.26894142 0.73105858] #T = 1
[0.37754067 0.62245933] #T = 2
[0.450166 0.549834] #T = 5
```
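To connect this back to the greeting example, the sketch below draws the next token many times at two temperatures and counts what comes out. The logits are the ones from the example; the sample size, the seed, and the names `sample_counts`, `counts_t1`, and `counts_t5` are my own assumptions for illustration.

```python
import numpy as np

tokens = ["Ni Hao", "Konnichiwa", "Hola", "Namaste", "Hello", "Ciao"]
logits = np.array([0.1, 0.0, 0.5, 1.0, 4.0, 0.6])

def softmax_t(xs, t):
    exps = np.exp(xs / t)      # scale the logits by the temperature first
    return exps / exps.sum()

rng = np.random.default_rng(seed=0)  # fixed seed so repeated runs match

def sample_counts(t, n_samples=1000):
    """Sample the next token n_samples times at temperature t and count each token."""
    probs = softmax_t(logits, t)
    draws = rng.choice(len(tokens), size=n_samples, p=probs)
    counts = np.bincount(draws, minlength=len(tokens))
    return dict(zip(tokens, counts.tolist()))

counts_t1 = sample_counts(1.0)
counts_t5 = sample_counts(5.0)
print("T = 1:", counts_t1)
print("T = 5:", counts_t5)
```

At T = 1 nearly every draw should be “Hello”, while at T = 5 the other greetings show up far more often. That spread is the randomness the temperature is dialing in.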

I asked the Mistral 7B model to write about Nepal. For the first text, the temperature ‘T’ is set to 0.5, and for the second one, I set the temperature ‘T’ to 2.

See the difference:

T = 0.5

Nepal is a beautiful and diverse country located in South Asia, nestled between China and India. It is known for its stunning mountain landscapes, including the world’s tallest peak, Mount Everest.

T = 2

In the heart of Nepal, where the temperature often drops to a chilly 4 degrees Celsius in the winter, you can find some of the most breathtaking snow-capped mountains in the world.

Don’t you think the second text has a more creative touch?

In a nutshell, temperature in physics is a measure of the random motion of molecules, which gives us an idea of the thermal energy contained in a body.

Temperature in AI is also a measure of randomness, but of the generated content; a higher temperature results in more creative generation.

Follow me for more such content. I write on AI, including Mathematics, Computer Vision, NLP, Data Science, and Probability & Stats.


