How are LLMs creative?
Author(s): Sushil Khadka
Originally published on Towards AI.
If you've used any generative AI model such as GPT or Llama, there's a good chance you've encountered the term "temperature".
For starters, "temperature" is a parameter that controls the creativity of generated content.
But wait, does that sound super nerdy? When I first heard this, I thought there must be some cool physics going on. And why not? That's what most people associate temperature with.
In this post, I explain temperature in generative AI models, especially LLMs, and show you mathematically how it works and powers creativity in these models.
If you're unfamiliar with the term "token," you can think of it as a "word" for simplicity.
Large Language Models (LLMs) belong to the family of autoregressive models. What do I mean by autoregressive models? In layman's terms, these are statistical models that use past values to predict future values. In the case of LLMs, the past values are the tokens you input and the future values are the generated tokens.
It should be noted that LLMs can't generate a whole sentence at once; they're autoregressive models designed to predict just the next token. That token is then appended to the input to produce another token, and the chain continues until the end-of-text token <EOT> is generated, which signals the model to stop.
To generate the next token, an LLM outputs a list of probabilities for all possible tokens in its vocabulary.
If the vocabulary size of an LLM is 100, it can generate only one of those 100 tokens at a time. It returns a probability score for each of these 100 tokens, signifying how likely that token is to come next in the sequence. In reality, vocabulary sizes are much larger, typically tens of thousands of tokens; Llama 2, for instance, uses a vocabulary of 32K tokens.
In a nutshell, an LLM takes in an input sequence of tokens, processes it, and outputs a list of probabilities for each token in its vocabulary. Usually, the token with the highest probability is returned as the next token in the generation.
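A minimal sketch of this loop in Python, assuming a hypothetical next_token_probs(tokens) helper that stands in for the model and returns a token-to-probability mapping over the vocabulary:

def generate(prompt_tokens, next_token_probs, eot_token="<EOT>", max_new_tokens=50):
    # Start from the input sequence and repeatedly append the most likely next token
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)        # hypothetical model call: dict of token -> probability
        next_token = max(probs, key=probs.get)  # greedy decoding: pick the highest-probability token
        if next_token == eot_token:             # the end-of-text token signals the model to stop
            break
        tokens.append(next_token)
    return tokens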
Softmax
The task of calculating this list of probabilities is carried out by a softmax layer. Every LLM has this "magical" layer as its final layer, which takes in a logit vector (logits are unnormalized, raw scores associated with each token) and outputs a proper probability distribution.
Let's momentarily set softmax aside and create our own function to produce probabilities. Our function should meet specific criteria; if we succeed, then why use softmax at all?
- 1. Take an input vector and produce an equally sized output vector.
- 2. Ensure that each element in the output vector is non-negative (since probabilities can't be negative)
- 3. Reflect that larger input values correspond to larger output values.
- 4. Ensure that all the elements in the output vector add up to 1 (since itβs a probability distribution)
To satisfy these criteria, we'll perform an element-wise transformation. The transformation must produce non-negative outputs, meaning we need a function that always outputs positive values. Additionally, it should be monotonically increasing, which ensures that larger input values are always transformed into larger output values.
The eˣ function satisfies all the above conditions (1 to 3).
Our strategy will be to apply the eˣ function to each element in the input vector, resulting in a vector of the same length but with all positive values.
But we missed the most important requirement. Did you figure it out?
The values in the output vector can be larger than 1. As we can see in the curve, at x = 0, y = 1, and as x increases beyond 0, y exceeds 1. This doesn't work for our case because we need probabilities, which require each value to be at most 1 and all of them to add up to 1.
To solve this issue, we can divide each element of the output vector by the sum of all the elements. This ensures each value is smaller than 1 and that they all sum to 1. This step is known as normalization.
And just like that, we've arrived at softmax. Yes, this is exactly the softmax function; we derived it ourselves to understand it rather than just take it as given.
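Written out for a logit vector z with n entries, the function we just derived is the standard softmax:

\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad i = 1, \dots, n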
Note: The transforming function should also be differentiable so that the loss can be back-propagated during training. This is yet another reason to choose eˣ in softmax; in fact, the derivative of the exponential function is about the easiest one there is.
Now we understand how the probabilities are generated in an LLM using the softmax function. Let's explore how we can introduce some creativity into the model.
Creativity in AI models, really?
To introduce some creativity we need to "flatten" the probability distribution generated by the softmax. What do I mean by flatten?
Let's try to understand with an example.
Input to the LLM:
"Complete the conversation,
A: Hey, how you doin'?
B:"
The LLM is now tasked with predicting the first token B says.
For simplicity, let's consider a vocabulary of only six tokens.
The model processes this input and produces a logit vector, which is converted to probabilities by softmax. Say the logit vector, the input to the softmax layer, is [0.1, 0, 0.5, 1, 4, 0.6] for the tokens ["Ni Hao", "Konnichiwa", "Hola", "Namaste", "Hello", "Ciao"]. The output of softmax is then roughly [0.018, 0.016, 0.026, 0.043, 0.868, 0.029].
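As a quick sanity check, here is a tiny numpy snippet (using the same toy logits) that reproduces these probabilities:

import numpy as np

# Toy logits for the six tokens in our example vocabulary
logits = np.array([0.1, 0.0, 0.5, 1.0, 4.0, 0.6])
probs = np.exp(logits) / np.sum(np.exp(logits))
print(np.round(probs, 3))  # [0.018 0.016 0.026 0.043 0.868 0.029]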
Well, the chance of sampling "Hello" (the fifth token in the list) as the next token is about 87%. But there's no fun when one probability score dominates, right? I mean, it's almost certain that "Hello" will be selected as the output, and whenever the same situation occurs, the model will almost always choose the token "Hello". This is very predictable, exactly the opposite of being creative, because there's no room for randomness and no scope for trying out different yet valid tokens.
What if the softmax output were instead [0.13, 0.12, 0.14, 0.16, 0.30, 0.15] for the same tokens ["Ni Hao", "Konnichiwa", "Hola", "Namaste", "Hello", "Ciao"]?
There's still a high chance that "Hello" will be sampled, but the other tokens now have a much better chance than before.
In this case, the model has scope to deviate slightly from generating the usual "Hello" and try out "Namaste", "Hola", etc.; it's basically incorporating a little randomness. This can have a rippling effect, and the whole conversation might go in the most unexpected direction, like B talking in Japanese and A trying to figure out the language B is speaking, who knows. Doesn't this sound creative to you? For me, there's no better definition.
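To see this concretely, here is a small illustrative sketch that samples ten tokens from each of the two distributions above (the probability values are just the illustrative numbers from this example):

import numpy as np

vocab = ["Ni Hao", "Konnichiwa", "Hola", "Namaste", "Hello", "Ciao"]
peaked = [0.018, 0.016, 0.026, 0.043, 0.868, 0.029]  # the original, dominated distribution
flattened = [0.13, 0.12, 0.14, 0.16, 0.30, 0.15]     # the flattened "what if" distribution

rng = np.random.default_rng(seed=0)
print(rng.choice(vocab, size=10, p=peaked))     # mostly "Hello"
print(rng.choice(vocab, size=10, p=flattened))  # a more varied mix of greetings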
The goal of the temperature parameter ("T") is to control this deviation, the randomness in the generated content.
Let's try to understand how we can modify the softmax function so that the output vector does not have one largely dominant probability score (i.e., flattening the probability distribution).
The softmax function uses eˣ to transform each element of the input vector. This gives us the idea that we need to look into the eˣ function.
Let's take a 2-dimensional input vector [1, 2] as the simplest example and apply softmax to it.
When we do the element-wise eˣ transformation we get [2.72, 7.39].
And the final softmax result would be [2.72/(2.72+7.39), 7.39/(2.72+7.39)]
= [0.27, 0.73]
Here, the second item dominates by a wide margin; it is nearly three times as likely as the first.
However, our goal is to bring the two values in the softmax output closer, ain't it?
Looking at the eˣ curve, we can speculate that this difference is largely due to the steepness of the curve.
With that said, our objective now shifts to tweaking the eˣ function so that the difference between its values at x = 1 and x = 2 is not this large. If |e¹ − e²| is small, the difference in probabilities will be small, which is essentially what flattening the probabilities means. In other words, we want to reduce the steepness of the eˣ curve.
Well, we can see that the difference between the function's values at x = 1 and x = 2 is smaller on the blue curve than on the orange curve.
Orange curve: the eˣ function
Blue curve: the e^(x/2) function. We name the denominator the temperature "T", which equals 2 in this case.
If "T" is increased, the curve becomes less steep, meaning the probabilities produced will be closer together. Below is the link to an interactive Desmos graph of softmax with temperature; make sure you play around with it to build better intuition about controlling the steepness of the eˣ curve.
temperature_softmax — www.desmos.com
You can skip this section, as it is calculus-heavy; it serves as the proof of why e^(x/2) has a smaller derivative than eˣ at the points we care about.
We know that the derivative of the eˣ curve is d(eˣ)/dx = eˣ.
To make the difference between the function's values at x = 1 and x = 2 smaller, we have to lower the derivative of the curve, i.e., d(eˣ)/dx.
If you know some calculus, d(e^(x/t))/dx = (1/t) · e^(x/t).
We took t = 2, so d(e^(0.5x))/dx = 0.5 · e^(0.5x).
This shows that the function e^(0.5x) has a smaller gradient than the original function eˣ for the values that matter here (x ≥ 0). Refer to the blue curve for the tweaked function e^(x/2).
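If you prefer a numerical sanity check over calculus, this small snippet compares the two derivatives at x = 0, 1, 2:

import numpy as np

xs = np.array([0.0, 1.0, 2.0])
print(np.round(np.exp(xs), 3))            # d/dx of e^x at 0, 1, 2 -> [1. 2.718 7.389]
print(np.round(0.5 * np.exp(xs / 2), 3))  # d/dx of e^(x/2) at 0, 1, 2 -> [0.5 0.824 1.359]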
The softmax process using the blue curve (T = 2) would be [e^(1/2)/(e^(1/2) + e^(2/2)), e^(2/2)/(e^(1/2) + e^(2/2))] = [1.65/(1.65 + 2.72), 2.72/(1.65 + 2.72)] ≈ [0.38, 0.62].
Notice that we had [0.27, 0.73] with vanilla softmax, but with temperature T = 2 we get [0.38, 0.62]. Quite an improvement, huh?
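Putting it all together, softmax with temperature simply divides every logit by T before exponentiating:

\mathrm{softmax}_T(z)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}}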
We can further flatten the probability distribution by increasing the parameter "T". If T = 1, it becomes the vanilla softmax, and when T is set below 1 (T < 1), the output becomes more predictable. If you keep increasing the temperature, beyond a certain point the generated content starts to look like gibberish.
import numpy as np

def softmax(xs):
    # Vanilla softmax: exponentiate, then normalize so the values sum to 1
    return np.exp(xs) / np.sum(np.exp(xs))

def softmax_t(xs, t):
    # Softmax with temperature: divide the logits by t before exponentiating
    return np.exp(xs / t) / np.sum(np.exp(xs / t))

xs = np.array([1, 2])
print(softmax(xs))
print(softmax_t(xs, 2))
print(softmax_t(xs, 5))
OUTPUT:
[0.26894142 0.73105858] #T = 1
[0.37754067 0.62245933] #T = 2
[0.450166 0.549834] #T = 5
I asked the Mistral 7B model to write about Nepal. For the first text, the temperature "T" is set to 0.5, and for the second one, I set "T" to 2.
See the difference
T = 0.5
Nepal is a beautiful and diverse country located in South Asia, nestled between China and India. It is known for its stunning mountain landscapes, including the worldβs tallest peak, Mount Everest.
T = 2
In the heart of Nepal, where the temperature often drops to a chilly 4 degrees Celsius in the winter, you can find some of the most breathtaking snow-capped mountains in the world.
Don't you think the second text has a more creative touch?
In a nutshell, temperature in physics is a measure of the randomness of molecular motion, which gives us an idea of the thermal energy contained in a body.
Temperature in AI is also a measure of randomness, but of the generated content, and a higher temperature results in more creative generation.
Follow me for more such content. I write on AI, including Mathematics, Computer Vision, NLP, Data Science, and Probability & Stats.
References:
3Blue1Brown's YouTube video on Euler's number.
https://en.wikipedia.org/wiki/Softmax_function
Published via Towards AI