LLMs – How Do They Work?
Last Updated on June 13, 2024 by Editorial Team
Author(s): Eashan Mahajan
Originally published on Towards AI.
By now, everyone's become well-versed in the role of artificial intelligence. Headlines fill up with news from companies like OpenAI, Microsoft, and Google, and with OpenAI releasing GPT-4o, AI's popularity has only grown.
ChatGPT is also the perfect example of the most popular type of machine learning model: the large language model (LLM). When GPT was released, it sent shockwaves around the world. Experts had been working with LLMs for years, but their true power wasn't widely appreciated until GPT arrived. Now millions of people have used LLMs, yet most have never learned how they actually work.
We'll learn about LLMs step by step: starting with word vectors, then shifting our focus to transformers, and finally concluding with how these gigantic models are trained.
Word Vectors
To communicate with each other, humans developed languages, a form of communication that anyone can learn. Those languages contain words, which possess meaning and add emphasis to what someone says.
Word vectors are a way for machines to understand human language. They allow machines to represent words as numerical values in a continuous vector space. These vectors capture semantic relationships between words based on their usage and context within large texts. This allows for LLMs to process and understand natural language efficiently.
Let's use an example. We'll use the word "dog". An easy way to find the word vector for "dog" is to use the Gensim library, so we'll do just that.
import gensim.downloader as api

# Download a pretrained word2vec model trained on Google News (300 dimensions per word)
word_vectors = api.load('word2vec-google-news-300')

# Look up the 300-dimensional vector for the word "dog"
vec_dog = word_vectors['dog']
print(vec_dog)
Here's the output:
[ 5.12695312e-02 -2.23388672e-02 -1.72851562e-01 1.61132812e-01
-8.44726562e-02 5.73730469e-02 5.85937500e-02 -8.25195312e-02
-1.53808594e-02... 2.22656250e-01]
There are exactly 300 numerical values in the vector that, taken together, represent the word "dog".
Now, each word vector represents a point in an imaginary "word space." Words judged to have similar meanings are placed closer together; for "dog," those neighbors might be "puppy," "animal," and so on. Because word meaning is too complex to capture in two dimensions, machines use vector spaces with hundreds or thousands of dimensions. That's how complex this analysis is.
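We can see this "closeness" directly with Gensim. Here's a quick sketch that reuses the word_vectors model loaded above; the exact neighbors and similarity scores will vary slightly depending on the model version.

# Which words sit closest to "dog" in the vector space?
print(word_vectors.most_similar('dog', topn=5))

# Cosine similarity: semantically related pairs score higher than unrelated ones
print(word_vectors.similarity('dog', 'puppy'))   # relatively high
print(word_vectors.similarity('dog', 'banana'))  # much lower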
However, there remains a critical issue with word vectors: bias. Because these vectors are learned from how humans use words, they end up reflecting many of the biases humans hold. Mitigating bias is a critical part of building a strong machine-learning model, and researchers are actively working on the problem.
Regardless, word vectors have proven to be the most effective way of conveying human language to machines. Encoding the relationships between words lets machines interpret natural language clearly and even use those words in their outputs.
Some words also have multiple vectors. Because vectors are created from the context in which a word appears in a text, the same word can end up with several different vectors. LLMs such as ChatGPT use multiple vectors for the same word so they can use it effectively in various contexts.
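Here is a rough sketch of that idea using contextual embeddings from the Hugging Face transformers library. The model name (bert-base-uncased) and the helper function are just illustrative choices, not anything the original models above require; the point is that the same word gets a different vector in a different sentence.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def word_vector(sentence, word):
    # Return the contextual hidden-state vector for the first occurrence of `word`
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    return hidden[tokens.index(word)]

vec_river = word_vector('He sat on the bank of the river.', 'bank')
vec_money = word_vector('She deposited cash at the bank.', 'bank')

# The two "bank" vectors differ because the surrounding context differs
print(torch.cosine_similarity(vec_river, vec_money, dim=0))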
Now that we've covered the method that lets machines flexibly represent a word, let's talk about how they accomplish that.
Transformers
What is a transformer? A transformer is a deep learning model architecture designed specifically for analyzing sequential data, and it is especially useful in the field of natural language processing (NLP). Transformers were introduced in the paper "Attention Is All You Need" by Ashish Vaswani and several other researchers at Google.
The main aspect of transformers is the self-attention mechanism. By letting the model weigh the importance of the other words in a sentence when encoding a particular word, self-attention is a vital step in creating an LLM. It allows the model to capture long-range dependencies and contextual information much more effectively than other neural networks, such as RNNs and CNNs. (A minimal sketch of the attention computation follows the component list below.)
In addition to the self-attention mechanism, there are a couple more key components for transformers.
- Embedding Layer: Converts words (better known as tokens) into word vectors
- Positional Encoding: Adds information about the position of each token in the sequence to its embedding, compensating for the transformer's lack of sequential processing
- Layer Normalization & Residual Connections: Techniques that increase training stability and keep very deep networks trainable
- Encoder and Decoder: The encoder processes the input text and extracts context; the decoder generates responses by predicting the next words in a sequence
- Feedforward Neural Networks: Both the encoder and decoder contain FFNNs that apply extra transformations to the processed data, letting them capture more context and deepen their understanding of the language
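To make self-attention less abstract, here is a minimal NumPy sketch of scaled dot-product attention, the core operation behind it. The shapes, random weights, and toy input are purely illustrative.

import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how strongly each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # context-aware representation of each token

# Toy example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (4, 8)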
Take a look at the architecture diagram in "Attention Is All You Need" to see how these pieces fit together.
Powerful LLMs often have close to or over a hundred layers. GPT-1 had 12 layers, GPT-2 had 48, and GPT-3 had 96. LLMs use these layers to keep track of pertinent information and feed it into their responses.
One component I want to focus on is the feedforward neural network. The intention of FFNNs is to capture the whole picture and extract as much information as possible. In this case, the information within a word vector is passed to the network. From there, the FFNN analyzes each word vector and tries to predict the next word in the sequence. It doesn't analyze the words all together; it processes each word one at a time, but the network does have access to information that was copied over by an attention head.
"Attention head? What's that?" Attention heads live inside the multi-head self-attention sub-layer. Instead of using a single attention mechanism, transformers use several in parallel. Each head has a different focus, attending to different parts of the sequence. This lets the model capture diverse relationships and correlations within small parts of the sequence instead of trying to focus on everything at once.
In GPT-3, with its 96 layers, there are also 96 attention heads per layer, each with a dimension of 128. The number of heads chosen for a model depends on the model's size and the complexity of the tasks it's assigned. Overall, though, the more heads a model has, the more nuanced the relationships it can catch.
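Those numbers fit together neatly: 96 heads of dimension 128 multiply out to GPT-3's model width of 12,288. Here is a small sketch of how a sequence gets split so that each head works on its own slice; the sequence length is arbitrary.

import numpy as np

n_heads, head_dim, seq_len = 96, 128, 10
d_model = n_heads * head_dim                         # 96 * 128 = 12288

X = np.zeros((seq_len, d_model))                     # placeholder token representations
# Reshape so each head sees its own 128-dimensional slice of every token
per_head = X.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)
print(per_head.shape)                                # (96, 10, 128): one (seq_len, head_dim) block per head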
Here are the steps a model goes through when it uses the multi-head attention mechanism (a rough sketch of one such block follows the list):
- First, the user inputs the text they want analyzed and expects a response to.
- The sequence eventually enters the multi-head self-attention sub-layer, where the attention heads reside. The sequence is split up according to how many attention heads the sub-layer has.
- Here, residual connections and layer normalization are applied to increase the training stability of the model.
- Next, the sequence enters the FFNN layer and goes through the process I described above.
- Then residual connections and layer normalization are applied once more.
- Finally, a response is output for the user.
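Here is a rough PyTorch sketch of that flow inside a single transformer block: self-attention, then residual connection and layer norm, then the feedforward network, then residual and layer norm again. The hyperparameters are arbitrary, and real models add dropout, masking, and positional encodings that are omitted here for brevity.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + attn_out)           # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))        # feedforward network + residual + layer normalization
        return x

block = TransformerBlock()
tokens = torch.randn(1, 10, 512)               # (batch, sequence length, model dimension)
print(block(tokens).shape)                     # torch.Size([1, 10, 512])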
Training an LLM
OK, now that we've discussed word vectors and transformers, let's talk about how to train an LLM. Let's go step by step.
Data Collection and Preprocessing
First, you're going to need to collect a massive amount of data from all types of sources: books, articles, websites, forums, and movies. Anything you can get your hands on can be used to train your LLM. As you can imagine, the amount of data required to train an LLM is enormous; just for GPT-3, OpenAI needed hundreds of gigabytes to terabytes of text. While quantity is important, the quality of the data is essential. Feeding the model false or biased data will result in inaccurate responses and force you to train the model again.
Next, you're going to want to preprocess the data. Convert the text into tokens, the basic units of text that the model processes. A token can be a word, a sub-word, a character, or a punctuation mark, and tokenization is the process of breaking text down into tokens. Byte Pair Encoding (BPE) is used for models such as ChatGPT. Normalization is an important step here as well, typically stripping special characters and extra whitespace and, in some pipelines, lowercasing the text.
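To see BPE in action, here is a quick sketch using OpenAI's tiktoken library, assuming it is installed; cl100k_base is just one of its publicly available encodings.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Large language models tokenize text before training.")
print(ids)                                   # the integer token ids
print([enc.decode([i]) for i in ids])        # the text piece behind each id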
Afterward, split the data into three sections: training, validation, and testing. Shuffle the data to ensure each section is representative of the entire dataset. Then encode all of that text, turning it into numerical representations or word vectors.
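A simple sketch of the shuffle-and-split step, using a toy stand-in corpus and a common 80/10/10 split; the split ratios are a design choice, not a rule.

import random

documents = [f"document {i}" for i in range(1000)]   # stand-in for your tokenized texts
random.seed(42)
random.shuffle(documents)                            # shuffle so each split is representative

n = len(documents)
train_set = documents[:int(0.8 * n)]
val_set = documents[int(0.8 * n):int(0.9 * n)]
test_set = documents[int(0.9 * n):]
print(len(train_set), len(val_set), len(test_set))   # 800 100 100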
Model Architecture Design
After all of this data has been preprocessed, transition into working on your model architecture design. Decide the number of transformer layers, the number of attention heads for each multi-head self-attention mechanism (remember: the more heads you have, the more likely it is that your model will find complex patterns), and the size of the FFNN within each transformer layer.
Lastly, configure the dimension for token embeddings. Be careful here: the bigger the dimension, the more nuanced information the model can capture, but the computational cost greatly increases as well.
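One way to keep these decisions in one place is a configuration object. The numbers below are purely illustrative, not a recipe for any particular model.

# Hypothetical architecture configuration; every value here is a design choice
config = {
    "n_layers": 12,          # number of transformer layers
    "n_heads": 12,           # attention heads per multi-head self-attention block
    "d_model": 768,          # token embedding dimension
    "d_ff": 3072,            # hidden size of each feedforward network (often 4 * d_model)
    "vocab_size": 50257,     # number of distinct tokens the tokenizer produces
    "context_length": 1024,  # maximum sequence length the model attends over
}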
Training Process
There are five things you want to make sure you do during the training process (a condensed sketch of a training loop follows the list).
- Doing a forward pass. A forward pass is the process of pushing input data through multiple layers of a neural network in order to compute the output. It involves applying various linear and non-linear transformations to the input data. You want to do a forward pass for each batch of data.
- Loss calculations. Using an appropriate loss function, such as cross-entropy loss, which measures the difference between the true probability distribution and the predicted probability distribution.
- Doing a backward pass. A backward pass computes the gradients of the loss with respect to the model's weights, which a gradient descent algorithm (or something similar) then uses to update them. Together, a forward and backward pass make up one iteration, not to be confused with one epoch. An epoch is one pass over the entire dataset in batches and contains multiple iterations.
- Optimization. You want to make sure youβre constantly updating the model parameters using an optimization algorithm such as the Adam Optimization Algorithm.
- Learning Rate. Make sure to adjust the learning rate over time.
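Here is a condensed PyTorch sketch that covers all five steps. The tiny model and random token batches are stand-ins so the loop runs end to end; in practice you would plug in your transformer and real tokenized data.

import torch
import torch.nn as nn

# Toy stand-in model and synthetic batches of token ids
vocab_size, seq_len = 100, 16
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
data = [(torch.randint(0, vocab_size, (8, seq_len)),
         torch.randint(0, vocab_size, (8, seq_len))) for _ in range(10)]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=len(data))
loss_fn = nn.CrossEntropyLoss()

for inputs, targets in data:                                        # each batch = one iteration
    logits = model(inputs)                                          # 1. forward pass
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))  # 2. loss calculation
    optimizer.zero_grad()
    loss.backward()                                                 # 3. backward pass (compute gradients)
    optimizer.step()                                                # 4. optimization step (AdamW here)
    scheduler.step()                                                # 5. learning-rate adjustment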
After you've done this multiple times (you won't be able to accomplish it in one go; you're going to have to keep iterating), apply regularization techniques such as dropout, which help prevent overfitting. There are multiple regularization techniques you can use, so make sure to do some research into them.
Validation
During the training process, make sure you evaluate the model on your validation set every once in a while to monitor the performance. This will allow you to see what changes need to be made.
If need be, use early stopping to prevent overfitting. Overfitting is likely happening if your validation loss starts to increase while your training loss keeps decreasing.
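A simple early-stopping sketch: stop once validation loss has failed to improve for a set number of checks in a row. The validation losses below are made-up numbers standing in for real evaluations of your model.

val_losses = [2.9, 2.5, 2.2, 2.1, 2.15, 2.2, 2.3]   # hypothetical per-epoch validation losses

best_val_loss = float("inf")
patience, bad_epochs = 2, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss                     # new best: reset the counter
        bad_epochs = 0
    else:
        bad_epochs += 1                              # no improvement this check
        if bad_epochs >= patience:
            print(f"Stopping early at epoch {epoch}: validation loss is rising")
            break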
Once training is complete, evaluate the model on the test set to measure its performance, using metrics appropriate to the task, such as accuracy or perplexity.
Conclusion
Well, there you have it! That's an introduction to how LLMs work and how you can create one. There is still a lot of research being done on LLMs, and continuous improvements are being made. It's always a good idea to try an LLM course online and maybe even build one yourself. Large language models such as ChatGPT have shown they can revolutionize the technology industry, and I highly recommend looking into them more. For now, that's all I've got for you, and thanks for reading!