Large Language Model (LLM)🤖: In and Out
Last Updated on June 3, 2024 by Editorial Team
Author(s): JAIGANESAN
Originally published on Towards AI.
Delving into the Architecture of LLM: Unraveling the Mechanics Behind Large Language Models like GPT, LLAMA, etc.
In this article, we're going to explore the fundamental architecture of large language models built on the transformer architecture.
Before we begin, I assume you have a basic understanding of embeddings and neural networks. If you're not familiar with how neural networks work, I recommend checking out my previous article on the topic.
My version of the neural network (linear layer) illustration might differ from what you have studied before, so consider reading my article about neural networks first.
Exploring Neural Networks: Fresh Perspective
Change your perspective on Neural Networks (MLP)! (medium.com)
Note: When I refer to the "Decoder Layer" in some places, it doesn't mean that it's a decoder component or that it decodes something. The Large Language Model (LLM) follows the decoder architecture of the transformer (or, equivalently, an encoder topped with a linear and softmax layer), which is why I'm using the term decoder layer. I also recommend reading the article from start to end in sequential order, without jumping between sections; this will help you follow the concepts and flow more clearly. It will be a long article, but it will definitely make sense 😉.
Simple LLM Architecture ✌️
To make things easier to understand, I've created a simple large language model (LLM) architecture [5] as an example.
Here are the details: our model has an embedding size of 1024, a hidden layer size of 4096 in the feedforward network, 8 heads in the multi-head attention mechanism, and skip connections with RMS norm, and I've used the ReLU activation function.
Throughout this article, I'll use the sentence "Artificial intelligence is not a substitute for human intelligence", which consists of 9 tokens, to illustrate the architecture.
Understanding the Simple LLM Architecture 🔓
Take a look at Image 1, which shows a simple large language model (LLM) architecture. The "L" in the diagram represents the number of decoder layers in the architecture, but in this article, we'll focus on the functionality of just one decoder layer.
Note: the numbers you see in the images are just random examples and don't reflect the actual values used in the model. They're only there to help illustrate the concept.
Before we dive in, let's define what an LLM is. In simple terms, an LLM is a type of auto-regressive, probabilistic language model that's trained on text data and generates text based on that training. It's built on the transformer architecture.
Example sentence: "Artificial intelligence is not a substitute for human intelligence"
This sentence can be broken down into individual tokens, such as "artificial", "intelligence", "is", "not", "a", "substitute", "for", "human", and "intelligence". Tokens are the smallest units of information, like words, subwords, or characters. In a larger context, the training data can contain millions, billions, or even trillions of tokens.
Another important term to introduce is vocabulary. A vocabulary is a collection of unique words or tokens. In our case, we've assumed a vocabulary size of 13,000. These tokens are then represented in a one-hot format (each token becomes a vector whose value is 1 at its vocabulary index and 0 everywhere else), which means a vocabulary index is assigned to each token, as shown in Image 2.
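Here is a minimal Python sketch of this step, assuming a hypothetical toy token-to-index mapping (the real vocabulary would have 13,000 entries and a proper tokenizer):

```python
import numpy as np

# Hypothetical toy mapping: only the tokens of our example sentence.
vocab = {"artificial": 0, "intelligence": 1, "is": 2, "not": 3, "a": 4,
         "substitute": 5, "for": 6, "human": 7}
vocab_size = 13_000  # assumed vocabulary size from the article

tokens = "artificial intelligence is not a substitute for human intelligence".split()
token_ids = [vocab[t] for t in tokens]  # one vocabulary index per token

# One-hot representation: a (9, 13000) matrix with a single 1 in each row.
one_hot = np.zeros((len(tokens), vocab_size))
one_hot[np.arange(len(tokens)), token_ids] = 1.0

print(token_ids)      # [0, 1, 2, 3, 4, 5, 6, 7, 1] -> "intelligence" repeats index 1
print(one_hot.shape)  # (9, 13000)
```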
Word Embedding 👶
Now, let's talk about embeddings, which are a crucial part of the process. You might wonder, can't we just represent words as index numbers? Well, yes, we could, but that approach would lead to a lot of errors, hallucinations, and incorrect sentences in generated responses.
Embeddings are a method of converting text or tokens into numerical representations, typically vectors, as shown in Image 2. These numerical representations capture the semantic meaning of words, enabling the algorithm to process and analyze text more effectively.
For instance, the words "human" and "dog" might be in different positions in the vocabulary, but in vector space, they'd be somewhere near each other. You can see this illustrated in Image 3 and Image 4.
Initially, word embeddings are generated randomly, as shown in Image 3. During the training process, these embeddings are adjusted based on the input data. Think of it like this: with each batch of data, the embeddings are updated or changed to better represent the words in the embedding space (Image 4).
To be clear, the embeddings don't just magically move around: the embedding model (a neural network) learns to represent the words in a multi-dimensional vector space in a way that captures their meaning and relationships. This process happens over the course of training as the model is exposed to more and more text data.
Note: Image 3 and Image 4 represent the embedding space in two dimensions. In our case, the model dimension is 1024. Fun fact: our human brain is capable of understanding or observing only three dimensions.
Positional Embedding
Word embeddings are great, but they're missing one important piece of information: the position of the word/token in the sentence. To fix this, we need to add positional embeddings. Positional embeddings help the model understand the order of tokens, since transformers do not inherently capture sequential information.
These can be either absolute or relative, but their job is to inject positional information into the input embedding.
Image 5 shows the formula for calculating positional embeddings. Each dimension of the positional embedding corresponds to a sinusoid. The good news is that this positional embedding is absolute and doesn't change from batch to batch; the same positional embedding is used for all batches because it only encodes positions.
To break down the formula, "pos" stands for the position, "i" indexes the dimension in the embedding, and "d_model" is the model's dimension, which is 1024 in our case.
I've tried to put this formula into practice by creating positional embeddings for 9 tokens in Excel, as shown in Image 6. Each embedding has 1024 dimensions.
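For reference, the sinusoidal formula from the original transformer paper [1] is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Here is a small NumPy sketch of it (the function name is my own, not from the article):

```python
import numpy as np

def sinusoidal_positional_embedding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10_000, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine
    return pe

pos_emb = sinusoidal_positional_embedding(seq_len=9, d_model=1024)
print(pos_emb.shape)   # (9, 1024)
```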
Adding Positional Information to Word Embeddings 👱
We've finally reached the point where we can create our input embeddings! To do this, we simply add the word embedding and positional embedding together. This input embedding will be the input to our LLM decoder layers.
Here's the interesting part: even though the word embedding is the same for the same word, the input embedding is not. This is because the positional embedding adds a unique twist to each word based on its position in the sentence, as shown in Image 7.
This allows the model to capture both the semantic meaning of the words (through word embeddings) and their positional information in the sentence (through positional embeddings).
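Continuing the sketch above (it reuses sinusoidal_positional_embedding from the previous block), the combination is a simple element-wise sum; the word embeddings here are random stand-ins for a learned (13,000, 1024) lookup table:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in word embeddings: in the real model these rows come from a learned
# (13000, 1024) embedding table; here they are random, as in Image 3.
word_emb = rng.normal(scale=0.02, size=(9, 1024))

pos_emb = sinusoidal_positional_embedding(seq_len=9, d_model=1024)

input_emb = word_emb + pos_emb   # (9, 1024) -> input to decoder layer 1
print(input_emb.shape)
```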
Understanding Self-Attention Mechanism 👻
Now that we've covered input embeddings, it's time to dive into the self-attention mechanism. To grasp multi-head attention, you first need to understand self-attention clearly.
Self-attention has three main components: Query, Key, and Value. These three components are identical and come from the previous decoder layer or input embeddings.
So, what is self-attention? In simple terms, self-attention is a process where the input sequence is compared against itself to determine the importance of other words or tokens with respect to the current word. This allows the model to capture dependencies between words within the sequence, regardless of their positions. As a result, the model can understand the semantic and syntactic relationships between words in the input sequence.
As shown in Image 9, the Query and the transposed Key are multiplied (matrix multiplication) to create a matrix of size (9, 9). In this matrix, each row holds the attention information of one word with respect to every word in the sequence.
Now, let's talk about a crucial concept: causal masking. What is causal masking, and why do we apply it? Causal masking is a method that sets certain values to -infinity. Since our LLM is an auto-regressive model, which means it generates the next word based on the context (previous tokens), we need to ensure that each token only has attention information about itself and the previous tokens.
To achieve this, we apply causal masking, which changes the information about the next tokens to -infinity. This is shown in Image 10.
By applying causal masking, we ensure that during training and inference, the next context/token is generated based on the current context/tokens. Causal masking ensures that the model cannot cheat by looking ahead during training, preserving the autoregressive property.
Calculating Attention Scores 🍀
Now that we have our matrix of size (9, 9) with the causal mask applied, it's time to calculate the attention scores. To do this, we divide the matrix by the square root of the Key dimension, which is a common practice in attention mechanisms (Image 8).
Next, we apply the softmax function to the matrix, which converts the logits into probabilities (Image 11) that sum to one along each row. After applying the softmax function, the -infinity values become zero (or near zero), so the current token has no information about future tokens.
This attention score matrix shows how one word is related to others in the sequence. In multi-head attention, each attention head captures different aspects of the words, such as grammar, sequence of words, semantic relationships, and syntactic relationships.
As shown in Image 12, we multiply the attention score matrix with the Value component, resulting in a matrix of size (9, 1024). This is the output of the self-attention layer.
The process of multiplying the attention scores with the Values is crucial, as it combines the information about how the words are related to each other based on the attention scores. This information is then fed into a feedforward neural network, which further processes the output to generate the final representation of the input sequence.
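Here is a minimal single-head sketch of this whole self-attention step, with the causal mask and row-wise softmax (my own simplified code, assuming Q, K, and V are identical, as in Image 8):

```python
import numpy as np

def causal_self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention: Q = K = V = x, with a causal mask."""
    seq_len, d_model = x.shape
    q, k, v = x, x, x                                     # identical, as in Image 8

    scores = q @ k.T / np.sqrt(d_model)                   # (seq_len, seq_len)

    # Causal mask: entries above the diagonal (future tokens) become -inf.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)

    # Row-wise softmax: -inf turns into 0 probability, each row sums to 1.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v                                     # (seq_len, d_model)

x = np.random.randn(9, 1024)             # stand-in for the input embeddings
print(causal_self_attention(x).shape)    # (9, 1024)
```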
Multi-Head Attention 👻 👻
As I mentioned earlier, attention has three components: Query, Key, and Value. But in multi-head attention, we don't stop at one attention mechanism. Instead, we create multiple attention mechanisms, eight in our case. This means we have eight Query, Key, and Value tensors, each with a size of (9, 128), as shown in Image 14.
The "Linear" image shows the linear transformation in a neural network: x is the input matrix (Query, Key, Value, or the output/input tensor in our case), A is the weight matrix (W1, W2, W3 in our case), and b is the bias matrix.
You might have noticed that the size has been reduced from 1024 to 128. That's because we're using eight heads, and we divide the model dimension (1024) by the number of heads (8) to get 128.
So, how are these tensors created? We multiply the input with eight weight matrices (parameter matrices), one for each head, to create the Query, Key, and Value vectors. For example, to create the eight Query tensors, we multiply the input with eight query weight matrices, each with a size of (1024, 128), resulting in eight Query tensors with a size of (9, 128).
These weight matrices are learnable parameters that are updated based on the loss, which helps the model capture different aspects of the input sequence in multi-head attention.
Now, you might be wondering what happens in each of these eight heads. Well, it's similar to the self-attention mechanism we discussed earlier. In each head, we multiply the Query and transposed Key vectors, divide by the square root of the Key dimension, and then apply the softmax function to get the attention scores. These attention scores are then multiplied with the Value vector in each head.
This process gives us eight attention outputs (Image 16), each capturing different aspects of the input sequence.
Concatenating Multi-Head Attention Outputs 🚧
Now that we have eight attention output tensors, each capturing different aspects of the input sequence, we concatenate them as shown in Image 13 and Image 17. This results in a single output tensor with a size of (9, 1024), which is the same as the self-attention output and Model Input. However, this multi-head attention output is much richer in information, as it combines the insights from all eight attention heads.
The concatenated output is then multiplied with a weight matrix W_O of size (1024, 1024), as shown in Image 17. This final linear transformation produces the attention layer output, which still has a size of (9, 1024). This output now provides a more comprehensive understanding of the input sequence.
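Putting the pieces together, here is a rough NumPy sketch of the whole multi-head attention block described above (the weight shapes follow the article; the random initial values and the function name are illustrative assumptions):

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads=8):
    """Per-head Q/K/V projections of size d_model/n_heads, causal attention
    in each head, concatenation, then the final W_O projection."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads                              # 1024 / 8 = 128

    head_outputs = []
    for h in range(n_heads):
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]         # each (9, 128)

        scores = q @ k.T / np.sqrt(d_head)                   # (9, 9)
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)             # causal mask
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)

        head_outputs.append(weights @ v)                     # (9, 128)

    concat = np.concatenate(head_outputs, axis=-1)           # (9, 1024)
    return concat @ w_o                                      # (9, 1024)

rng = np.random.default_rng(0)
n_heads, d_model, d_head = 8, 1024, 128
w_q = rng.normal(scale=0.02, size=(n_heads, d_model, d_head))
w_k = rng.normal(scale=0.02, size=(n_heads, d_model, d_head))
w_v = rng.normal(scale=0.02, size=(n_heads, d_model, d_head))
w_o = rng.normal(scale=0.02, size=(d_model, d_model))

x = rng.normal(size=(9, d_model))
print(multi_head_attention(x, w_q, w_k, w_v, w_o).shape)     # (9, 1024)
```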
Entering the Feed Forward Network 🐾
Now that we've explored the attention mechanism, it's time to dive into the feed-forward network (FFN). The FFN consists of one hidden layer and one output layer (Image 18).
As I mentioned earlier, the hidden size of the FFN layer is 4096. This feed-forward network is essentially a neural network, and Image 19 illustrates the operations involved.
So, why do we need a feed-forward network layer? The attention layer helps capture the semantic and syntactic relationships between words in the sequence, but the feed-forward network layer learns to represent these relationships in a way that enables the large language model to generate text during inference.
Hereβs what happens: the FFN input with a size of (9, 1024) is multiplied with a weight matrix W1 of size (4096, 1024), resulting in an output size of (9, 4096). A bias vector b1 of size (1, 4096) is added (broadcasted to each row), and the ReLU activation function is applied to introduce non-linearity in the neural network (Image 20).
If you're not familiar with this operation, I again recommend checking out my article Exploring Neural Networks: Fresh Perspective 😉 for a better understanding.
The hidden layer output size of (9, 4096) is then multiplied with a weight matrix W2 of size (1024, 4096), resulting in an output size of (9, 1024). Finally, a bias matrix b2 of size (1, 1024) is added, resulting in the FFN layer output with a size of (9, 1024) as shown in Image 21.
And that's it! The final output of the feed-forward network is a tensor with a size of (9, 1024). This is the output of Decoder Layer 1.
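As a quick sketch, the two FFN steps described above look like this (the weight shapes follow the article; the random initial values are placeholders):

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """FFN sketch: (9, 1024) -> hidden (9, 4096) with ReLU -> back to (9, 1024).
    W1 is (4096, 1024) and W2 is (1024, 4096), as in the article."""
    hidden = x @ w1.T + b1           # (9, 4096); bias broadcast to each row
    hidden = np.maximum(hidden, 0)   # ReLU non-linearity
    return hidden @ w2.T + b2        # (9, 1024)

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.02, size=(4096, 1024)); b1 = np.zeros(4096)
w2 = rng.normal(scale=0.02, size=(1024, 4096)); b2 = np.zeros(1024)

x = rng.normal(size=(9, 1024))            # stand-in for the attention output
print(feed_forward(x, w1, b1, w2, b2).shape)   # (9, 1024)
```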
Note that the same operation will be repeated for all L layers in the decoder. Each layer will have its own multi-head attention mechanism, followed by a feed-forward network, and the output of each layer will be used as the input to the next layer.
Skip Connection and RMS Norm 👴
Before we wrap up our discussion on the decoder layer, I want to quickly mention two important concepts: skip connections and RMS norm.
Why do we use skip connections? In simple terms, skip connections help preserve the input information as it flows through the layer, so it is not changed too much. This is done by adding a layer's input to its output. For example, the attention layer's input and output are added together, and the result becomes the FFN's input.
Image 22: RMS Norm, where g_i is a learnable parameter.
Root Mean Square Normalization (RMS Norm) [6] is a technique used to normalize vectors while preserving the vector's direction and controlling its magnitude. This normalization can help stabilize and improve the training of large language models.
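In formula form, RMSNorm rescales each vector x by its root mean square, RMS(x) = sqrt(mean(x_i^2)), and then multiplies by the learnable gain g. A small sketch (the eps value and the initialization of g are my assumptions):

```python
import numpy as np

def rms_norm(x: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm [6]: x / RMS(x) * g, with RMS(x) = sqrt(mean(x**2))."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g   # g is the learnable gain, one value per dimension

x = np.random.randn(9, 1024)
g = np.ones(1024)             # typically initialised to 1 and learned during training
print(rms_norm(x, g).shape)   # (9, 1024)
```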
With that, we've covered the basics of one decoder layer in a transformer-based large language model!
Let's do some math to calculate the number of parameters in our one decoder layer.
Multi-Head Attention Parameters
We have 3 weight matrices (Query, Key, Value) with 8 heads each, and each matrix has a dimension of 1024 x 128.
= 3 x 8 x 1024 x 128
= 3,145,728
One Linear Weight Matrix W_O in multi-head attention
We also have one linear weight matrix W_O with a dimension of 1024 x 1024.
= 1,048,576
Feed-Forward Network Parameters
The hidden layer (FFN) has a weight matrix with a dimension of 4096 x 1024, and a bias matrix with 4096 parameters.
= (4096 x 1024) + 4096
= 4,198,400
The output layer(FFN) has a weight matrix with a dimension of 1024 x 4096, and a bias matrix with 1024 parameters.
= (1024 x 4096) + 1024
= 4,195,328
Total Parameters in One Decoder Layer
Adding up all the parameters, we get:
= 3,145,728 + 1,048,576 + 4,198,400 + 4,195,328
= 12,588,032 (approximately 12.6 million parameters)
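If you want to verify the arithmetic yourself, here is the same calculation as a few lines of Python:

```python
d_model, n_heads, d_head, d_ff = 1024, 8, 128, 4096

qkv_params = 3 * n_heads * d_model * d_head   # Query, Key, Value: 3,145,728
w_o_params = d_model * d_model                # output projection W_O: 1,048,576
ffn_hidden = d_ff * d_model + d_ff            # FFN hidden layer: 4,198,400
ffn_output = d_model * d_ff + d_model         # FFN output layer: 4,195,328

total = qkv_params + w_o_params + ffn_hidden + ffn_output
print(total)   # 12,588,032 parameters in one decoder layer
```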
That's a lot of parameters in one decoder layer! 😲 And we haven't even counted the parameters in the embedding model and the final linear layer. This is why large language models can have billions of parameters, like 8 billion, 70 billion, etc.
Linear layer and softmax function 🌳
Now that we've got the basics covered, let's dive into how inference works and explore the roles of the linear layer and the softmax function.
To start, the linear layer (a.k.a. the linear projection layer) takes the output from the last decoder layer and projects it onto a new space. In our case, the last decoder layer's output is a matrix of size (9, 1024), which is then multiplied by the weight matrix of the linear layer, sized (13000, 1024). This results in an output of size (9, 13000), where 13000 is the vocabulary size.
To simplify things, as shown in Image 23 and Image 24, let's focus on just one row of the input, say (1, 1024). This row contains information about the current token, "artificial". When we multiply this row with the weight matrix, we get a set of logits sized (1, 13000).
The softmax function is then applied to these logits to determine the most probable next token. In our example, the predicted token is "intelligence". During training, the loss is calculated by comparing this predicted probability distribution with the actual next token.
During inference, this process is repeated for each token in the sequence. The first two words, artificial and Intelligence, are fed through all the layers again to predict the third word. Then, these first three words are fed through the layers to predict the fourth word, and so on, until we reach the end of the sequence.
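To make the loop concrete, here is a toy sketch of one generation step: project the latest token's hidden state through the linear layer, apply softmax, and pick the most probable token (greedy decoding here is just one simple choice; real models often sample instead):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, d_model = 13_000, 1024

w_lm = rng.normal(scale=0.02, size=(vocab_size, d_model))  # linear (projection) layer

decoder_out = rng.normal(size=(9, d_model))  # stand-in for the last decoder layer's output
logits = decoder_out[-1:] @ w_lm.T           # (1, 13000) logits for the latest token
probs = softmax(logits)                      # probabilities over the vocabulary
next_token_id = int(probs.argmax())          # greedy pick of the next token

print(probs.shape, next_token_id)
```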
Each token is generated based on the previously generated tokens and context. The virtual tokens [SOS](input) and [EOS](output) help during inference to identify the beginning and end of the sequence (Image 25).
Note: The weights and biases in the multi-head attention, FFN, and linear layer are learnable parameters, meaning they get updated during backpropagation. During training, the model adjusts these weights and biases to minimize the error and improve the LLM's performance. It's all neuron operations; I have just represented them in a different way.
During training, the virtual tokens [SOS] and [EOS] in image 25 are not universally necessary for all types of LLM training. This is because the model learns from the context of the training data and calculates its loss based on the sequence itself. It doesnβt need explicit markers to show where a sentence starts and ends. But, during inference, these tokens help indicate the start and end of the input sequence.
I know it can be tough to wrap your head around, so I'm trying to explain it in a way that's easy to understand.
I believe I have made some sense of the large language model architecture and its functions. If you found my article useful 👍, give it a 👏! Feel free to follow for more insights. If something doesn't click, take some time to read it again; it will definitely make sense.
Let's also stay in touch on 🔗LinkedIn🌏❤️ to keep the conversation going!
References:
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Attention Is All You Need, research paper (2017)
[2] Jay Alammar, Transformer Illustration (2018)
[3] Weights & Biases ML article, Overview of LLM (2023)
[4] Krish Naik, Transformer Architecture Explanation, YouTube video (2020)
[5] LLAMA-2 Explanation (2024)
[6] Biao Zhang, Rico Sennrich, Root Mean Square Layer Normalization, research paper (2019)
Published via Towards AI