
Large Language Model (LLM)🤖: In and Out

Last Updated on June 3, 2024 by Editorial Team

Author(s): JAIGANESAN

Originally published on Towards AI.

Delving into the Architecture of LLM: Unraveling the Mechanics Behind Large Language Models like GPT, LLAMA, etc.

Photo by Tara Winstead: pexels.com

In this article, we’re going to explore the fundamental architecture of large language models built on the transformer architecture.

Before we begin, I assume you have a basic understanding of embeddings and neural networks. If you’re not familiar with how neural networks work, I recommend checking out my previous article on the topic.

My version of the neural network (linear layer) illustration might differ from what you have studied before, so consider reading my article on neural networks first.

Exploring Neural Networks: Fresh Perspective. Change your perspective on Neural Networks (MLP)! (medium.com)

Note: When I refer to a “decoder layer” in some places, it doesn’t mean that the layer decodes something. The Large Language Model (LLM) follows the decoder architecture of the transformer (or, equivalently, an encoder block with a linear and softmax layer on top), which is why I use the term decoder layer. I also recommend reading the article from start to end in sequential order, without jumping between sections; this will help you follow the concepts and the flow more clearly. It will be a long article, but it will definitely make sense 😉.

Simple LLM Architecture ✌️

To make things easier to understand, I’ve created a simple large language model (LLM) architecture [5] as an example.

Here are the details: our model has an embedding size of 1024, a hidden layer size of 4096 in the feed-forward network, 8 heads in the multi-head attention mechanism, and skip connections with RMS norm; I’ve used the ReLU activation function.

Throughout this article, I’ll use the sentence “Artificial intelligence is not a substitute for human intelligence”, which contains 9 tokens, to illustrate the architecture.

Image 1: Created by the author

Understanding the Simple LLM Architecture 🔓

Take a look at Image 1, which shows a simple large language model (LLM) architecture. The “L” in the diagram represents the number of decoder layers in the architecture, but in this article, we’ll focus on the functionality of just one decoder layer.

Note: the numbers you see in the images are just random examples and don’t reflect the actual values used in the model. They’re only there to help illustrate the concept.

Before we dive in, let’s define what an LLM is. In simple terms, an LLM is a type of auto-regressive, probabilistic language model that’s trained on text data and generates text based on that training. It’s built on the transformer architecture.

Image 2: Created by the author

Example sentence: “Artificial intelligence is not a substitute for human intelligence”

This sentence can be broken down into individual tokens: ‘artificial’, ‘intelligence’, ‘is’, ‘not’, ‘a’, ‘substitute’, ‘for’, ‘human’, and ‘intelligence’. Tokens are the smallest units of information: words, subwords, or characters. In a larger context, the training data can contain millions, billions, or even trillions of tokens.

Another important term to introduce is vocabulary. A vocabulary is a collection of unique words or tokens. In our case, we’ve assumed a vocabulary size of 13,000. Each token is assigned a vocabulary index and can be represented in one-hot format (a vector in which the entry at the token’s vocabulary index is one and every other entry is zero), as shown in Image 2.
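To make this concrete, here is a minimal sketch (in PyTorch) of turning the example sentence into vocabulary indices and one-hot vectors. The toy vocabulary and its indices below are made up for illustration; a real tokenizer and its 13,000-entry vocabulary would look different.

```python
# Minimal sketch: mapping tokens to vocabulary indices and one-hot vectors.
# The vocabulary below is a toy example; a real model would have ~13,000 entries.
import torch

vocab = {"artificial": 0, "intelligence": 1, "is": 2, "not": 3, "a": 4,
         "substitute": 5, "for": 6, "human": 7}
vocab_size = 13_000  # assumed vocabulary size from the article

tokens = "artificial intelligence is not a substitute for human intelligence".split()
token_ids = torch.tensor([vocab[t] for t in tokens])          # shape: (9,)
one_hot = torch.nn.functional.one_hot(token_ids, vocab_size)  # shape: (9, 13000)
print(token_ids.shape, one_hot.shape)
```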

Word Embedding 👶

Now, let’s talk about embeddings, which are a crucial part of the process. You might wonder, can’t we just represent words as index numbers? Well, yes, we could, but that approach would lead to a lot of errors, hallucinations, and incorrect sentences in generated responses.

Embeddings are a method of converting text or tokens into numerical representations as shown in Image 2, typically vectors. These numerical representations capture the semantic meaning of words, enabling the algorithm to process and analyze text more effectively.

For instance, the words β€œhuman” and β€œdog” might be in different positions in the vocabulary, but in vector space, they’d be somewhere near each other. You can see this illustrated in Image 3 and Image 4.

Image 3: Created by the author → Initially, embeddings are initialized randomly

Initially, word embeddings are generated randomly, as shown in Image 3. During the training process, these embeddings are adjusted based on the input data. Think of it like this: with each batch of data, the embeddings are updated or changed to better represent the words in the embedding space (Image 4).

To be clear, the embeddings don’t just magically move around: the embedding model (a neural network) learns to represent the words in a multi-dimensional vector space in a way that captures their meaning and relationships. This process happens over the course of training as the model is exposed to more and more text data.

Image 4: Created by the author → Embeddings after training the model

Note: Images 3 and 4 show the embedding space in two dimensions. In our case, the model dimension is 1024. Fun fact: our human brain can only perceive three dimensions.
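A minimal sketch of this idea, assuming a PyTorch-style embedding table: the table starts with random weights (Image 3) and is adjusted during training (Image 4).

```python
# Minimal sketch: a learnable embedding table maps each token id to a 1024-dim vector.
# The weights start random and are updated during training.
import torch

embedding = torch.nn.Embedding(num_embeddings=13_000, embedding_dim=1024)
word_embeddings = embedding(token_ids)   # token_ids from the earlier sketch, shape (9, 1024)
print(word_embeddings.shape)             # torch.Size([9, 1024])
```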

Positional Embedding

Word embeddings are great, but they’re missing one important piece of information: the position of the word/token in the sentence. To fix this, we need to add positional embeddings. Positional embeddings help the model understand the order of tokens since transformers inherently do not capture sequential information.

These can be either absolute or relative, but their job is to inject positional information into the input embedding.

Image 5: Source: Attention Is All You Need research paper (sinusoidal positional embedding formula)

Image 5 shows the formula for calculating positional embeddings. Each dimension of the positional embedding corresponds to a sinusoid. The good news is that this positional embedding is absolute and it doesn’t change for every batch. The same positional embedding will be used for all batches because it only represents positions.

To break down the formula, “pos” stands for position, “i” represents the dimension in the embedding, and “d_model” is the model’s dimension, which is 1024 in our case.

Image 6: Created by the author → Positional embedding vectors

I’ve put this formula into practice by creating positional embeddings for the 9 tokens in Excel, as shown in Image 6. Each embedding has 1024 dimensions.
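Here is a small sketch of how those sinusoidal positional embeddings could be computed, following the formula in Image 5: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name is my own.

```python
# Minimal sketch of the sinusoidal positional embedding from "Attention Is All You Need".
import torch

def sinusoidal_positional_embedding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1)          # (seq_len, 1) token positions
    i = torch.arange(0, d_model, 2)                   # even dimension indices 0, 2, 4, ...
    angle = pos / (10_000 ** (i / d_model))           # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                    # even dimensions -> sine
    pe[:, 1::2] = torch.cos(angle)                    # odd dimensions  -> cosine
    return pe

pos_emb = sinusoidal_positional_embedding(seq_len=9, d_model=1024)  # shape (9, 1024)
```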

Adding Positional Information to Word Embeddings 👱

We’ve finally reached the point where we can create our input embeddings! To do this, we simply add the word embedding and positional embedding together. This input embedding will be the input to our LLM decoder layers.

Image 7: Created by the author → Input Embedding

Here’s the interesting part: even though the word embedding is the same for the same words, the input embedding is not. This is because the positional embedding adds a unique twist to each word based on its position in the sentence, as shown in Image 7.

This allows the model to capture both the semantic meaning of the words (through word embeddings) and their positional information in the sentence (through positional embeddings).
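Continuing the earlier sketches, the input embedding is simply the element-wise sum of the word embedding and the positional embedding:

```python
# word_embeddings is (9, 1024), pos_emb is (9, 1024), both from the earlier sketches.
input_embeddings = word_embeddings + pos_emb   # (9, 1024), the input to the decoder layers
```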

Understanding Self-Attention Mechanism 👻

Now that we’ve covered input embeddings, it’s time to dive into the self-attention mechanism. To grasp multi-head attention, you need to understand self-attention clearly.

Self-attention has three main components: Query, Key, and Value. In self-attention, all three come from the same source: the output of the previous decoder layer or the input embeddings.

Image 8: Source: Attention Is All You Need research paper

So, what is self-attention? In simple terms, self-attention is a process where the input sequence is compared against itself to determine the importance of other words or tokens with respect to the current word. This allows the model to capture dependencies between words within the sequence, regardless of their positions. As a result, the model can understand the semantic and syntactic relationships between words in the input sequence.

Image 9: Created by the author → Query and Key multiplication

As shown in Image 9, the Query and the transposed Key are multiplied (matrix multiplication) to create a matrix of size (9, 9). Each word attends to the other words in the sequence, and each row of this matrix holds that word’s attention information with respect to all the words.

Image 10: Created by the author → Causal mask

Now, let’s talk about a crucial concept: causal masking. What is causal masking, and why do we apply it? Causal masking is a method that sets certain values to -infinity. Since our LLM is an auto-regressive model, meaning it generates the next word based on the context (the previous tokens), we need to ensure that each token only has attention information about itself and the previous tokens.

To achieve this, we apply causal masking, which sets the entries corresponding to future tokens to -infinity. This is shown in Image 10.

By applying causal masking, we ensure that during training and inference, the next token is generated based only on the current context. Causal masking ensures that the model cannot cheat by looking ahead during training, preserving the autoregressive property.

Calculating Attention Scores 🍀

Now that we have our matrix of size (9, 9) with the causal mask applied, it’s time to calculate the attention scores. To do this, we divide the matrix by the square root of the Key dimension, which is a common practice in attention mechanisms (Image 8).

Image 11: Created by the author → Attention score

Next, we apply the softmax function to the matrix, which converts each row of logits into a probability distribution that sums to one (Image 11). After applying the softmax function, the -infinity values become zero (or near zero), so the current token has no information about future tokens.

This attention score matrix shows how one word is related to others in the sequence. In multi-head attention, each attention head captures different aspects of the words, such as grammar, sequence of words, semantic relationships, and syntactic relationships.

Image 12: Created by the author → Self-attention output

As shown in Image 12, we multiply the attention score matrix with the Value component, resulting in a matrix of size (9, 1024). This is the output of the self-attention layer.

Multiplying the attention scores with the Values is crucial, as it combines information about how the words relate to each other, weighted by those scores. This information is then fed into a feed-forward neural network, which further processes the output to generate the final representation of the input sequence.
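Putting the last few steps together, here is a minimal sketch of causal scaled dot-product self-attention (Images 9 to 12), assuming the (9, 1024) input embeddings from the earlier sketches are used directly as Query, Key, and Value:

```python
# Minimal sketch of causal (masked) scaled dot-product self-attention.
import torch

def causal_self_attention(q, k, v):
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5           # (9, 9) similarity matrix
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))      # hide future tokens
    attn = torch.softmax(scores, dim=-1)                  # each row sums to one
    return attn @ v                                       # (9, 1024) attention output

q = k = v = input_embeddings                              # from the earlier sketch
out = causal_self_attention(q, k, v)                      # shape (9, 1024)
```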

Multi-Head Attention 👻 👻

Image 13: Source: Attention Is All You Need research paper → Multi-head attention formula
Image 14: Created by the author → Q, K, V for multi-head attention

As I mentioned earlier, attention has three components: Query, Key, and Value. But in multi-head attention, we don’t just stop at one attention mechanism. Instead, we create multiple attention mechanisms, eight in our case. This means we have eight Query, Key, and Value tensors/vectors, each with a size of (9, 128) as shown in image 14.

Image Linear: Linear function (neural network). Source: pytorch.org

The Image Linear shows the linear transformation in a neural network: x is the input matrix (the Query, Key, Value, or input/output tensor in our case), A is the weight matrix (W1, W2, W3 in our case), and b is the bias matrix.

You might have noticed that the size has been reduced from 1024 to 128. That’s because we’re using eight heads, and we divide the model dimension (1024) by the number of heads (8) to get 128.

So, how are these tensors created? We multiply the input with eight weight matrices (parameter matrices), one for each head, to create the Query, Key, and Value vectors. For example, to create the eight Query tensors, we multiply the input with eight query weight matrices, each of size (1024, 128), resulting in eight Query tensors of size (9, 128).

These weight matrices are learnable parameters that are updated based on the loss, which helps the model capture different aspects of the input sequence in multi-head attention.

Image 15: Created by the author → Multi-head Q, K, V

Now, you might be wondering what happens in each of these eight heads. Well, it’s similar to the self-attention mechanism we discussed earlier. In each head, we multiply the Query and transposed Key vectors, divide by the square root of the Key dimension, and then apply the softmax function to get the attention scores. These attention scores are then multiplied with the Value vector in each head.

This process gives us eight attention outputs (Image 16), each capturing different aspects of the input sequence.

Image 16: Created by the author

Concatenating Multi-Head Attention Outputs 🚧

Image 17: Created by the author

Now that we have eight attention output tensors, each capturing different aspects of the input sequence, we concatenate them as shown in Image 13 and Image 17. This results in a single output tensor with a size of (9, 1024), which is the same as the self-attention output and Model Input. However, this multi-head attention output is much richer in information, as it combines the insights from all eight attention heads.

The concatenated output is then multiplied with a weight matrix W_O of size (1024, 1024), as shown in Image 17. This final linear transformation produces the attention layer output, which still has a size of (9, 1024). This output now provides a more comprehensive understanding of the input sequence.
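Below is a hedged sketch of the full multi-head attention block for our configuration (d_model = 1024, 8 heads of dimension 128), including the split into heads, the causal mask, the concatenation, and the final W_O projection. The class and layer names (w_q, w_k, w_v, w_o) are my own choices, built only from standard PyTorch modules.

```python
# Minimal sketch of causal multi-head self-attention with 8 heads of dimension 128.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # final W_O projection

    def forward(self, x):                                    # x: (seq_len, d_model)
        seq_len, d_model = x.shape
        # project, then split into heads: (n_heads, seq_len, head_dim)
        q = self.w_q(x).view(seq_len, self.n_heads, self.head_dim).transpose(0, 1)
        k = self.w_k(x).view(seq_len, self.n_heads, self.head_dim).transpose(0, 1)
        v = self.w_v(x).view(seq_len, self.n_heads, self.head_dim).transpose(0, 1)
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))     # causal mask per head
        heads = torch.softmax(scores, dim=-1) @ v            # (n_heads, seq_len, head_dim)
        concat = heads.transpose(0, 1).reshape(seq_len, d_model)  # concatenate the heads
        return self.w_o(concat)                              # (seq_len, d_model)

attn_out = MultiHeadSelfAttention()(input_embeddings)        # shape (9, 1024)
```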

Entering the Feed Forward Network 🐾

Now that we’ve explored the attention mechanism, it’s time to dive into the feed-forward network (FFN). The FFN consists of one hidden layer and one output layer (Image 18).

Image 18: Created by the author → FFN

As I mentioned earlier, the hidden size of the FFN layer is 4096. This feed-forward network is essentially a neural network, and Image 19 illustrates the operations involved.

Image 19: Source: Attention Is All You Need research paper (FFN operation)

So, why do we need a feed-forward network layer? The attention layer helps capture the semantic and syntactic relationships between words in the sequence, but the feed-forward network layer learns to represent these relationships in a way that enables the large language model to generate text during inference.

Image 20: Created by the author → FFN hidden layer operation

Here’s what happens: the FFN input with a size of (9, 1024) is multiplied with a weight matrix W1 of size (4096, 1024), resulting in an output size of (9, 4096). A bias vector b1 of size (1, 4096) is added (broadcasted to each row), and the ReLU activation function is applied to introduce non-linearity in the neural network (Image 20).

If you’re not familiar with this operation, I again recommend checking out my article, Exploring Neural Networks: Fresh Perspective 😉, for a better understanding.

Image 21: Created by the author → FFN output layer operation

The hidden layer output size of (9, 4096) is then multiplied with a weight matrix W2 of size (1024, 4096), resulting in an output size of (9, 1024). Finally, a bias matrix b2 of size (1, 1024) is added, resulting in the FFN layer output with a size of (9, 1024) as shown in Image 21.

And that’s it! The final output of the feed-forward network is a tensor with a size of (9, 1024). This is the output of Decoder Layer 1.
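A minimal sketch of this feed-forward network with the sizes used in the article (1024 → 4096 → 1024, with ReLU in between):

```python
# Minimal sketch of the position-wise feed-forward network:
# (9, 1024) -> W1 + b1 -> ReLU -> (9, 4096) -> W2 + b2 -> (9, 1024).
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(1024, 4096),   # W1 of size (4096, 1024) plus bias b1 of size 4096
    nn.ReLU(),               # non-linearity
    nn.Linear(4096, 1024),   # W2 of size (1024, 4096) plus bias b2 of size 1024
)
# In the full layer, the skip connection and normalization discussed below
# would be applied around each sub-layer; this just shows the FFN itself.
ffn_out = ffn(attn_out)      # shape (9, 1024), the output of one decoder layer
```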

Note that the same operation will be repeated for all L layers in the decoder. Each layer will have its own multi-head attention mechanism, followed by a feed-forward network, and the output of each layer will be used as the input to the next layer.

Skip Connection and RMS Norm 👴

Before we wrap up our discussion on the decoder layer, I want to quickly mention two important concepts: skip connections and RMS norm.

Why do we use skip connections? In simple terms, skip connections help preserve the input information throughout the layer without changing it too much. This is done by adding a sub-layer’s input to its output, ensuring that the original information is preserved. For example, the attention layer’s input and output are added together and passed on as the FFN input.

Image 22: Source: RMS Norm research paper (g_i is a learnable parameter)

Root Mean Square Normalization (RMS Norm)[6] is a technique used to normalize vectors while preserving the vector's direction and controlling the vector’s magnitude. This normalization can help stabilize and improve the training of large language models.
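A small sketch of RMS norm and a skip connection around the attention sub-layer, assuming the formula from Image 22. The exact placement of normalization (before or after a sub-layer) varies between models, so this is just one common arrangement:

```python
# Minimal sketch of RMS normalization (Zhang & Sennrich, 2019):
# each vector is divided by its root mean square, then scaled by a learnable gain g_i.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim=1024, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))   # learnable gain, one per dimension
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.g

# skip connection plus normalization around the attention sub-layer (one common arrangement)
norm = RMSNorm(1024)
ffn_input = norm(input_embeddings + attn_out)    # shape (9, 1024)
```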

With that, we’ve covered the basics of one decoder layer in a transformer-based large language model!

Let’s do some math to calculate the number of parameters in our one decoder layer.

Multi-Head Attention Parameters
We have 3 weight matrices (Query, Key, Value) with 8 heads each, and each matrix has a dimension of 1024 x 128.
= 3 x 8 x 1024 x 128
= 3,145,728

One Linear Weight Matrix W_O in multi-head attention
We also have one linear weight matrix W_O with a dimension of 1024 x 1024.
= 1,048,576

The hidden layer(FFN) has a weight matrix with a dimension of 4096 x 1024, and a bias matrix with 4096 parameters.
= (4096 x 1024) + 4096
= 4,198,400

The output layer(FFN) has a weight matrix with a dimension of 1024 x 4096, and a bias matrix with 1024 parameters.
= (1024 x 4096) + 1024
= 4,195,328

Total Parameters in One Decoder Layer
Adding up all the parameters, we get:
= 3,145,728 + 1,048,576 + 4,198,400 + 4,195,328
= 12,588,032 (approximately 12.5 million parameters)
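A quick sanity check of that arithmetic in Python:

```python
# Quick check of the parameter count worked out above for one decoder layer.
d_model, n_heads, head_dim, d_ff = 1024, 8, 128, 4096

qkv   = 3 * n_heads * d_model * head_dim      # 3,145,728
w_o   = d_model * d_model                     # 1,048,576
ffn_1 = d_ff * d_model + d_ff                 # 4,198,400
ffn_2 = d_model * d_ff + d_model              # 4,195,328
print(qkv + w_o + ffn_1 + ffn_2)              # 12,588,032
```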

That’s a lot of parameters in one decoder layer! 😲 And we haven’t even counted the parameters in the embedding model and final linear layer. This is why large language models can have billions of parameters, like 8 billion, 70 billion, etc.

Linear layer and softmax function 🌳

Now that we’ve got the basics covered, let’s dive into how inference works and explore the roles of the linear layer and the softmax function.

To start, the linear layer (aka the linear projection layer) takes the output from the last decoder layer and projects it onto a new space. In our case, the last decoder layer’s output is a matrix of size (9, 1024), which is then multiplied by the weight matrix of the linear layer, sized (13000, 1024). This results in an output of size (9, 13000), where 13000 represents the vocabulary size.

Image 23: Created by the author → Linear layer
Image 24: Created by the author → Inference

To simplify things, as shown in Image 23 and Image 24, let’s focus on just one row of the input, say (1, 1024). This row contains information about the current token, “artificial”. When we multiply this row with the weight matrix, we get a set of logits of size (1, 13000).

The softmax function is then applied to these logits to determine the most probable next token; in our example, the predicted token is “intelligence”. During training, the loss is also calculated from this process.

During inference, this process is repeated for each token in the sequence. The first two words, “artificial” and “intelligence”, are fed through all the layers again to predict the third word. Then these first three words are fed through the layers to predict the fourth word, and so on, until we reach the end of the sequence.
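Here is a hedged sketch of the linear (projection) layer, the softmax, and a greedy autoregressive generation loop. The `decoder` callable and the `eos_id` value are hypothetical stand-ins for the full stack of embedding and decoder layers described above, and a real model would often sample from the probabilities rather than always taking the argmax.

```python
# Minimal sketch of the linear projection, softmax, and greedy next-token generation.
import torch
import torch.nn as nn

lm_head = nn.Linear(1024, 13_000, bias=False)       # weight of shape (13000, 1024)

def generate(token_ids, decoder, max_new_tokens=20, eos_id=2):
    for _ in range(max_new_tokens):
        hidden = decoder(token_ids)                  # (seq_len, 1024) from the last decoder layer
        logits = lm_head(hidden[-1])                 # (13000,) logits for the last position
        probs = torch.softmax(logits, dim=-1)        # probability over the vocabulary
        next_id = torch.argmax(probs)                # greedy choice of the next token
        token_ids = torch.cat([token_ids, next_id.view(1)])
        if next_id.item() == eos_id:                 # stop at the end-of-sequence token
            break
    return token_ids
```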

Each token is generated based on the previously generated tokens and context. The virtual tokens [SOS](input) and [EOS](output) help during inference to identify the beginning and end of the sequence (Image 25).

Image 25: Created by the author

Note: The weights and biases in the multi-head attention, FFN, and linear layers are learnable parameters, meaning they get updated during backpropagation. During training, the model adjusts these weights and biases to minimize the error and improve the LLM’s performance. It’s all neuron operations; I’ve just represented them in a different way.

During training, the virtual tokens [SOS] and [EOS] in image 25 are not universally necessary for all types of LLM training. This is because the model learns from the context of the training data and calculates its loss based on the sequence itself. It doesn’t need explicit markers to show where a sentence starts and ends. But, during inference, these tokens help indicate the start and end of the input sequence.

I know it can be tough to wrap your head around, so I’m trying to explain it in a way that’s easy to understand.

I hope I’ve made some sense of the large language model architecture and its functions. If you found my article useful 👍, give it a 👏! Feel free to follow for more insights. And if something doesn’t click, take some time to read it again; it will definitely make sense.

Let’s also stay in touch on 🔗LinkedIn🌏❤️to keep the conversation going!

References:

[1]. Ashish Vaswani, Noam Shazeer, Niki Parmar, Attention is All You Need Research Paper (2017)

[2]. Jay Alammar, Transformer illustration (2018)

[3]. Weight & Biases ML Article, Overview of LLM (2023)

[4]. Krish Naik, Transformer Architecture Explanation YouTube Video (2020)

[5]. LLAMA-2 Explanation (2024)

[6]. Biao Zhang, Rico Sennrich, Root Mean Square Layer Normalization (2019) Research Paper


Published via Towards AI
