Transformer Architecture Part 1

Last Updated on September 18, 2024 by Editorial Team

Author(s): Sachinsoni

Originally published on Towards AI.

In recent years, transformers have revolutionized the world of deep learning, powering everything from language models to vision tasks. If you’ve followed my previous blogs, you’re already familiar with some of the key components like self-attention, multi-head attention, layer normalization, and positional encoding. These building blocks form the core of how transformers excel at handling sequential data. In this blog, I’ll tie everything together and take you on a deeper dive into the complete architecture, showing how these components work in harmony to create models that outperform traditional neural networks.

Before we dive into the details of transformer architecture, I want to extend my heartfelt gratitude to my mentor, Nitish Sir. His exceptional guidance and teachings on the CampusX YouTube channel have been instrumental in shaping my understanding of this complex topic. With his support, I’ve embarked on this journey of exploration and learning, and I’m excited to share my insights with you all. Thank you, Nitish Sir, for being an inspiration and mentor!

Let’s start the journey of understanding the Transformer architecture from its core. The diagram often used to represent the Transformer can seem overwhelming at first glance. It includes both the encoder and decoder, but when you break it down, it becomes much easier to understand.

At a high level, the Transformer consists of two main components: the encoder and the decoder. Think of the Transformer as a large box containing two smaller boxes: one for the encoder and one for the decoder. This is the most simplified view of the architecture.

image by CampusX

But there’s a bit more complexity here. If you look closely at the diagram, you’ll notice that each of these boxes isn’t just a single block. The encoder and decoder are actually composed of multiple blocks: six encoder blocks and six decoder blocks, according to the original paper Attention Is All You Need. This number was arrived at through experimentation, as it gave the best results across a variety of tasks.

Now, here’s the key: all the encoder blocks are identical, and so are the decoder blocks. This means that once you understand the structure of one encoder block, you understand them all! So, the next logical step is to focus on understanding a single encoder block in detail, as the rest will follow from there.

Now, let’s dive into a single encoder block. When we zoom in, we see that each encoder block consists of two main components: a self-attention block and a feed-forward neural network. If you’ve read my previous blogs, you should already be familiar with both. The same structure is repeated in all six encoder blocks, so once you understand one block, you understand them all.

image by CampusX

But how do these blocks work together? The actual architecture of an encoder block includes additional components like add & norm layers and residual connections. These ensure the flow of information remains smooth as it passes through each block.

The input data, typically a batch of sentences, enters the first encoder block, undergoes processing, and the output moves to the next encoder block. This process continues across all six encoder blocks, with the final output being passed to the decoder. Each block processes the data similarly, making the entire architecture highly efficient and structured.

Flow of data: the output of one encoder block is the input to the next encoder block

Before diving into the main parts of the encoder, it’s crucial to understand the input block, where three essential operations are performed. These steps take place before the input is fed into the encoder.

  1. Tokenization: The first operation is tokenization. If you’re familiar with NLP, you’ll know that tokenization is the process of splitting a sentence into tokens. In this case, we are performing word-level tokenization, where each word in the sentence becomes an individual token. For instance, the sentence "How are you?" gets tokenized into "How", "are", and "you".
  2. Text Vectorization (Embedding): After tokenization, the words are passed through a process called text vectorization, where each word is converted into a numerical vector. This is essential because machines can’t process raw text; they need numerical representations. We use word embeddings to map each word to a vector. In our case, every word is represented as a 512-dimensional vector. For example, "How" becomes a vector of 512 numbers, "are" gets its own vector, and "you" gets another.
  3. Positional Encoding: Even though we have vectorized the words, there’s a problem: we don’t know the order of the words. Knowing the sequence of words is vital in understanding the context, as the position of each word in the sentence impacts its meaning. This is where positional encoding comes in.

Positional encoding generates a 512-dimensional vector for each word’s position in the sentence. For instance, the first word ("How") gets a positional vector, the second word ("are") gets another positional vector, and the third word ("you") gets yet another. Each positional vector has the same dimensionality as the word embedding (512 dimensions).

This image illustrates how a raw input sentence is transformed into the format required by the encoder block.

Finally, we add these positional vectors to the corresponding word embedding vectors. So, the word embedding for "How" is added to its positional vector, and similarly for "are" and "you". After this addition, we get new vectors, let’s call them X1, X2, and X3, which represent the position-aware embeddings for each word in the sentence.
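To make these three steps concrete, here is a minimal PyTorch sketch of the input block. The toy vocabulary, the word-level tokenizer, and the sinusoidal positional-encoding function are illustrative assumptions; the blog only requires that each word end up as a 512-dimensional, position-aware vector.

```python
# Minimal sketch of the input block: tokenize -> embed -> add positional encoding.
import math
import torch
import torch.nn as nn

d_model = 512                        # embedding size used throughout the blog
sentence = "How are you"
tokens = sentence.split()            # word-level tokenization: ["How", "are", "you"]

# Toy vocabulary mapping each token to an integer id (assumed for illustration).
vocab = {word: idx for idx, word in enumerate(["How", "are", "you"])}
token_ids = torch.tensor([vocab[t] for t in tokens])          # shape: (3,)

# Text vectorization: each token id is mapped to a 512-dimensional embedding.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
word_vectors = embedding(token_ids)                           # shape: (3, 512)

# Sinusoidal positional encoding (the scheme from "Attention Is All You Need"):
# one 512-dimensional vector per position, added to the word embeddings.
def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

X = word_vectors + positional_encoding(len(tokens), d_model)  # X1, X2, X3 stacked
print(X.shape)  # torch.Size([3, 512]) -> position-aware embeddings
```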

Once we have the positional encodings combined with the input embeddings, the next step is to pass these vectors through the first encoder block. In this section, we’ll focus on two key operations happening in the encoder: Multi-head Attention and Normalization.

Multi-head Attention :

At the core of the transformer architecture is the multi-head attention mechanism, which is applied to the input vectors. As a reminder, the input vectors are still of 512 dimensions each.

The input vectors are initially fed into the multi-head attention block, which is created by combining multiple self-attention mechanisms. Self-attention allows the model to understand contextual relationships between words by focusing on other words in the sentence when generating a vector for a particular word.

For instance, in a sentence like:

  • "The bank approved the loan."
  • "He sat by the river bank."

The word "bank" is used in different contexts in these two sentences. Initially, the embedding vectors for "bank" would be the same, but self-attention adjusts these vectors based on the surrounding words. In the first sentence, "bank" refers to a financial institution, while in the second, it refers to the side of a river. Self-attention ensures that the model can distinguish between these two meanings.

Now, instead of relying on just one self-attention mechanism, multi-head attention runs multiple self-attention operations in parallel. This allows the model to focus on different aspects of the sentence simultaneously, creating a more diverse and context-aware representation of the input.

So, when the multi-head attention block processes the first word (let’s call it X1), it outputs a new vector (Z1), which is still 512 dimensions but now contextually enriched. Similarly, when the second word X2 (e.g., "are") and third word X3 (e.g., "you") are processed, they produce Z2 and Z3, respectively.

An important detail is that throughout this process, the dimensionality remains consistent at 512 dimensions.
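Here is a minimal sketch of this step using PyTorch’s built-in nn.MultiheadAttention with 8 heads (the number used in the original paper). The random input matrix stands in for the position-aware vectors X1, X2, and X3; this is an illustration, not the author’s implementation.

```python
# Multi-head self-attention applied to three 512-dimensional word vectors.
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
X = torch.randn(3, d_model)          # stand-in for X1, X2, X3

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads)

# Self-attention: queries, keys, and values all come from the same input.
# nn.MultiheadAttention expects (seq_len, batch, embed_dim), so add a batch dim.
x = X.unsqueeze(1)                   # shape: (3, 1, 512)
Z, attn_weights = mha(x, x, x)       # Z1, Z2, Z3: contextually enriched vectors

print(Z.squeeze(1).shape)            # torch.Size([3, 512]) -> dimensionality preserved
print(attn_weights.shape)            # torch.Size([1, 3, 3]) -> each word attends to all words
```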

Residual Connection and Addition :

Once we get the output vectors Z1, Z2, Z3 from the multi-head attention block, we move on to the next part: the add and normalize step.

At this stage, we introduce a residual connection. The idea behind a residual connection is to bypass the multi-head attention output and carry the original input vectors X1, X2, X3 forward. These input vectors are added to their corresponding multi-head attention outputs. So, for each word, we add its original embedding to its context-aware embedding:

  • Z1 + X1
  • Z2 + X2
  • Z3 + X3

The result is a new set of vectors: Z1’, Z2’, Z3’, each of which is still 512 dimensions but now contains both the original input information and the context from multi-head attention.

Layer Normalization :

For each vector, like Z1’, which contains 512 numbers, we calculate the mean and standard deviation of those numbers. Using these two statistics, we normalize all 512 values, bringing them into a standardized range. This process is repeated for the other vectors, Z2’ and Z3’, ensuring that all vectors are consistently normalized. Additionally, gamma (γ) and beta (β) parameters are applied during this normalization process, but I’ve covered that in detail in the layer normalization blog.

The result of this operation is a set of normalized vectors:

  • Z1_norm, Z2_norm, and Z3_norm.

Each of these vectors remains 512 dimensions, but the values are now contained within a smaller, well-defined range.

image by CampusX
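A minimal sketch of the add & norm step, assuming random stand-ins for X1..X3 and Z1..Z3. In PyTorch, nn.LayerNorm holds the gamma (weight) and beta (bias) parameters mentioned above.

```python
# Residual connection followed by layer normalization.
import torch
import torch.nn as nn

d_model = 512
X = torch.randn(3, d_model)          # original inputs X1, X2, X3
Z = torch.randn(3, d_model)          # multi-head attention outputs Z1, Z2, Z3

Z_prime = Z + X                      # residual connection: Z1+X1, Z2+X2, Z3+X3

layer_norm = nn.LayerNorm(d_model)   # holds gamma (weight) and beta (bias)
Z_norm = layer_norm(Z_prime)         # Z1_norm, Z2_norm, Z3_norm

# Each row is normalized across its own 512 values.
print(Z_norm.mean(dim=-1))           # approximately 0 for every word
print(Z_norm.std(dim=-1))            # approximately 1 for every word
```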

Why Normalize?

The key question is, why do we need to normalize these vectors?

The answer is straightforward: stabilizing the training process. Without normalization, the output of multi-head attention, such as Z1, Z2, Z3, could exist in any range, as there’s no limit on the values produced by the self-attention mechanism. For instance, since self-attention involves multiplying numbers and performing various mathematical operations, the resulting values can vary widely. This unpredictability can destabilize training because neural networks perform best when the numbers they work with are in a small, consistent range.

By normalizing the vectors, we ensure that they remain in a stable range, which helps with the overall training process. When we add the original input vectors (X1, X2, X3) to the attention outputs (Z1, Z2, Z3), the numbers could become even larger. Hence, layer normalization is crucial to bring them back into a manageable range.

The Role of the Residual Connection :

Another question you might have is: why do we use this residual connection (or skip connection) to add the original inputs back after the multi-head attention block?

The purpose of this addition is to enable a residual connection, which helps in gradient flow during training and allows the model to learn more effectively without vanishing gradients.

Feed Forward Network :

After layer normalization, the normalized vectors, Z1_norm, Z2_norm, and Z3_norm, are passed through a Feed Forward Neural Network (FFNN). Let’s break down its architecture as described in the research paper:

  • The input layer is not counted as part of the neural network, but it receives the 512-dimensional input vectors.
  • The feed-forward network consists of two layers:
  1. First layer with 2048 neurons and a ReLU activation function.
  2. Second layer with 512 neurons and a linear activation function.

Weights and Biases in the FFNN :

  • The weights between the input and the first layer form a 512 × 2048 matrix, represented as W1.
  • Each of the 2048 neurons in the first layer has its own bias, represented collectively as B1.
  • The weights between the first and second layer form a 2048 × 512 matrix, represented as W2.
  • Each of the 512 neurons in the second layer has its own bias, represented collectively as B2.

Processing the Input :

The input vectors Z1_norm, Z2_norm, and Z3_norm can be imagined as stacked together to form a 3 × 512 matrix, where each row corresponds to one vector. This matrix is then fed into the FFNN.

  1. The input matrix is multiplied by the weights W1 and the bias B1 is added.
  2. A ReLU activation is applied to introduce non-linearity.
  3. The output is a 3 × 2048 matrix, representing the expanded dimensionality.
  4. This matrix is multiplied by the weights W2 and bias B2 is added, resulting in a 3 × 512 matrix.

Essentially, the dimensionality of the input vectors is first increased from 512 to 2048, and then reduced back to 512.

image by CampusX
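Here is a minimal sketch of this position-wise feed-forward network in PyTorch. The two nn.Linear layers hold W1/B1 and W2/B2 (PyTorch stores each weight matrix as its transpose), and the random 3 × 512 input stands in for the normalized vectors.

```python
# Feed-forward network: 512 -> 2048 with ReLU, then 2048 -> 512 (linear).
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
Z_norm = torch.randn(3, d_model)             # Z1_norm, Z2_norm, Z3_norm stacked

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),                # W1 and B1 (weight stored as 2048 x 512)
    nn.ReLU(),                               # non-linearity
    nn.Linear(d_ff, d_model),                # W2 and B2 (weight stored as 512 x 2048)
)

hidden = ffn[1](ffn[0](Z_norm))              # intermediate expanded representation
Y = ffn(Z_norm)                              # Y1, Y2, Y3

print(hidden.shape)                          # torch.Size([3, 2048])
print(Y.shape)                               # torch.Size([3, 512])
```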

Why Increase and Then Reduce Dimensionality?

You might wonder, what’s the benefit of first increasing the dimensionality and then reducing it again? The key benefit comes from the ReLU activation in the first layer, which introduces non-linearities into the model. This allows the FFNN to learn more complex patterns than it could with a simple linear transformation.

Final Output of the FFNN :

The final result is a set of three vectors, each with 512 dimensions, similar to the input. Let’s call these vectors Y1, Y2, and Y3.

Add & Normalize :

After the feed-forward network processes the input, we obtain three vectors, Y1, Y2, and Y3, each with a dimensionality of 512. These correspond to the output of the feed-forward network.

Now, we perform an add operation. The original input vectors Z1_norm, Z2_norm, and Z3_norm are bypassed and added to the output vectors Y1, Y2, and Y3, respectively. This results in a new set of vectors, which we’ll call Y1', Y2', and Y3'. All these vectors are still 512-dimensional.

As with the earlier residual connection, the purpose of this addition is to help gradients flow during training, allowing the model to learn effectively without vanishing gradients.

Layer Normalization :

After the addition, layer normalization is applied again, just like we did earlier in the transformer block. Each of the vectors Y1', Y2', and Y3' undergoes normalization to ensure that the values are scaled properly, stabilizing the learning process. The resulting vectors are Y1_norm, Y2_norm, and Y3_norm.

image by CampusX
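Putting the pieces together, here is a compact sketch of one encoder block as described so far: multi-head attention, add & norm, a feed-forward network, and a second add & norm. It is an illustration of the structure, not the exact code of any particular library, and it omits details such as dropout.

```python
# One encoder block: MHA -> add & norm -> FFN -> add & norm.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, 512)
        z, _ = self.mha(x, x, x)             # multi-head self-attention
        x = self.norm1(x + z)                # add (residual) & normalize
        y = self.ffn(x)                      # position-wise feed-forward network
        return self.norm2(x + y)             # add (residual) & normalize

block = EncoderBlock()
x = torch.randn(1, 3, 512)                   # one sentence of three words
print(block(x).shape)                        # torch.Size([1, 3, 512])
```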

Next Encoder Block :

These normalized vectors Y1_norm, Y2_norm, and Y3_norm are then passed as inputs to the next encoder block. This is similar to how the original input vectors X1, X2, and X3 were fed into the first encoder block.

In the next encoder block, the same operations will occur:

  1. Multi-head attention will be applied.
  2. Add & normalize will follow.
  3. The output will then be processed through another feed-forward network.
  4. Again, we’ll have an add & normalize step before passing the vectors to the next encoder block.

This process is repeated across a total of six encoder blocks, after which the output is passed to the decoder portion of the transformer. We’ll cover the decoder architecture in upcoming blogs. I hope you now have a clear understanding of the transformer’s encoder architecture. Here is a brief view of the internal structure of an encoder:

The internal structure of an encoder of a Transformer
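For a quick sanity check of the full stack, PyTorch’s built-in nn.TransformerEncoderLayer and nn.TransformerEncoder follow the same recipe described here (attention, add & norm, feed-forward, add & norm, with post-layer normalization by default). Stacking six layers gives six blocks, each with its own parameters, which also illustrates the first point in the note that follows. This is a sketch with the paper’s hyperparameters, not the author’s code.

```python
# Six encoder blocks chained together using PyTorch's built-in modules.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=512,             # embedding size
    nhead=8,                 # number of attention heads
    dim_feedforward=2048,    # hidden size of the feed-forward network
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # six encoder blocks

x = torch.randn(1, 3, 512)           # one sentence, three words, 512 dimensions
out = encoder(x)                     # output of block 1 feeds block 2, and so on
print(out.shape)                     # torch.Size([1, 3, 512]) -> passed to the decoder

# Each of the six blocks holds its own copy of the parameters:
print(sum(p.numel() for p in encoder.parameters()))    # six times a single layer's count
```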

Important Note:

1. Unique Parameters in Each Encoder Block :

One key point to remember is that while the architecture of each encoder block remains the same, the parameters (such as the weights and biases in the attention and feed-forward layers) are unique to each encoder block. Each encoder block has its own set of learned parameters that are adjusted independently during backpropagation.

2. Why Use Feed-Forward Neural Networks (FFNs)?

When you look at the workings of multi-head attention, you’ll notice that all operations, such as computing the dot products between queries, keys, and values, are linear. This is great for capturing contextual embeddings, but sometimes the data may have non-linear complexities that can’t be fully captured by linear transformations alone.

This is where the feed-forward neural network comes into play. By using an activation function like ReLU, the FFN introduces non-linearity, allowing the model to better handle more complex data patterns.

Even though this is the general understanding, it is important to note that the exact role of FFNs in transformers remains a bit of a gray area. As of now, research is still ongoing, and new insights are emerging. One interesting paper I came across suggests that feed-forward layers in transformers act as key-value memory storage. This paper highlights how FFNs might play a more important role than we currently understand.

References :

Research paper: Attention Is All You Need

YouTube video: https://youtu.be/Vs87qcdm8l0?si=aO-EAnqjwytHm14h

I trust this blog has enriched your understanding of Transformer encoder architecture. If you found value in this content, I invite you to stay connected for more insightful posts. Your time and interest are greatly appreciated. Thank you for reading!


