Build your own Large Language Model (LLM) From Scratch Using PyTorch

Last Updated on June 11, 2024 by Editorial Team

Author(s): Milan Tamang

Originally published on Towards AI.

A step-by-step guide to building and training an LLM named MalayGPT. This model's task is to translate text from English to Malay.

What will you achieve by the end of this post? You will be able to build and train a Large Language Model (LLM) by yourself while coding along with me. Although we're building an LLM that translates any given text from English to Malay, you can easily modify this architecture for other language translation tasks.

LLMs are the core foundation of the most popular AI chatbots, such as ChatGPT, Gemini, Meta AI, and Mistral AI. At the very core of every LLM there is an architecture called the Transformer. So, we'll first build the transformer architecture based on the famous paper "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).

Transformer Architecture from paper “Attention is all you need”

First, we'll build all the components of the transformer model block by block. Then, we'll assemble the blocks to build our model. After that, we'll train and validate our model with a dataset from the Hugging Face hub. Finally, we'll test our model by performing translation on new text.

Important note: I'll code all the components of the transformer architecture step by step and provide the necessary explanations of what, why, and how. I'll also provide line-by-line comments on any code that I feel requires an explanation. This way, I believe you can connect with the overall workflow while coding along yourself.

Let’s code together!

Step 1: Load dataset

For an LLM to translate from English to Malay, we'll need a dataset that has source (English) and target (Malay) language pairs. So, we'll use a dataset from Hugging Face called "Helsinki-NLP/opus-100". It has 1 million English-Malay pairs in the training split, which is more than sufficient to get good accuracy, and 2,000 pairs each in the validation and test splits. It already comes pre-split, so we don't have to do the dataset splitting ourselves.
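As a quick sanity check (a minimal sketch, assuming the dataset layout described above), each record in opus-100 carries a "translation" field holding the English ("en") and Malay ("ms") strings:

# Minimal sketch: peek at one raw record to confirm the expected structure.
from datasets import load_dataset

sample = load_dataset("Helsinki-NLP/opus-100", "en-ms", split="train")[0]
print(sample)   # expected form: {'translation': {'en': '...', 'ms': '...'}}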

# Import necessary libraries
# Install the datasets and tokenizers libraries if you haven't already (!pip install datasets tokenizers).
import os
import math
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
from datasets import load_dataset
from tqdm import tqdm

# Assign device value as "cuda" to train on GPU if GPU is available. Otherwise it will fall back to default as "cpu".
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loading train, validation, test dataset from huggingface path below.
raw_train_dataset = load_dataset("Helsinki-NLP/opus-100", "en-ms", split='train')
raw_validation_dataset = load_dataset("Helsinki-NLP/opus-100", "en-ms", split='validation')
raw_test_dataset = load_dataset("Helsinki-NLP/opus-100", "en-ms", split='test')

# Directories to store dataset files.
os.makedirs("./dataset-en", exist_ok=True)
os.makedirs("./dataset-my", exist_ok=True)

# Directory to save the model during training after each epoch (in step 10).
os.makedirs("./malaygpt", exist_ok=True)

# Directories to store the source and target tokenizers.
os.makedirs("./tokenizer_en", exist_ok=True)
os.makedirs("./tokenizer_my", exist_ok=True)

dataset_en = []
dataset_my = []
file_count = 1

# In order to train the tokenizer (in step 2), we'll separate the training dataset into English and Malay.
# Create multiple small files of 50k sentences each and store them in the dataset-en and dataset-my directories.
for data in tqdm(raw_train_dataset["translation"]):
    dataset_en.append(data["en"].replace('\n', " "))
    dataset_my.append(data["ms"].replace('\n', " "))
    if len(dataset_en) == 50000:
        with open(f'./dataset-en/file{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(dataset_en))
        dataset_en = []

        with open(f'./dataset-my/file{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(dataset_my))
        dataset_my = []
        file_count += 1

Step 2: Create Tokenizer

The transformer model doesn't process raw text; it only processes numbers. Hence, we have to convert the raw text into numbers. For that, we're going to use a popular tokenizer called the BPE tokenizer, a subword tokenizer used in models like GPT-3. We'll first train the BPE tokenizer on the corpus data (our training dataset), which we prepared in step 1. The flow goes like the diagram below.

Tokenizer flow

After training is completed, the tokenizer generates a vocabulary for both English and Malay. A vocabulary is a collection of unique tokens from the corpus data. Since we're performing a translation task, we require a tokenizer for each language. The BPE tokenizer takes raw text, maps it to the tokens in the vocabulary, and returns a token for each word in the input text. The tokens can be whole words or sub-words. This is one of the advantages of a sub-word tokenizer over other tokenizers: it can overcome the OOV (out of vocabulary) problem. The tokenizer then returns the unique index or position ID of each token in the vocabulary, which will be used further to create embeddings, as shown in the flow above.

# Import tokenizer library classes and modules.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# path to the training dataset files which will be used to train tokenizer.
path_en = [str(file) for file in Path('./dataset-en').glob("**/*.txt")]
path_my = [str(file) for file in Path('./dataset-my').glob("**/*.txt")]

# [ Creating Source Language Tokenizer - English ].
# Additional special tokens are created such as [UNK] - to represent Unknown words, [PAD] - Padding token to maintain same sequence length across the model.
# [CLS] - token to denote start of sentence, [SEP] - token to denote end of sentence.
tokenizer_en = Tokenizer(BPE(unk_token="[UNK]"))
trainer_en = BpeTrainer(min_frequency=2, special_tokens=["[PAD]","[UNK]","[CLS]", "[SEP]", "[MASK]"])

# splitting tokens based on whitespace.
tokenizer_en.pre_tokenizer = Whitespace()

# Train the tokenizer on the dataset files created in step 1.
tokenizer_en.train(files=path_en, trainer=trainer_en)

# Save tokenizer for future use.
tokenizer_en.save("./tokenizer_en/tokenizer_en.json")

# [ Creating Target Language Tokenizer - Malay ].
tokenizer_my = Tokenizer(BPE(unk_token="[UNK]"))
trainer_my = BpeTrainer(min_frequency=2, special_tokens=["[PAD]","[UNK]","[CLS]", "[SEP]", "[MASK]"])
tokenizer_my.pre_tokenizer = Whitespace()
tokenizer_my.train(files=path_my, trainer=trainer_my)
tokenizer_my.save("./tokenizer_my/tokenizer_my.json")

tokenizer_en = Tokenizer.from_file("./tokenizer_en/tokenizer_en.json")
tokenizer_my = Tokenizer.from_file("./tokenizer_my/tokenizer_my.json")

# Get the vocabulary size of both tokenizers.
source_vocab_size = tokenizer_en.get_vocab_size()
target_vocab_size = tokenizer_my.get_vocab_size()

# Define token-ids variables, we need this for training model.
CLS_ID = torch.tensor([tokenizer_my.token_to_id("[CLS]")], dtype=torch.int64).to(device)
SEP_ID = torch.tensor([tokenizer_my.token_to_id("[SEP]")], dtype=torch.int64).to(device)
PAD_ID = torch.tensor([tokenizer_my.token_to_id("[PAD]")], dtype=torch.int64).to(device)
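Before moving on, it helps to sanity-check the trained tokenizers. A minimal sketch (the example sentence is hypothetical):

# Minimal sketch: encode a sample English sentence and inspect the sub-word tokens and their ids.
sample_encoding = tokenizer_en.encode("I love reading books")
print(sample_encoding.tokens)   # sub-word tokens from the trained BPE vocabulary
print(sample_encoding.ids)      # the corresponding token ids, which will later be turned into embeddings
print(source_vocab_size, target_vocab_size)   # vocabulary sizes of the two tokenizers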

Step 3: Prepare Dataset and DataLoader

In this step, we are going to prepare the dataset for both the source and target languages, which will be used later to train and validate the model we'll be building. We'll create a class that takes in the raw dataset and define a function that encodes the source and target texts separately using the source (tokenizer_en) and target (tokenizer_my) tokenizers. Finally, we'll create a DataLoader for the train and validation datasets, which iterates over the dataset in batches (in our example, the batch size is set to 10). The batch size can be changed based on the size of the data and the available processing power.

# This class takes the raw dataset and max_seq_len (maximum length of a sequence in the entire dataset).
class EncodeDataset(Dataset):
    def __init__(self, raw_dataset, max_seq_len):
        super().__init__()
        self.raw_dataset = raw_dataset
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.raw_dataset)

    def __getitem__(self, index):

        # Fetching the raw text for the given index, which consists of a source and target pair.
        raw_text = self.raw_dataset[index]

        # Separating the text into source and target text, which will later be used for encoding.
        source_text = raw_text["en"]
        target_text = raw_text["ms"]

        # Encoding the source text with the source tokenizer (tokenizer_en) and the target text with the target tokenizer (tokenizer_my).
        source_text_encoded = torch.tensor(tokenizer_en.encode(source_text).ids, dtype = torch.int64).to(device)
        target_text_encoded = torch.tensor(tokenizer_my.encode(target_text).ids, dtype = torch.int64).to(device)

        # To train the model, the sequence length of each input sequence should equal the max seq length.
        # Hence, additional padding tokens will be added to the input sequence if its length is less than max_seq_len.
        num_source_padding = self.max_seq_len - len(source_text_encoded) - 2
        num_target_padding = self.max_seq_len - len(target_text_encoded) - 1

        encoder_padding = torch.tensor([PAD_ID] * num_source_padding, dtype = torch.int64).to(device)
        decoder_padding = torch.tensor([PAD_ID] * num_target_padding, dtype = torch.int64).to(device)

        # encoder_input has the start-of-sentence token CLS_ID first, followed by the source encoding, followed by the end-of-sentence token SEP.
        # To reach the required max_seq_len, additional PAD tokens are added at the end.
        encoder_input = torch.cat([CLS_ID, source_text_encoded, SEP_ID, encoder_padding]).to(device)

        # decoder_input has the start-of-sentence token CLS_ID first, followed by the target encoding.
        # To reach the required max_seq_len, additional PAD tokens are added at the end. There is no end-of-sentence token SEP in decoder_input.
        decoder_input = torch.cat([CLS_ID, target_text_encoded, decoder_padding]).to(device)

        # target_label has the target encoding first, followed by the end-of-sentence token SEP. There is no start-of-sentence token CLS in the target label.
        # To reach the required max_seq_len, additional PAD tokens are added at the end.
        target_label = torch.cat([target_text_encoded, SEP_ID, decoder_padding]).to(device)

        # As we've added extra padding tokens to the input encoding, we don't want the model to train on them, as there is nothing to learn from these tokens.
        # So, we'll use an encoder mask to nullify the padding token values prior to calculating the output of self-attention in the encoder block.
        encoder_mask = (encoder_input != PAD_ID).unsqueeze(0).unsqueeze(0).int().to(device)

        # We also don't want any token to be influenced by future tokens during the decoding stage. Hence, a causal mask is applied during masked multi-head attention to handle this.
        decoder_mask = (decoder_input != PAD_ID).unsqueeze(0).unsqueeze(0).int() & causal_mask(decoder_input.size(0)).to(device)

        return {
            'encoder_input': encoder_input,
            'decoder_input': decoder_input,
            'target_label': target_label,
            'encoder_mask': encoder_mask,
            'decoder_mask': decoder_mask,
            'source_text': source_text,
            'target_text': target_text
        }

# The causal mask makes sure any token that comes after the current token is masked, meaning its value is replaced by -ve infinity, which is converted to zero or close to zero by the softmax function.
# Hence, the model will simply ignore these values and won't be able to learn anything from them.
def causal_mask(size):
    # dimension of causal mask (batch_size, seq_len, seq_len)
    mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
    return mask == 0

# Calculate the max sequence length in the entire training dataset for the source and target datasets.
max_seq_len_source = 0
max_seq_len_target = 0

for data in raw_train_dataset["translation"]:
    enc_ids = tokenizer_en.encode(data["en"]).ids
    dec_ids = tokenizer_my.encode(data["ms"]).ids
    max_seq_len_source = max(max_seq_len_source, len(enc_ids))
    max_seq_len_target = max(max_seq_len_target, len(dec_ids))

print(f'max_seqlen_source: {max_seq_len_source}')   # 530
print(f'max_seqlen_target: {max_seq_len_target}')   # 526

# To simplify the training process, we'll use a single max_seq_len for both languages and add a margin of 20 to cover the additional special tokens such as [PAD], [CLS], [SEP] in the sequence.
max_seq_len = 550

# Instantiate the EncodeDataset class and create the encoded train and validation datasets.
train_dataset = EncodeDataset(raw_train_dataset["translation"], max_seq_len)
val_dataset = EncodeDataset(raw_validation_dataset["translation"], max_seq_len)

# Creating DataLoader wrappers for both the training and validation datasets. These dataloaders will be used later during training and validation of our LLM model.
train_dataloader = DataLoader(train_dataset, batch_size = 10, shuffle = True, generator=torch.Generator(device='cuda'))
val_dataloader = DataLoader(val_dataset, batch_size = 1, shuffle = True, generator=torch.Generator(device='cuda'))
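As a quick check (a minimal sketch; the shapes assume batch_size=10 and the max_seq_len of 550 set above), we can pull one batch from the train dataloader and verify the tensor shapes the model will see:

# Minimal sketch: fetch one batch and print the tensor shapes.
batch = next(iter(train_dataloader))
print(batch['encoder_input'].shape)   # (10, 550)
print(batch['decoder_input'].shape)   # (10, 550)
print(batch['target_label'].shape)    # (10, 550)
print(batch['encoder_mask'].shape)    # (10, 1, 1, 550)
print(batch['decoder_mask'].shape)    # (10, 1, 550, 550)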

Step 4: Input Embedding and Positional Encoding

Input Embedding: The sequence of token IDs generated by the tokenizers in step 2 is fed into the embedding layer. The embedding layer maps each token ID to an embedding vector of dimension 512 (the dimension 512 is taken from the attention paper). The embedding vector can capture the semantic meaning of the token based on the dataset it has been trained on. Each dimension value inside the embedding vector represents some kind of feature related to the token. For example, if the token is "dog", some dimension values might represent eyes, mouth, legs, height, and so on. If we draw the vectors in n-dimensional space, similar objects such as dogs and cats would be located near each other, while dissimilar objects such as schools and homes would be located much farther away.

Positional Encoding: One of the advantages of the transformer architecture is that it can process all the tokens of an input sequence in parallel, which reduces training time a lot and also makes prediction much faster. However, one drawback is that while processing tokens in parallel, the model has no notion of the order of tokens in a sentence, and the position of a token can change the meaning or context of the sentence. Hence, to resolve this issue, the attention paper introduces the Positional Encoding method. The paper suggests applying two mathematical functions (one sine and one cosine) at each index of a token's 512-dimensional embedding. Below are the sine and cosine functions.
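From the paper, for position pos and embedding dimension index i (with d_model = 512):

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )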

The sine function is applied to each even dimension index, whereas the cosine function is applied to each odd dimension index of the embedding vector. Finally, the resulting positional encoding vector is added to the embedding vector. Now, we have an embedding vector that captures both the semantic meaning of the token and its position. Note that the positional encoding values are the same for every sequence.

# Input embedding and positional encoding
class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model

        # Using the pytorch embedding layer module to map a token id to its embedding vector.
        # vocab_size is the vocabulary size of the training dataset created by the tokenizer in step 2.
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, input):
        # In addition to feeding the input sequence to the embedding layer, the extra multiplication by the square root of d_model is done to scale the embedding layer output (as in the paper).
        embedding_output = self.embedding(input) * math.sqrt(self.d_model)
        return embedding_output


class PositionalEncoding(nn.Module):
    def __init__(self, max_seq_len: int, d_model: int, dropout_rate: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout_rate)

        # We're creating a matrix of the same shape as the embedding vectors (max_seq_len, d_model).
        pe = torch.zeros(max_seq_len, d_model)

        # Calculate the position part of the PE functions.
        pos = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

        # Calculate the division part of the PE functions. Note that the div term expression is slightly different from the paper's expression, as this exponential form works better numerically.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Fill the even and odd columns with the sine and cosine function results.
        pe[:, 0::2] = torch.sin(pos * div_term)
        pe[:, 1::2] = torch.cos(pos * div_term)

        # Since we're expecting the input sequences in batches, an extra batch_size dimension is added at position 0.
        pe = pe.unsqueeze(0)

        # Register pe as a buffer so it is stored with the module and moved to the right device, but not trained.
        self.register_buffer('pe', pe)

    def forward(self, input_embedding):
        # Add the positional encoding to the input embedding vector.
        input_embedding = input_embedding + (self.pe[:, :input_embedding.shape[1], :]).requires_grad_(False)

        # Perform dropout to prevent overfitting.
        return self.dropout(input_embedding)
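A minimal sketch of how these two layers are chained (the shapes assume d_model = 512 and the max_seq_len of 550 defined earlier):

# Minimal sketch: token ids -> embeddings -> embeddings + positional encoding.
embed_layer = EmbeddingLayer(vocab_size=source_vocab_size, d_model=512).to(device)
pos_encoding = PositionalEncoding(max_seq_len=550, d_model=512, dropout_rate=0.1).to(device)

dummy_token_ids = torch.randint(0, source_vocab_size, (2, 550)).to(device)   # a dummy batch of 2 sequences
print(pos_encoding(embed_layer(dummy_token_ids)).shape)   # (2, 550, 512)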

Step 5: Multi-Head Attention Block

Just as the Transformer is the heart of an LLM, the self-attention mechanism is the heart of the Transformer architecture.

So why do you need self-attention? Let's answer this with a simple example below.

In sentence 1 and sentence 2, the word "bank" clearly has two different meanings. However, the embedding value of the word "bank" is the same in both sentences. That is not what we want. We want the embedding value to change based on the context of the sentence. Hence, we need a mechanism where the embedding value can dynamically change to give the contextual meaning based on the overall meaning of the sentence. The self-attention mechanism can dynamically update the embedding value to represent the contextual meaning within the sentence.

If self-attention is already so good, why do we need Multi-Head Self-attention? Let’s look at another example below to find out the answer.

In this example, plain self-attention might focus on only one aspect of the sentence, maybe just the "what" aspect, as in it could only capture "What did John do?". However, the other aspects, such as "when" or "where", are equally important for the model to learn in order to perform better. So, we need a way for the self-attention mechanism to learn multiple relationships in a sentence at once. This is where Multi-Head Self-Attention (Multi-Head Attention can be used interchangeably) comes in and helps. In multi-head attention, the single-head embedding is divided into multiple heads so that each head looks into a different aspect of the sentence and learns accordingly, which is exactly what we want.

Now we know why we need multi-head attention. Let's look at the "how": how does multi-head attention actually work? Let's dive right into it.

If you're comfortable with matrix multiplication, the mechanism is pretty easy to understand. Let's take a look at the entire flow diagram first, and I'll explain the flow from input to output of multi-head attention in the point-wise description below.

Image source: https://github.com/hkproj/transformer from scratch notes

1. First, let's make 3 copies of the encoder input (the combination of input embedding and positional encoding that we produced in step 4). Let's give each of them a name: Q, K, and V. Each of them is just a copy of the encoder input. Encoder input shape: (seq_len, d_model), where seq_len is the max sequence length and d_model is the embedding vector dimension, 512 in this case.

2. Next, we'll perform a matrix multiplication of Q with the weight W_q, K with the weight W_k, and V with the weight W_v. Each weight matrix has a shape of (d_model, d_model). The resulting new query, key, and value embedding vectors have the shape (seq_len, d_model). The weight parameters are initialized randomly by the model and updated as the model starts training. Why do we need the weight matrix multiplications in the first place? Because these are learnable parameters that let the query, key, and value embedding vectors give better representations.

3. As per the attention paper, the number of heads would be 8. Each new query, key, and value embedding vector will be divided into 8 smaller units of query, key and value embedding vector. The new shape of the embedding vector is (seq_len, d_model/num_heads) or (seq_len, d_k). [ d_k = d_model/num_heads ].

4. Each query embedding vector performs a dot product operation with the transpose of its own key embedding vector and of all the other key embedding vectors in the sequence. This dot product gives the attention score. The attention score shows how similar the given token is to all the other tokens in the input sequence. The higher the score, the greater the similarity.

  • The attention score is then divided by the square root of d_k, which is required to normalize the score values across the matrix. But why divide by the square root of d_k and not some other number? The main reason is that as the embedding vector dimension increases, the total variance in the attention matrix increases proportionately, so dividing by the square root of d_k balances out that increase in variance. If we don't scale, then for any very high attention score the softmax function gives a probability close to one, and for any low attention score it gives a probability close to zero. This eventually makes the model focus only on the features with those extreme probability values and ignore the rest, which can lead to vanishing gradients. Hence, normalizing the attention score matrix is necessary (see the short numeric sketch after this list).
  • Before applying the softmax function, if the encoder mask is not None, the mask is applied to the attention score: masked positions are replaced with -ve infinity. If the mask is a causal mask, the attention score values for tokens that come after the current token in the sequence are the ones replaced with -ve infinity. The softmax function converts -ve infinity to a value close to zero. Hence, the model will not learn from tokens that come after the current token. This is how we prevent future tokens from influencing the model's learning.

5. The softmax function is then applied to the attention score matrix and outputs a weight matrix of shape (seq_len, seq_len).

6. This weight matrix is then multiplied with the corresponding value embedding vectors. This results in 8 attention heads, each with the shape (seq_len, d_v). [ d_v = d_model/num_heads ].

7. Finally, all the heads are concatenated back into a single head with the shape (seq_len, d_model). This concatenated head is then multiplied by the output weight matrix W_o (d_model, d_model). The final output of multi-head attention represents the contextual meaning of each word as well as the model's ability to learn multiple aspects of the input sentence.
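To make the scaling argument in step 4 concrete, here is a tiny standalone sketch (the score values are made up) showing how softmax saturates on large unscaled scores and how dividing by the square root of d_k keeps the distribution softer:

# Minimal sketch: effect of scaling attention scores by sqrt(d_k) before softmax (made-up numbers).
import math
import torch

scores = torch.tensor([12.0, 9.0, 3.0, 1.0])   # hypothetical raw scores of one query against 4 keys
d_k = 64                                        # head dimension when d_model=512 and num_heads=8

print(torch.softmax(scores, dim=-1))                   # unscaled: nearly all probability mass on one token
print(torch.softmax(scores / math.sqrt(d_k), dim=-1))  # scaled: a softer, more learnable distribution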

With that, let's start coding the multi-head attention block, which is shorter than the explanation above.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout_rate: float):
        super().__init__()
        # Define dropout to prevent overfitting.
        self.dropout = nn.Dropout(dropout_rate)

        # Weight matrices are introduced and are all learnable parameters.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.num_heads = num_heads
        assert d_model % num_heads == 0, "d_model must be divisible by number of heads"

        # d_k is the new dimension of each split self-attention head.
        self.d_k = d_model // num_heads

    def forward(self, q, k, v, encoder_mask=None):

        # We'll be training our model with multiple batches of sequences at once in parallel, hence we need to include batch_size in the shape as well.
        # query, key and value are calculated by matrix multiplication of the corresponding weights with the input embeddings.
        # Change of shape: q(batch_size, seq_len, d_model) @ W_q(d_model, d_model) => query(batch_size, seq_len, d_model) [same for key and value].
        query = self.W_q(q)
        key = self.W_k(k)
        value = self.W_v(v)

        # Splitting query, key and value into the number of heads. d_model is split into d_k across 8 heads.
        # Change of shape: query(batch_size, seq_len, d_model) => query(batch_size, seq_len, num_heads, d_k) => query(batch_size, num_heads, seq_len, d_k) [same for key and value].
        query = query.view(query.shape[0], query.shape[1], self.num_heads, self.d_k).transpose(1,2)
        key = key.view(key.shape[0], key.shape[1], self.num_heads, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.num_heads, self.d_k).transpose(1,2)

        # :: SELF ATTENTION BLOCK STARTS ::

        # The attention score is calculated to find the similarity or relation between the query and the keys of itself and all other embeddings in the sequence.
        # Change of shape: query(batch_size, num_heads, seq_len, d_k) @ key(batch_size, num_heads, seq_len, d_k) => attention_score(batch_size, num_heads, seq_len, seq_len).
        attention_score = (query @ key.transpose(-2,-1)) / math.sqrt(self.d_k)

        # If a mask is provided, the attention score needs to be modified as per the mask value. Refer to the details in point no 4.
        if encoder_mask is not None:
            attention_score = attention_score.masked_fill(encoder_mask == 0, -1e9)

        # The softmax function calculates the probability distribution among all the attention scores. It assigns a higher probability to a higher attention score, meaning more similar tokens get higher probability values.
        # Change of shape: same as attention_score
        attention_weight = torch.softmax(attention_score, dim=-1)

        if self.dropout is not None:
            attention_weight = self.dropout(attention_weight)

        # The final step in the self-attention block is the matrix multiplication of attention_weight with the value embedding vector.
        # Change of shape: attention_weight(batch_size, num_heads, seq_len, seq_len) @ value(batch_size, num_heads, seq_len, d_k) => attention_output(batch_size, num_heads, seq_len, d_k)
        attention_output = attention_weight @ value

        # :: SELF ATTENTION BLOCK ENDS ::

        # Now, all the heads are combined back into a single head.
        # Change of shape: attention_output(batch_size, num_heads, seq_len, d_k) => attention_output(batch_size, seq_len, num_heads, d_k) => attention_output(batch_size, seq_len, d_model)
        attention_output = attention_output.transpose(1,2).contiguous().view(attention_output.shape[0], -1, self.num_heads * self.d_k)

        # Finally, attention_output is multiplied with the output weight matrix to give the final multi-head attention output.
        # The shape of multihead_output is the same as the embedding input.
        # Change of shape: attention_output(batch_size, seq_len, d_model) @ W_o(d_model, d_model) => multihead_output(batch_size, seq_len, d_model)
        multihead_output = self.W_o(attention_output)

        return multihead_output
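A quick shape check on the block we just wrote (a minimal sketch with dummy inputs; d_model = 512 and 8 heads):

# Minimal sketch: the multi-head attention output keeps the input shape (batch_size, seq_len, d_model).
mha = MultiHeadAttention(d_model=512, num_heads=8, dropout_rate=0.1).to(device)
dummy_input = torch.rand(2, 550, 512).to(device)          # dummy encoder input: (batch_size, seq_len, d_model)
dummy_mask = torch.ones(2, 1, 1, 550).int().to(device)    # dummy padding mask (no positions masked)
print(mha(dummy_input, dummy_input, dummy_input, dummy_mask).shape)   # (2, 550, 512)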

Step 6: Feedforward Network, Layer Normalization and AddAndNorm

Feedforward Network: The feedforward network uses a deep neural network to learn the features of the embedding vector across two linear layers (the first maps d_model to d_ff and the second maps d_ff back to d_model, with the values assigned as per the attention paper). The ReLU activation function is applied to the output of the first linear layer to provide non-linearity, and dropout is applied to further avoid overfitting.

LayerNorm: We apply layer normalization to the embedding values to ensure the distribution of values across the embedding vector remains consistent throughout the network. This ensures smooth learning. We use extra learnable parameters called gamma and beta to scale and shift the embedding values as the network needs.

AddAndNorm: This consists of a skip connection and layer normalization (explained earlier). During the forward pass, the skip connection ensures that features from earlier layers can still be remembered at later stages and contribute to calculating the output. Similarly, during backpropagation, the skip connection helps prevent vanishing gradients by giving the gradient a direct path back to earlier layers. AddAndNorm is used in both the encoder (2 times) and the decoder block (3 times). It takes the input from the previous layer, normalizes it, passes it through the sub-layer, and adds the result back to the input.

# Feedforward Network, Layer Normalization and AddAndNorm Block
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout_rate: float):
        super().__init__()

        self.layer_1 = nn.Linear(d_model, d_ff)
        self.activation_1 = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.layer_2 = nn.Linear(d_ff, d_model)

    def forward(self, input):
        return self.layer_2(self.dropout(self.activation_1(self.layer_1(input))))

class LayerNorm(nn.Module):
    def __init__(self, eps: float = 1e-5):
        super().__init__()
        # Epsilon is a very small value that plays an important role in preventing a potential division-by-zero problem.
        self.eps = eps

        # Extra learnable parameters gamma and beta are introduced to scale and shift the embedding values as the network needs.
        self.gamma = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, input):
        mean = input.mean(dim=-1, keepdim=True)
        std = input.std(dim=-1, keepdim=True)

        return self.gamma * ((input - mean) / (std + self.eps)) + self.beta


class AddAndNorm(nn.Module):
    def __init__(self, dropout_rate: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout_rate)
        self.layer_norm = LayerNorm()

    def forward(self, input, sub_layer):
        return input + self.dropout(sub_layer(self.layer_norm(input)))
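A minimal sketch tying these pieces together on a dummy tensor (shapes follow the paper's d_model = 512 and d_ff = 2048):

# Minimal sketch: AddAndNorm wraps a sub-layer (here FeedForward) with pre-norm, dropout and a skip connection.
feed_forward = FeedForward(d_model=512, d_ff=2048, dropout_rate=0.1).to(device)
add_and_norm = AddAndNorm(dropout_rate=0.1).to(device)

dummy_input = torch.rand(2, 550, 512).to(device)
print(add_and_norm(dummy_input, feed_forward).shape)   # (2, 550, 512): input + dropout(feed_forward(layer_norm(input)))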

Step 7: Encoder block and Encoder

Encoder Block: There are two main components inside the encoder block: Multi-Head Attention and Feedforward. There are also 2 units of Add & Norm. We'll first assemble all these components in the EncoderBlock class as per the flow in the attention paper. As per the paper, this encoder block is repeated 6 times.

Encoder: We'll then create an additional class called Encoder, which takes the list of EncoderBlocks, stacks them, and gives the final encoder output.

class EncoderBlock(nn.Module):
    def __init__(self, multihead_attention: MultiHeadAttention, feed_forward: FeedForward, dropout_rate: float):
        super().__init__()
        self.multihead_attention = multihead_attention
        self.feed_forward = feed_forward
        self.add_and_norm_list = nn.ModuleList([AddAndNorm(dropout_rate) for _ in range(2)])

    def forward(self, encoder_input, encoder_mask):
        # First AddAndNorm unit takes the encoder input from the skip connection and adds it to the output of the multi-head attention block.
        encoder_input = self.add_and_norm_list[0](encoder_input, lambda encoder_input: self.multihead_attention(encoder_input, encoder_input, encoder_input, encoder_mask))

        # Second AddAndNorm unit takes the output of the multi-head attention block from the skip connection and adds it to the output of the feedforward layer.
        encoder_input = self.add_and_norm_list[1](encoder_input, self.feed_forward)

        return encoder_input

class Encoder(nn.Module):
    def __init__(self, encoderblocklist: nn.ModuleList):
        super().__init__()

        # The Encoder class is initialized by taking the encoder block list.
        self.encoderblocklist = encoderblocklist
        self.layer_norm = LayerNorm()

    def forward(self, encoder_input, encoder_mask):
        # Loop through all the encoder blocks - 6 times.
        for encoderblock in self.encoderblocklist:
            encoder_input = encoderblock(encoder_input, encoder_mask)

        # Normalize the final encoder block output and return it. This encoder output will be used later on as key and value for the cross attention in the decoder block.
        encoder_output = self.layer_norm(encoder_input)
        return encoder_output

Step 8: Decoder block, Decoder and Projection Layer

Decoder Block: There are three main components in the decoder block: Masked Multi-Head Attention, Multi-Head Attention, and Feedforward. The decoder block also has 3 units of Add & Norm. We'll assemble all these components in the DecoderBlock class as per the flow in the attention paper. As per the paper, this decoder block is repeated 6 times.

Decoder: We'll create an additional class called Decoder, which takes a list of DecoderBlocks, stacks them, and gives the final decoder output.

There are two types of multi-head attention in the decoder block. The first is masked multi-head attention. It takes the decoder input as query, key, and value along with a decoder mask (also known as a causal mask). The causal mask prevents the model from looking at tokens that are ahead in the sequence order. The second is cross-attention, which takes the decoder representation as the query and the encoder output as the key and value. Detailed explanations of how the masking works are provided in steps 3 and 5.

Projection Layer: The final decoder output is passed into the projection layer. In this layer, the decoder output is first fed into a linear layer, where the shape of the embedding changes as shown in the code section below. Subsequently, the (log) softmax function converts the decoder output into a probability distribution over the vocabulary, and the token with the highest probability is selected as the prediction output.

class DecoderBlock(nn.Module):
    def __init__(self, masked_multihead_attention: MultiHeadAttention, multihead_attention: MultiHeadAttention, feed_forward: FeedForward, dropout_rate: float):
        super().__init__()
        self.masked_multihead_attention = masked_multihead_attention
        self.multihead_attention = multihead_attention
        self.feed_forward = feed_forward
        self.add_and_norm_list = nn.ModuleList([AddAndNorm(dropout_rate) for _ in range(3)])

    def forward(self, decoder_input, decoder_mask, encoder_output, encoder_mask):
        # First AddAndNorm unit takes the decoder input from the skip connection and adds it to the output of the masked multi-head attention block.
        decoder_input = self.add_and_norm_list[0](decoder_input, lambda decoder_input: self.masked_multihead_attention(decoder_input, decoder_input, decoder_input, decoder_mask))
        # Second AddAndNorm unit takes the output of the masked multi-head attention block from the skip connection and adds it to the output of the multi-head attention block.
        decoder_input = self.add_and_norm_list[1](decoder_input, lambda decoder_input: self.multihead_attention(decoder_input, encoder_output, encoder_output, encoder_mask)) # cross attention
        # Third AddAndNorm unit takes the output of the multi-head attention block from the skip connection and adds it to the output of the feedforward layer.
        decoder_input = self.add_and_norm_list[2](decoder_input, self.feed_forward)
        return decoder_input

class Decoder(nn.Module):
    def __init__(self, decoderblocklist: nn.ModuleList):
        super().__init__()
        self.decoderblocklist = decoderblocklist
        self.layer_norm = LayerNorm()

    def forward(self, decoder_input, decoder_mask, encoder_output, encoder_mask):
        for decoderblock in self.decoderblocklist:
            decoder_input = decoderblock(decoder_input, decoder_mask, encoder_output, encoder_mask)

        decoder_output = self.layer_norm(decoder_input)
        return decoder_output

class ProjectionLayer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.projection_layer = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_output):
        # The projection layer first takes in the decoder output and passes it into a linear layer of shape (d_model, vocab_size).
        # Change in shape: decoder_output(batch_size, seq_len, d_model) @ linear_layer(d_model, vocab_size) => output(batch_size, seq_len, vocab_size)
        output = self.projection_layer(decoder_output)

        # log softmax function to output the (log) probability distribution over the vocabulary.
        return torch.log_softmax(output, dim=-1)
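A minimal sketch of the projection step on a dummy decoder output (the vocabulary size is taken from the target tokenizer):

# Minimal sketch: project the decoder output to log-probabilities over the target vocabulary.
projection = ProjectionLayer(vocab_size=target_vocab_size, d_model=512).to(device)
dummy_decoder_output = torch.rand(2, 550, 512).to(device)
print(projection(dummy_decoder_output).shape)   # (2, 550, target_vocab_size)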

Step 9: Create and build a Transformer

Finally, we’ve completed building all the component blocks in the transformer architecture. The only pending task is to assemble it all together.

First, we create a Transformer class that initializes all the instances of the component classes. Inside the Transformer class, we'll first define an encode function that does all the tasks in the encoder part of the transformer and generates the encoder output.

Second, we define a decode function that does all the tasks in the decoder part of the transformer and generates the decoder output.

Third, we define a project function, which takes in the decoder output and maps it to the vocabulary for prediction.

Now the transformer architecture is ready. We can build our translation LLM model by defining a function that takes in all the necessary parameters, as given in the code below.

class Transformer(nn.Module):
    def __init__(self, source_embed: EmbeddingLayer, target_embed: EmbeddingLayer, positional_encoding: PositionalEncoding, multihead_attention: MultiHeadAttention, masked_multihead_attention: MultiHeadAttention, feed_forward: FeedForward, encoder: Encoder, decoder: Decoder, projection_layer: ProjectionLayer, dropout_rate: float):
        super().__init__()

        # Initialize instances of all the component classes of the transformer architecture.
        self.source_embed = source_embed
        self.target_embed = target_embed
        self.positional_encoding = positional_encoding
        self.multihead_attention = multihead_attention
        self.masked_multihead_attention = masked_multihead_attention
        self.feed_forward = feed_forward
        self.encoder = encoder
        self.decoder = decoder
        self.projection_layer = projection_layer
        self.dropout = nn.Dropout(dropout_rate)

    # The encode function takes in the encoder input, does the necessary processing inside all encoder blocks and gives the encoder output.
    def encode(self, encoder_input, encoder_mask):
        encoder_input = self.source_embed(encoder_input)
        encoder_input = self.positional_encoding(encoder_input)
        encoder_output = self.encoder(encoder_input, encoder_mask)
        return encoder_output

    # The decode function takes in the decoder input, does the necessary processing inside all decoder blocks and gives the decoder output.
    def decode(self, decoder_input, decoder_mask, encoder_output, encoder_mask):
        decoder_input = self.target_embed(decoder_input)
        decoder_input = self.positional_encoding(decoder_input)
        decoder_output = self.decoder(decoder_input, decoder_mask, encoder_output, encoder_mask)
        return decoder_output

    # The project function takes the decoder output into its projection layer and maps the output to the vocabulary for prediction.
    def project(self, decoder_output):
        return self.projection_layer(decoder_output)

def build_model(source_vocab_size, target_vocab_size, max_seq_len=1135, d_model=512, d_ff=2048, num_heads=8, num_blocks=6, dropout_rate=0.1):

    # Define and assign all the parameter values needed for the transformer architecture.
    source_embed = EmbeddingLayer(source_vocab_size, d_model)
    target_embed = EmbeddingLayer(target_vocab_size, d_model)
    positional_encoding = PositionalEncoding(max_seq_len, d_model, dropout_rate)
    multihead_attention = MultiHeadAttention(d_model, num_heads, dropout_rate)
    masked_multihead_attention = MultiHeadAttention(d_model, num_heads, dropout_rate)
    feed_forward = FeedForward(d_model, d_ff, dropout_rate)
    projection_layer = ProjectionLayer(target_vocab_size, d_model)
    encoder_block = EncoderBlock(multihead_attention, feed_forward, dropout_rate)
    decoder_block = DecoderBlock(masked_multihead_attention, multihead_attention, feed_forward, dropout_rate)

    encoderblocklist = []
    decoderblocklist = []

    # Note: appending the same block instance 6 times ties (shares) its weights across all 6 layers.
    # Create a fresh EncoderBlock/DecoderBlock per iteration if you want independent weights per layer, as in the paper.
    for _ in range(num_blocks):
        encoderblocklist.append(encoder_block)

    for _ in range(num_blocks):
        decoderblocklist.append(decoder_block)

    encoderblocklist = nn.ModuleList(encoderblocklist)
    decoderblocklist = nn.ModuleList(decoderblocklist)

    encoder = Encoder(encoderblocklist)
    decoder = Decoder(decoderblocklist)

    # Instantiate the Transformer class by providing all the parameter values.
    model = Transformer(source_embed, target_embed, positional_encoding, multihead_attention, masked_multihead_attention, feed_forward, encoder, decoder, projection_layer, dropout_rate)

    for param in model.parameters():
        if param.dim() > 1:
            nn.init.xavier_uniform_(param)

    return model

# Finally, call build_model and assign it to the model variable.
# This model is now fully ready to train and validate on our dataset.
# After training and validation, we can perform new translation tasks using this very model.

model = build_model(source_vocab_size, target_vocab_size)
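As a quick sanity check (a minimal sketch; the exact number depends on the vocabulary sizes your tokenizers produce), we can move the model to the chosen device and count its trainable parameters:

# Minimal sketch: move the model to the selected device and count trainable parameters.
model = model.to(device)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {total_params:,}")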

Step 10: Training and validation of our built LLM model

It is now time to train our model. The training process is pretty straightforward. We are going to use the training DataLoader that we created in step 3. As the training dataset has 1 million pairs, I would highly recommend training the model on a GPU device. It took me around 5 hours to complete 20 epochs. After each epoch, we save the model weights along with the optimizer state so that it is easier to resume training from the point where it stopped rather than from the start.

After every epoch, we initiate validation using the validation DataLoader. The size of the validation dataset is 2,000, which is pretty reasonable. During validation, we only need to compute the encoder output once per source sentence; the decoder then reuses it at every step until it produces the end-of-sentence token [SEP], because sending the same source sentence through the encoder again and again doesn't make sense.

The decoder input starts with the start-of-sentence token [CLS]. After each prediction, the decoder input appends the newly generated token, until the end-of-sentence token [SEP] is reached. Finally, the projection layer maps the decoder output to a probability distribution over the vocabulary, and the tokenizer decodes the predicted token IDs back into text.

def training_model(preload_epoch=None):

    # The entire training and validation cycle will run for 20 epochs.
    EPOCHS = 20
    initial_epoch = 0
    global_step = 0

    # Adam is one of the most commonly used optimization algorithms; it holds the current state and updates the parameters based on the computed gradients.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # If preload_epoch is not None, training will resume with the model weights and optimizer state that were last saved. The new epoch number will be preload_epoch + 1.
    if preload_epoch is not None:
        model_filename = f"./malaygpt/model_{preload_epoch}.pt"
        state = torch.load(model_filename)
        initial_epoch = state['epoch'] + 1
        model.load_state_dict(state['model_state_dict'])
        optimizer.load_state_dict(state['optimizer_state_dict'])
        global_step = state['global_step']

    # The CrossEntropyLoss function computes the difference between the projection output and the target label.
    # The target labels come from the target (Malay) tokenizer, so its [PAD] id is ignored in the loss.
    loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer_my.token_to_id("[PAD]"), label_smoothing=0.1).to(device)

    for epoch in range(initial_epoch, EPOCHS):

        # ::: Start of Training block :::
        model.train()

        # Training with the training dataloader prepared in step 3.
        for batch in tqdm(train_dataloader):
            encoder_input = batch['encoder_input'].to(device)   # (batch_size, seq_len)
            decoder_input = batch['decoder_input'].to(device)   # (batch_size, seq_len)
            target_label = batch['target_label'].to(device)     # (batch_size, seq_len)
            encoder_mask = batch['encoder_mask'].to(device)
            decoder_mask = batch['decoder_mask'].to(device)

            encoder_output = model.encode(encoder_input, encoder_mask)
            decoder_output = model.decode(decoder_input, decoder_mask, encoder_output, encoder_mask)
            projection_output = model.project(decoder_output)

            # projection_output(batch_size, seq_len, vocab_size)
            loss = loss_fn(projection_output.view(-1, projection_output.shape[-1]), target_label.view(-1))

            # backward pass
            optimizer.zero_grad()
            loss.backward()

            # update weights
            optimizer.step()
            global_step += 1

        print(f'Epoch [{epoch+1}/{EPOCHS}]: Train Loss: {loss.item():.2f}')

        # Save the state of the model after every epoch.
        model_filename = f"./malaygpt/model_{epoch}.pt"
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'global_step': global_step
        }, model_filename)
        # ::: End of Training block :::

        # ::: Start of Validation block :::
        model.eval()
        with torch.inference_mode():
            for batch in tqdm(val_dataloader):
                encoder_input = batch['encoder_input'].to(device)   # (batch_size, seq_len)
                encoder_mask = batch['encoder_mask'].to(device)
                source_text = batch['source_text']
                target_text = batch['target_text']

                # Computing the output of the encoder for the source sequence.
                encoder_output = model.encode(encoder_input, encoder_mask)

                # For the prediction task, the first token that goes into the decoder input is the [CLS] token.
                decoder_input = torch.empty(1, 1).fill_(tokenizer_my.token_to_id('[CLS]')).type_as(encoder_input).to(device)

                # Keep adding the output back to the input until the [SEP] end token is received.
                while True:
                    # Check if the max length is reached; if so, we stop.
                    if decoder_input.size(1) == max_seq_len:
                        break

                    # Recreate the mask each time a new output token is added to the decoder input for the next token prediction.
                    decoder_mask = causal_mask(decoder_input.size(1)).type_as(encoder_mask).to(device)

                    decoder_output = model.decode(decoder_input, decoder_mask, encoder_output, encoder_mask)

                    # Apply the projection only to the next token.
                    projection = model.project(decoder_output[:, -1])

                    # Select the token with the highest probability, which is called a greedy search implementation.
                    _, new_token = torch.max(projection, dim=1)
                    new_token = torch.empty(1, 1).type_as(encoder_input).fill_(new_token.item()).to(device)

                    # Add the new token back to the decoder input.
                    decoder_input = torch.cat([decoder_input, new_token], dim=1)

                    # Stop if the new token is the end-of-sentence token [SEP].
                    if new_token == tokenizer_my.token_to_id('[SEP]'):
                        break

                # The decoder output is the fully appended decoder input.
                decoder_output = decoder_input.squeeze(0)
                model_predicted_text = tokenizer_my.decode(decoder_output.detach().cpu().numpy())

                print(f'SOURCE TEXT: {source_text}')
                print(f'TARGET TEXT: {target_text}')
                print(f'PREDICTED TEXT: {model_predicted_text}')
        # ::: End of Validation block :::

# This function runs the training and validation for 20 epochs.
training_model(preload_epoch=None)

Step 11: Create a function to test a new translation task with our built model

We'll give our translation function a generic name: malaygpt. It takes the user's raw input text in English and returns the translated text in Malay. Let's run the function and give it a try.

def malaygpt(user_input_text):
    model.eval()
    with torch.inference_mode():
        user_input_text = user_input_text.strip()
        user_input_text_encoded = torch.tensor(tokenizer_en.encode(user_input_text).ids, dtype=torch.int64).to(device)

        num_source_padding = max_seq_len - len(user_input_text_encoded) - 2
        encoder_padding = torch.tensor([PAD_ID] * num_source_padding, dtype=torch.int64).to(device)

        # Build the encoder input and add a batch dimension (batch size of 1), just as the dataloader did during training.
        encoder_input = torch.cat([CLS_ID, user_input_text_encoded, SEP_ID, encoder_padding]).unsqueeze(0).to(device)
        encoder_mask = (encoder_input != PAD_ID).unsqueeze(0).unsqueeze(0).int().to(device)

        # Computing the output of the encoder for the source sequence.
        encoder_output = model.encode(encoder_input, encoder_mask)
        # For the prediction task, the first token that goes into the decoder input is the [CLS] token.
        decoder_input = torch.empty(1, 1).fill_(tokenizer_my.token_to_id('[CLS]')).type_as(encoder_input).to(device)

        # Keep adding the output back to the input until the [SEP] end token is received.
        while True:
            # Check if the max length is reached.
            if decoder_input.size(1) == max_seq_len:
                break
            # Recreate the mask each time a new output token is added to the decoder input for the next token prediction.
            decoder_mask = causal_mask(decoder_input.size(1)).type_as(encoder_mask).to(device)
            decoder_output = model.decode(decoder_input, decoder_mask, encoder_output, encoder_mask)

            # Apply the projection only to the next token.
            projection = model.project(decoder_output[:, -1])

            # Select the token with the highest probability (a greedy search implementation).
            _, new_token = torch.max(projection, dim=1)
            new_token = torch.empty(1, 1).type_as(encoder_input).fill_(new_token.item()).to(device)

            # Add the new token back to the decoder input.
            decoder_input = torch.cat([decoder_input, new_token], dim=1)

            # Check if the new token is the end-of-sentence token [SEP].
            if new_token == tokenizer_my.token_to_id('[SEP]'):
                break

        # The final decoder output is the concatenated decoder input up to the end token.
        decoder_output = decoder_input.squeeze(0)
        model_predicted_text = tokenizer_my.decode(decoder_output.detach().cpu().numpy())

        return model_predicted_text

Testing time! Let's do some translation testing.
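For example (the input sentence here is hypothetical, and your output will depend on how long you trained):

# Minimal sketch: translate a sample English sentence with the trained model.
print(malaygpt("How are you today?"))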

The translation seems to be working pretty well.

And this is it! I am very confident that you are now able to build your own Large Language Model from scratch using PyTorch. You can train this model on other language datasets as well and perform translation tasks in those languages. Now that you've learned how to build the original transformer from scratch, I can assure you that you're capable of learning and implementing your own applications on most of the LLMs available in the market today.

What is next? I'll be building a fully functional application by fine-tuning the Llama 3 model, one of the most popular open-source LLMs currently available. I'll be sharing the full source code as well.

So, stay tuned, and thanks a lot for reading!

Link to Google Colab notebook

References

  • Attention Is All You Need — paper by Ashish Vaswani, Noam Shazeer, et al.
  • Attention in transformers, visually explained — 3Blue1Brown, YouTube
  • Let's build GPT — Andrej Karpathy, YouTube
  • https://github.com/hkproj/pytorch-transformer — Umar Jamil


