Transformers Well Explained: Word Embeddings

Last Updated on March 3, 2024 by Editorial Team

Author(s): Ahmad Mustapha

Originally published on Towards AI.

This is part of a four-article series that explains transformers. Each article is associated with a hands-on notebook.

Photo by Mike Uderevsky on Unsplash

The authors of “Attention Is All You Need” (the research paper that introduced transformers) state in the paper: “Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model.”

What are word embeddings? Word embeddings are a way to represent textual data as dense real-valued vectors. Embedding, in general, is the process of creating a Euclidean space that represents a group of data mathematically.

Why Not Just Use Indexes

Well, if I have to represent a set of words with numbers, why not simply use indexes like this?

 [“Being”, “Strong”, “Is”, “All”, “What”, “Matters”] -> [1, 2, 3, 4, 5, 6]

Doing so would lose the meaning. I want to treat words like numbers without losing what they mean: when I subtract the representation of “Man” from that of “King” and add that of “Woman”, I want to end up near “Queen”. Is this possible? Yes, it is.

Meaning as Features

We can achieve the aforementioned behavior by mapping each word to a set of features. Consider the following table, where we have a set of words and a set of features. In it, we are representing each word with a set of “meaning” features. A “granny” is a weak old female.

Consider this:

You see, what we are doing is subtracting the manliness features from “grandpa”, which leads to “granny”. These feature vectors are embeddings: they represent words and the relations between them. The question is how to obtain them.
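The arithmetic described above can be sketched in a few lines of Python. The feature names and values below are invented for illustration (they are not the values from the article's table), and the helper functions are hypothetical.

```python
# Hypothetical "meaning" features, invented for illustration only.
# Each word is scored on three features: (male, old, weak).
words = {
    "man":     [1, 0, 0],
    "woman":   [0, 0, 0],
    "grandpa": [1, 1, 1],
    "granny":  [0, 1, 1],
}


def add(a, b):
    """Element-wise sum of two feature vectors."""
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    """Element-wise difference of two feature vectors."""
    return [x - y for x, y in zip(a, b)]


def nearest(vector, table):
    """Return the word whose features are closest (squared distance) to vector."""
    def squared_distance(word):
        return sum((x - y) ** 2 for x, y in zip(table[word], vector))
    return min(table, key=squared_distance)


# grandpa - man + woman: remove the "male" feature, keep "old" and "weak".
result = add(subtract(words["grandpa"], words["man"]), words["woman"])
closest = nearest(result, words)
print(result, closest)
```

Real embeddings are learned rather than hand-written, so their dimensions rarely correspond to named features this cleanly, but the vector arithmetic works the same way.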

Training an Embedding

To train an embedding, we train a neural network to solve a task that requires a semantic understanding of words. Along the way, the network learns a representation, an embedding, that holds semantic meaning.

For more clarity consider a dataset with the following entries:

It is composed of the trigrams of a given text. If we train a simple feedforward neural network to predict the third word from the previous two, the weights of the network's first layer will eventually become the transformation that gives us our embeddings.
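As a rough illustration of what such a dataset looks like, the sketch below splits a made-up English sentence into trigrams and prints each one as a (two context words, target word) pair. The sentence and the `make_trigrams` helper are assumptions for this example only.

```python
def make_trigrams(text):
    """Slide a window of three words over the text."""
    words = text.split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


# A made-up sentence, used only to show the shape of the data.
sentence = "the commander ordered the troops to advance"
trigrams = make_trigrams(sentence)

# The first two words are the network input, the third is the label.
for first, second, target in trigrams:
    print(f"({first}, {second}) -> {target}")
```

A text with fewer than three words yields no trigrams at all, which is why very short articles contribute nothing to training.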

Why do we get a semantic representation? Consider the trigram “The commander ordered”. For the network to predict “ordered”, it has to learn to associate “ordering” with “commander”. If we look closely at the embeddings, we might find that “king” is also associated with ordering, because “commander” and “king” occur in similar contexts. In that case, the embeddings of “king” and “commander” will end up being similar.

All transformers have an embedding layer. It is trained end-to-end alongside the layers on top of it. It is as if we are telling the network to learn the semantics of words in a way that lets it solve the translation task or the next-word prediction task. That is why embeddings differ from task to task. By now, word embeddings should be clearer. Keep reading if you want a hands-on understanding.
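One common way to quantify how “similar” two embeddings are is cosine similarity. The sketch below uses made-up three-dimensional vectors, not values from a trained model, purely to show the computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; real embeddings are learned.
toy_embeddings = {
    "king":      [0.9, 0.8, 0.1],
    "commander": [0.8, 0.9, 0.2],
    "banana":    [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["king"], toy_embeddings["commander"])
sim_far = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])
print(f"king vs commander: {sim_close:.3f}")
print(f"king vs banana:    {sim_far:.3f}")
```

Words that occur in similar contexts should score close to 1, while unrelated words should score noticeably lower.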

Training an Arabic Embedding

Arabic is widely spoken across the Middle East and North Africa. We will train an embedding for Arabic words. This hands-on tutorial assumes knowledge of machine learning and PyTorch. The full code is available in the “Generate Word Embeddings.ipynb” notebook.

To train an embedding, we scraped a famous Arabic magazine and collected 1000 articles. The articles are found in the “alaraby1k.json” file. The task we will use to train the embedding is to predict a third word given the two previous words. It is framed as a straightforward classification task: the model should predict the id of the third word. Assuming we have approximately 165,000 words, we have 165,000 classes in the output layer. Below is a snippet of the dataset.

'author': 'بواسطة نعمان صارب',
'section': 'منتدى العربي',
'issue': '410',
'text': ' تعرفني والسيف والرمح والقرطاس والقلم إلا أن بيتا واحدا أو حتى '

First of all, we want to create the dataset. We will extend the PyTorch “Dataset” class to encapsulate the data. During the initialization of the dataset, we read the JSON file, split the data into train and test sets, and generate the trigram tokens.

Every transformer training process needs a vocab. The vocab is the set of unique words that our data is composed of. During the dataset initialization, we find all unique words and store them in a set. We then define two dictionaries: one maps a given word to its index, and the other maps a given index back to its word. We need the dictionaries because we read words but feed the network indices, and because the network outputs indices while we need words.

Once the network is trained, we will run some tests, and some words in the test set will not be in the model's vocabulary at all. We need to map every such word to a common index and a common placeholder word. For this, we add the word “<UNKOWN>” to the vocab at index zero, and use it to replace every word that is not in the vocab. The code will end up being like this:

import json
from collections import defaultdict

# train_test_split is assumed to be scikit-learn's helper
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset


class MyDataset(Dataset):

    def __init__(self, alaraby_filepath, is_train):
        # Read the article texts from the scraped JSON file
        with open(alaraby_filepath, "r", encoding="utf-8") as f:
            self.raw_data = [article["text"] for article in json.load(f)]
        self.train_raw_data, self.test_raw_data = train_test_split(self.raw_data, test_size=0.1, random_state=42)
        self.train_trigrams = self.__generate_trigrams__(self.train_raw_data)
        self.test_trigrams = self.__generate_trigrams__(self.test_raw_data)
        # The vocab is built from the training articles only
        self.vocab, self.id_to_word, self.word_to_id = self.__compute_vocab__(self.train_raw_data)
        self.is_train = is_train

    def __generate_trigrams__(self, texts):
        trigrams = []
        for text in texts:
            words = text.split()
            article_trigrams = [words[i:i+3] for i in range(len(words)-2)]
            trigrams += article_trigrams
        return trigrams

    def __compute_vocab__(self, texts):
        # Get unique words
        words = set()
        for text in texts:
            words.update(text.split())
        # Index 0 is reserved for words that are not in the vocab
        words_list = ["<UNKOWN>"] + sorted(words)
        id_to_word = defaultdict(lambda: "<UNKOWN>", {idx: value for idx, value in enumerate(words_list)})
        word_to_id = defaultdict(lambda: 0, {value: idx for idx, value in enumerate(words_list)})
        # Return the full list so the vocab size also counts "<UNKOWN>"
        return words_list, id_to_word, word_to_id

    def __len__(self):
        return len(self.train_trigrams) if self.is_train else len(self.test_trigrams)

    def __getitem__(self, idx):
        trigrams = self.train_trigrams if self.is_train else self.test_trigrams
        # Convert the three words into their vocab indices
        trigram = [self.word_to_id[word] for word in trigrams[idx]]
        return tuple(trigram)

    def get_word_from_id(self, idx):
        return self.id_to_word[idx]

    def get_word_id(self, word):
        return self.word_to_id[word]

    def get_vocab_size(self):
        return len(self.vocab)
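To see the unknown-word fallback in isolation, here is a small standalone sketch of the two dictionaries described above. The `build_vocab` helper and the two English sentences are assumptions made for this example; the placeholder token is spelled as in the article's code.

```python
from collections import defaultdict

UNKNOWN = "<UNKOWN>"  # spelled exactly as in the article's code


def build_vocab(texts):
    """Collect unique words and build word<->index maps with a fallback."""
    words = set()
    for text in texts:
        words.update(text.split())
    # Index 0 is reserved for the unknown-word placeholder.
    words_list = [UNKNOWN] + sorted(words)
    word_to_id = defaultdict(lambda: 0, {w: i for i, w in enumerate(words_list)})
    id_to_word = defaultdict(lambda: UNKNOWN, {i: w for i, w in enumerate(words_list)})
    return words_list, word_to_id, id_to_word


# Made-up "training" texts, used only to demonstrate the lookup.
vocab, word_to_id, id_to_word = build_vocab(["the king ordered", "the commander ordered"])

known_id = word_to_id["king"]    # a word seen during vocab construction
unseen_id = word_to_id["queen"]  # never seen, so it falls back to index 0
print(known_id, unseen_id, id_to_word[unseen_id])
```

Sorting the words before indexing is a design choice of this sketch: it makes the word-to-index mapping reproducible across runs, which matters if a trained model is saved and reloaded later.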

To build the neural network, we will use the “nn.Embedding” layer from PyTorch. The layer is comparable to a linear layer, except that it accepts indices rather than one-hot encoded vectors. Those indices are the vocab indices of words. The layer takes two parameters: the vocabulary size and the embedding size. The latter is the length of the vector used to represent each word. The larger it is, the more degrees of freedom the network has to capture semantic relations, so the embedding should be relatively large.
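The comparison with a linear layer can be made concrete without PyTorch. The plain-Python sketch below, using a made-up weight matrix, shows that looking up row `word_id` gives the same vector as multiplying a one-hot vector by the weight matrix (a linear layer with no bias term).

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    columns = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(columns)
    ]


# Toy weights invented for illustration: vocabulary of 4 words, embedding size 3.
weights = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

word_id = 2
via_lookup = weights[word_id]  # what an embedding layer does
via_one_hot = vector_times_matrix(one_hot(word_id, len(weights)), weights)
print(via_lookup, via_one_hot)
```

The lookup avoids building a vector as long as the vocabulary for every word, which is the practical reason to prefer an embedding layer when the vocabulary has over a hundred thousand entries.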

import torch
from torch import nn


class NextWordPredictor(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # One embedding table shared by both input words
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Maps the two concatenated embeddings to a score per vocab word
        self.linear = nn.Linear(embedding_dim*2, vocab_size)

    def forward(self, x1, x2):
        embedded1 = self.embedding(x1)
        embedded2 = self.embedding(x2)
        concatenated = torch.cat((embedded1, embedded2), dim=1)
        output = self.linear(concatenated)
        return output

Finally, we train the network:

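import torch
from torch import nn
from torch.optim import SGD
from torch.utils.data import DataLoader

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Train and test views over the same file (the local path is an assumption)
train_dataset = MyDataset("alaraby1k.json", is_train=True)
test_dataset = MyDataset("alaraby1k.json", is_train=False)

batch_size = 1500
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

vocab_size = 165_000  # approximate vocab size; must exceed the largest word id
embedding_dim = 1024
model = NextWordPredictor(vocab_size, embedding_dim).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)

# Training loop
epochs = 100
for epoch in range(epochs):
    for i, batch in enumerate(train_dataloader):
        x1, x2, target = batch
        # Move the batch to the same device as the model
        x1 = x1.to(device)
        x2 = x2.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = model(x1, x2)
        loss = criterion(output, target)
        # Backpropagate and update the weights
        loss.backward()
        optimizer.step()
        if i % 10 == 0:
            print(f"Epoch {epoch} Batch {i}, Loss: {loss.item()}")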
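To help read the loss values printed during training, the sketch below computes cross-entropy for a single example in plain Python: the negative log of the softmax probability assigned to the correct class. The logits are made up, and the `cross_entropy` helper is a hypothetical stand-in for the per-example quantity that the loss function averages over a batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[target]


# Made-up scores for a 4-word vocabulary; class 0 has the highest score.
logits = [2.0, 0.5, -1.0, 0.0]
loss_correct = cross_entropy(logits, 0)  # target is the top-scored class
loss_wrong = cross_entropy(logits, 2)    # target is the lowest-scored class

# With equal scores the loss equals log(number of classes).
loss_uniform = cross_entropy([0.0] * 4, 1)
print(loss_correct, loss_wrong, loss_uniform)
```

By the same reasoning, a model whose initial predictions are roughly uniform over 165,000 classes should start with a loss of roughly 12, so values well below that suggest the network is learning something.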


Published via Towards AI
