Transformers Well Explained: Word Embeddings
Author(s): Ahmad Mustapha
Originally published on Towards AI.
This is part of a four-article series that explains transformers. Each article is associated with a hands-on notebook.
The authors of "Attention Is All You Need" (the research paper that introduced transformers) state in the paper: "Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model."
What are word embeddings? Word embeddings are a way to represent textual data as condensed real-valued vectors. Embedding, in general, is the process of creating a Euclidean space that mathematically represents a group of data.
Why Not Just Use Indexes?
Well, if I have to represent a set of words with numbers, why not simply use indexes like this?
["Being", "Strong", "Is", "All", "What", "Matters"] -> [1, 2, 3, 4, 5, 6]
Well, with plain indexes we lose meaning. I want to deal with words like numbers, but without losing meaning. I want, for example, to subtract the number representation of "Man" from the one of "King", add the one of "Woman", and get something close to "Queen". Is this possible? Yes, it is.
Meaning as Features
We can achieve the aforementioned behavior by mapping each word to a set of features. Consider the following table, where we have a set of words and a set of features. In it, we represent each word with a set of "meaning" features: a "granny", for example, is a weak, old female.
Consider this: subtracting the manliness features from "grandpa" leads us to "granny". Those feature vectors are embeddings. They represent words and the relations between them. The question is how to get those embeddings.
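To make the idea concrete before answering that, here is a minimal sketch with hand-crafted feature values; the words and the three features below are invented purely for illustration, not learned from any data.

import numpy as np

# Hand-crafted "meaning" features: [maleness, age, royalty].
# All values are made up for illustration only.
words = {
    "man":     np.array([ 1.0, 0.3, 0.0]),
    "woman":   np.array([-1.0, 0.3, 0.0]),
    "king":    np.array([ 1.0, 0.5, 1.0]),
    "queen":   np.array([-1.0, 0.5, 1.0]),
    "grandpa": np.array([ 1.0, 0.9, 0.0]),
    "granny":  np.array([-1.0, 0.9, 0.0]),
}

def closest_to(vector):
    # Return the word whose feature vector is nearest to the given vector.
    return min(words, key=lambda w: np.linalg.norm(words[w] - vector))

print(closest_to(words["king"] - words["man"] + words["woman"]))     # queen
print(closest_to(words["grandpa"] - words["man"] + words["woman"]))  # granny

With vectors like these, arithmetic on words behaves the way we wanted; the whole point of embeddings is to learn such feature values automatically instead of writing them by hand.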
Training an Embedding
To train an embedding, we train a neural network to solve a task that requires a semantic understanding of words; along the way, the network develops a representation, an embedding, that holds semantic meaning.
For more clarity, consider a dataset whose entries are trigrams of a given text, like the ones sketched below. If we train a simple feedforward neural network to predict the third word of each trigram from the two previous words, the weights of the network's first layer, the embedding layer, will eventually be the transformation function that gives us our embeddings.
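As a rough sketch of what such trigram entries look like (the sentence here is made up for illustration):

text = "the commander ordered the troops to advance"
words = text.split()

# Each entry pairs two context words with the word that follows them.
trigrams = [(words[i], words[i + 1], words[i + 2]) for i in range(len(words) - 2)]
print(trigrams[:3])
# [('the', 'commander', 'ordered'), ('commander', 'ordered', 'the'), ('ordered', 'the', 'troops')]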
Why do we get a semantic representation? Consider the phrase "The commander ordered". For the network to predict "ordered", it has to learn to correlate ordering with "commander". If we look closely at the embeddings, we might also find that "king" is correlated with ordering, because both "commander" and "king" occur in similar contexts. In that case, the embeddings of "king" and "commander" will end up somewhat similar.
All transformers have an embedding layer. It is trained end-to-end alongside the other layers on top of it. It is as if we are telling the network: learn the semantics of words in whatever way lets you solve the translation task, or the next-word prediction task. That is why embeddings differ from task to task. I hope this makes word embeddings clearer; keep reading if you want a hands-on understanding.
Training an Arabic Embedding
Arabic is a language widely spoken across the Middle East and North Africa. We will train an embedding for Arabic words. This hands-on tutorial assumes knowledge of machine learning and PyTorch. The full code is available in the "Generate Word Embeddings.ipynb" notebook.
To train an embedding, we scraped a famous Arabic magazine and collected 1,000 articles. The articles are found in the "alaraby1k.json" file. The task we will use to train the embedding is to predict a third word given the two previous words. The task is framed as a straightforward classification task: the model should predict the id of the third word. Assuming we have approximately 165,000 unique words, we then have 165,000 classes in the output layer. Below is a snippet of the dataset.
{
  "author": "<Arabic author name>",
  "section": "<Arabic section name>",
  "issue": "410",
  "text": "<Arabic article text>"
}
First of all, we want to create the dataset. We will extend the PyTorch "Dataset" class to encapsulate the data. During the initialization of the dataset, we read the JSON file, split the data into train and test sets, and generate the trigram tokens.
In every transformer training process, we need a vocab. The vocab is the set of unique words the data is composed of. During dataset initialization, we find all the unique words, store them in a set, and then define two dictionaries: one maps a given word to its index, and the other maps a given index back to its word. We need these dictionaries because we read words but feed the network indices, and the network outputs indices while we need words back.
The network will eventually be trained, and when we run some tests, it will happen that a word in the test set is not even in the model's vocabulary. We need to map any such word to a common index reserved for unknown words. For this, we add the word "<UNKNOWN>" with index zero to the vocab and use it to replace every word that is not in the vocab. The code ends up looking like this:
import json
from collections import defaultdict

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, alaraby_filepath, is_train):
        # Load the raw article texts from the JSON dump.
        self.raw_data = [article["text"] for article in json.load(open(alaraby_filepath, "r"))]
        self.train_raw_data, self.test_raw_data = train_test_split(self.raw_data, test_size=0.1, random_state=42)
        self.train_trigrams = self.__generate_trigrams__(self.train_raw_data)
        self.test_trigrams = self.__generate_trigrams__(self.test_raw_data)
        self.vocab, self.id_to_word, self.word_to_id = self.__compute_vocab__(self.train_raw_data)
        self.is_train = is_train

    def __generate_trigrams__(self, texts):
        # Slide a window of three consecutive words over every article.
        trigrams = []
        for text in texts:
            words = text.split()
            article_trigrams = [words[i:i + 3] for i in range(len(words) - 2)]
            trigrams += article_trigrams
        return trigrams

    def __compute_vocab__(self, texts):
        # Collect the unique words of the training split.
        words = set()
        for text in texts:
            words.update(set(text.split()))
        # Reserve index 0 for out-of-vocabulary words.
        words_list = ["<UNKNOWN>"] + list(words)
        id_to_word = defaultdict(lambda: "<UNKNOWN>", {idx: value for idx, value in enumerate(words_list)})
        word_to_id = defaultdict(lambda: 0, {value: idx for idx, value in enumerate(words_list)})
        return words_list, id_to_word, word_to_id

    def __len__(self):
        return len(self.train_trigrams) if self.is_train else len(self.test_trigrams)

    def __getitem__(self, idx):
        # Return the trigram as a tuple of word ids: (first, second, target).
        trigrams = self.train_trigrams if self.is_train else self.test_trigrams
        trigram = [self.word_to_id[word] for word in trigrams[idx]]
        return tuple(trigram)

    def get_word_from_id(self, idx):
        return self.id_to_word[idx]

    def get_word_id(self, word):
        return self.word_to_id[word]

    def get_vocab_size(self):
        return len(self.vocab)
To build the neural network, we will use the "nn.Embedding" layer from PyTorch. The layer is comparable to a linear layer, except that it accepts indices rather than one-hot encoded vectors; those indices are the vocab indices of the words. The layer takes two parameters: the vocabulary size and the embedding size. The latter is the size of the vectors we use to represent words. The larger it is, the more semantic structure the network can learn, so it is important for the embedding to be reasonably large and give the network enough degrees of freedom.
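To see what "accepts indices rather than one-hot encoded vectors" means in practice, here is a small sketch with toy sizes (it is not part of the article's model): looking up rows by index gives exactly the same result as multiplying one-hot vectors by the layer's weight matrix.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

indices = torch.tensor([3, 7])                 # word ids
via_lookup = embedding(indices)                # direct index lookup

one_hot = F.one_hot(indices, vocab_size).float()
via_matmul = one_hot @ embedding.weight        # linear-layer view of the same operation

print(torch.allclose(via_lookup, via_matmul))  # True

With that in mind, here is the next-word prediction model: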
import torch
import torch.nn as nn


class NextWordPredictor(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(NextWordPredictor, self).__init__()
        # One embedding vector of size embedding_dim per vocabulary id.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # The two context embeddings are concatenated, hence embedding_dim * 2.
        self.linear = nn.Linear(embedding_dim * 2, vocab_size)

    def forward(self, x1, x2):
        # x1 and x2 hold the ids of the two context words.
        embedded1 = self.embedding(x1)
        embedded2 = self.embedding(x2)
        concatenated = torch.cat((embedded1, embedded2), dim=1)
        output = self.linear(concatenated)  # logits over the whole vocabulary
        return output
Finally, we train the network:
from torch.optim import SGD
from torch.utils.data import DataLoader

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Build the train and test datasets from the scraped articles.
train_dataset = MyDataset("alaraby1k.json", is_train=True)
test_dataset = MyDataset("alaraby1k.json", is_train=False)

batch_size = 1500
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

vocab_size = train_dataset.get_vocab_size()  # approx. 165,000 words, one output class each
embedding_dim = 1024

model = NextWordPredictor(vocab_size, embedding_dim)
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)

# Training loop
epochs = 100
for epoch in range(epochs):
    for i, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        x1, x2, target = batch
        x1 = x1.to(device)
        x2 = x2.to(device)
        target = target.to(device)
        output = model(x1, x2)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if i % 10 == 0:
            print(f"Epoch {epoch} Batch {i}, Loss: {loss.item()}")