
Transformers Well Explained: Word Embeddings

Last Updated on March 3, 2024 by Editorial Team

Author(s): Ahmad Mustapha

Originally published on Towards AI.

This is part of a four-article series that explains transformers. Each article is associated with a hands-on notebook.

Photo by Mike Uderevsky on Unsplash

The authors of β€œAttention is All You Need” (The research paper that introduced transformers) stated at the beginning of the paper β€œSimilarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model.”

What are word embeddings? Word embeddings are a way to represent textual data as condensed real-valued vectors. Embedding, in general, is the process of creating a Euclidean space that represents a group of data mathematically.

Why Not Just Use Indexes?

Well, if I have to represent a set of words with numbers, why not simply use indexes like this?

 [β€œBeing”, β€œStrong”, β€œIs”, β€œAll”, β€œWhat”, β€œMatters”] -> [1, 2, 3, 4, 5, 6]

Well, we would lose meaning this way. I want to deal with words as numbers, but without losing meaning. I want to be able to subtract the representation of “Man” from that of “King”, add that of “Woman”, and get something close to “Queen”. Is this possible? Yes, it is.

Meaning as Features

We can achieve the behavior above by mapping each word to a set of features. Consider the following table, where we have a set of words and a set of features: each word is represented by a set of “meaning” features. A “granny”, for example, is a weak, old female.

Consider this:

You see, what we are doing is subtracting the “manliness” features from “grandpa”, which leads us to “granny”. Those feature vectors are the embeddings. They represent words and the relations between them. The question is how to obtain those embeddings.
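The table and the subtraction above appear as images in the original post; here is a minimal sketch of the same idea, with made-up feature values (the words, features, and numbers are illustrative only):

import torch

# Hypothetical "meaning" features: [age, femininity, strength]
embeddings = {
    "grandpa": torch.tensor([0.9, 0.1, 0.2]),
    "granny":  torch.tensor([0.9, 0.9, 0.2]),
    "man":     torch.tensor([0.3, 0.1, 0.8]),
    "woman":   torch.tensor([0.3, 0.9, 0.8]),
}

# Removing the "manliness" direction from "grandpa" (and adding the feminine
# one) lands exactly on "granny" in this toy feature space.
result = embeddings["grandpa"] - embeddings["man"] + embeddings["woman"]
print(result)                                        # tensor([0.9000, 0.9000, 0.2000])
print(torch.allclose(result, embeddings["granny"]))  # True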

Training an Embedding

To train an embedding, we train a neural network on a task that requires a semantic understanding of words; the network will eventually learn a representation, an embedding, that captures semantic meaning.

For more clarity, consider a dataset with the following entries:
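The entries appear as a figure in the original post; as an illustration, the trigrams extracted from a made-up English sentence would look like this:

sentence = "The commander ordered the attack"
words = sentence.split()
trigrams = [words[i:i + 3] for i in range(len(words) - 2)]
# [['The', 'commander', 'ordered'],
#  ['commander', 'ordered', 'the'],
#  ['ordered', 'the', 'attack']]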

It is composed of trigrams of a given text. If we train a simple feedforward neural network to predict the third word from the previous two, the weights of its first layer, the embedding layer, will eventually be the transformation that gives us our embeddings.

Why do we get a semantic representation? Consider the phrase “The commander ordered”. For the network to predict “ordered”, it has to learn to correlate “ordering” with “commander”. If we look closely at the embedding, we might also find that “king” is correlated with ordering, because both “commander” and “king” occur in similar contexts. In that case, the embeddings of “king” and “commander” will end up somewhat similar.

All transformers have an embedding layer. It is trained end-to-end alongside the layers on top of it. It is as if we are telling the network to learn the semantics of words in whatever way lets it solve the translation task or the next-word prediction task. That is why embeddings differ from task to task. I hope this makes word embeddings clearer; keep reading if you want a hands-on understanding.
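You can see this directly in pretrained models. For instance, with the Hugging Face transformers library (not otherwise used in this tutorial, shown only as a quick illustration), the input embedding layer of a pretrained transformer is an ordinary nn.Embedding:

from transformers import AutoModel

# Load a pretrained transformer and inspect its input embedding layer.
model = AutoModel.from_pretrained("bert-base-uncased")
print(model.get_input_embeddings())  # e.g. Embedding(30522, 768, padding_idx=0)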

Training an Arabic Embedding

Arabic is a language widely spoken across the Middle East. We will train an embedding for Arabic words. This hands-on tutorial assumes knowledge of machine learning and PyTorch. The full code is available in the “Generate Word Embeddings.ipynb” notebook.

To train an embedding, we scraped a famous Arabic magazine and collected 1,000 articles. The articles are found in the “alaraby1k.json” file. The task we will use to train the embedding is to predict a third word given the two previous words. It is framed as a straightforward classification task: the model should predict the id of the third word. Assuming we have approximately 165,000 unique words, we have 165,000 classes in the output layer. Below is a snippet of the dataset.

{
  'author': 'بواسطة نعمان صارب',
  'section': 'منتدى العربي',
  'issue': '410',
  'text': ' تعرفني والسيف والرمح والقرطاس والقلم إلا أن بيتا واحدا أو حتى '
}

First of all, we want to create the dataset. We will extend the PyTorch “Dataset” class to encapsulate the data. During the initialization of the dataset, we read the JSON file, split the data into train and test sets, and generate the trigram tokens.

In every transformer training process, we need a vocab. The vocab is the set of unique words the data is composed of. During dataset initialization, we find all unique words and store them in a set, and then define two dictionaries: one maps a word to its index, and the other maps an index back to its word. We need these dictionaries because we read words but feed the network indices, and the network outputs numbers, i.e., indices, while we need words back.

The network will eventually be trained, and when we run tests, it can happen that a word in the test set is not in the model’s vocabulary. We need to map every such word to a common unknown-word token with its own index. For this, we add the token “<UNKNOWN>”, at index zero, to the vocab and use it to replace every word that is not in the vocab. The code ends up looking like this:

import json
from collections import defaultdict

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset


class MyDataset(Dataset):

    def __init__(self, alaraby_filepath, is_train):
        # Load the raw article texts and split them into train/test sets.
        self.raw_data = [article["text"] for article in json.load(open(alaraby_filepath, "r"))]
        self.train_raw_data, self.test_raw_data = train_test_split(self.raw_data, test_size=0.1, random_state=42)
        self.train_trigrams = self.__generate_trigrams__(self.train_raw_data)
        self.test_trigrams = self.__generate_trigrams__(self.test_raw_data)
        self.vocab, self.id_to_word, self.word_to_id = self.__compute_vocab__(self.train_raw_data)
        self.is_train = is_train

    def __generate_trigrams__(self, texts):
        # Slide a window of size 3 over every article.
        trigrams = []
        for text in texts:
            words = text.split()
            article_trigrams = [words[i:i + 3] for i in range(len(words) - 2)]
            trigrams += article_trigrams
        return trigrams

    def __compute_vocab__(self, texts):
        # Get unique words
        words = set()
        for text in texts:
            words.update(set(text.split()))
        # Reserve index zero for the unknown-word token.
        words_list = ["<UNKNOWN>"] + list(words)
        id_to_word = defaultdict(lambda: "<UNKNOWN>", {idx: value for idx, value in enumerate(words_list)})
        word_to_id = defaultdict(lambda: 0, {value: idx for idx, value in enumerate(words_list)})
        return words_list, id_to_word, word_to_id

    def __len__(self):
        return len(self.train_trigrams) if self.is_train else len(self.test_trigrams)

    def __getitem__(self, idx):
        # Return the trigram as a tuple of word ids: (input1, input2, target).
        trigrams = self.train_trigrams if self.is_train else self.test_trigrams
        trigram = [self.word_to_id[word] for word in trigrams[idx]]
        return tuple(trigram)

    def get_word_from_id(self, idx):
        return self.id_to_word[idx]

    def get_word_id(self, word):
        return self.word_to_id[word]

    def get_vocab_size(self):
        # Includes the "<UNKNOWN>" token.
        return len(self.vocab)
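With the class defined, we can instantiate the train and test datasets (a minimal sketch; it assumes “alaraby1k.json” sits next to the notebook):

train_dataset = MyDataset("alaraby1k.json", is_train=True)
test_dataset = MyDataset("alaraby1k.json", is_train=False)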

To build the neural network, we will use the “nn.Embedding” layer from PyTorch. The layer is comparable to a linear layer; however, it accepts indices rather than one-hot encoded vectors. Those indices are the vocab indices of words. The layer accepts two parameters: the vocabulary size and the embedding size. The latter is the size of the vector we use to represent each word. The larger the size, the more semantic structure the network can learn; it is important for the embedding to be relatively large so that the network has more degrees of freedom.
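As a quick, self-contained illustration (a toy example, not tied to the Arabic dataset), nn.Embedding is just a trainable lookup table: feeding it indices returns the corresponding rows of its weight matrix, which is equivalent to multiplying one-hot vectors by that matrix.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A toy vocabulary of 10 words embedded in 4 dimensions.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Feeding word indices returns the corresponding rows of the weight matrix.
indices = torch.tensor([1, 5])
vectors = embedding(indices)  # shape: (2, 4)

# Equivalent to multiplying one-hot vectors by the weight matrix.
one_hot = F.one_hot(indices, num_classes=10).float()
print(torch.allclose(vectors, one_hot @ embedding.weight))  # True

With that in mind, the next-word predictor looks like this: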


import torch
import torch.nn as nn


class NextWordPredictor(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(NextWordPredictor, self).__init__()
        # One embedding table shared by both input words.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # The two embeddings are concatenated, hence embedding_dim * 2 inputs.
        self.linear = nn.Linear(embedding_dim * 2, vocab_size)

    def forward(self, x1, x2):
        embedded1 = self.embedding(x1)
        embedded2 = self.embedding(x2)
        concatenated = torch.cat((embedded1, embedded2), dim=1)
        output = self.linear(concatenated)
        return output

We finally train the network:


from torch.optim import SGD
from torch.utils.data import DataLoader

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

batch_size = 1500
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

vocab_size = 165_000  # approximate; train_dataset.get_vocab_size() gives the exact count
embedding_dim = 1024
model = NextWordPredictor(vocab_size, embedding_dim)
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)

# Training loop
epochs = 100
for epoch in range(epochs):
    for i, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        x1, x2, target = batch
        x1 = x1.to(device)
        x2 = x2.to(device)
        target = target.to(device)
        output = model(x1, x2)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if i % 10 == 0:
            print(f"Epoch {epoch} Batch {i}, Loss: {loss.item()}")


Published via Towards AI
