
Building a Vocabulary for NLP in Modern C++
Last Updated on July 4, 2025 by Editorial Team
Author(s): Luiz doleron | Luiz d’Oleron
Originally published on Towards AI.
Building a vocabulary from a mass of text is one of the first steps in every NLP pipeline. This is easy to do when the amount of data is small. When the data is not small, as in a real-world dataset, building a vocabulary can be challenging.
The issue with processing massive real-world datasets is that, very often, they simply don't fit into any computer's memory. So, operations such as sorting the most frequent words cannot be performed as usual. Even if the data fits in memory, sorting the whole array costs too much time.
This story explains how to overcome these challenges, allowing the generation of vocabularies from real datasets of any size through the use of heaps.
Vocabularies are built in the tokenization step. Let us get started by understanding what that is.
Tokenization
Tokenization is the process of breaking a text into smaller units. We can tokenize a text into paragraphs, sentences, words, lexemes, or even into characters.
The most common tokenization is whitespace tokenization, where the text is broken into words. For example, consider the dialogue below:
This text can be broken into words as follows:
As you can see, we removed many language features such as punctuation, non-ASCII characters, and capitalization. This process of reducing the data variance is called data preparation.
The trade-offs of preparing NLP data deserve a proper story.
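The code later in this story calls a small helper named to_lower_case to normalize each token. As a minimal sketch of that data-preparation step, it could look like the snippet below; the strip_non_letters helper is purely illustrative and not part of the original code:

#include <algorithm>
#include <cctype>
#include <string>

// Lower-case a token in place, matching how to_lower_case is called later in this story.
void to_lower_case(std::string &word)
{
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return std::tolower(c); });
}

// Illustrative only: drop punctuation and non-ASCII characters from a sentence
// before whitespace tokenization.
std::string strip_non_letters(const std::string &text)
{
    std::string cleaned;
    for (unsigned char c : text)
    {
        if (std::isalpha(c) || std::isspace(c))
        {
            cleaned += c;
        }
    }
    return cleaned;
}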
After separating the tokens, we can count the occurrences:
Now, we can index the words as follows:
We call this table a vocabulary or dictionary. As you can see, a vocabulary is simply a mapping of each word into the respective index.
In general, the indices are sorted from the most common word to the least frequent one.
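As a purely illustrative example (the dialogue used in the figures above is not reproduced here), a vocabulary for a tiny corpus could look like this, with the most frequent word receiving the lowest index:

#include <string>
#include <unordered_map>

// Hypothetical example: tokens indexed from the most frequent to the least frequent.
const std::unordered_map<std::string, int> toy_vocabulary = {
    {"the", 1}, {"movie", 2}, {"is", 3}, {"good", 4}, {"music", 5}
};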
This is the point where things get hard.
It turns out that sorting all tokens is feasible only if the list is not too big. Check out an example:
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::unordered_map<std::string, int>
load_vocabulary(const std::vector<std::string> &data, const size_t MAX_TOKENS)
{
    // count the occurrences of each word in the dataset
    std::unordered_map<std::string, int> freqs;
    for (const std::string &sentence : data)
    {
        std::stringstream ss(sentence);
        std::string word;
        while (ss >> word)
        {
            to_lower_case(word);
            freqs[word]++;
        }
    }
    // copy the (word, count) pairs and sort them by frequency, descending
    std::vector<std::pair<std::string, int>> pairs(freqs.begin(), freqs.end());
    std::sort(pairs.begin(), pairs.end(), [](const auto &a, const auto &b)
              { return a.second > b.second; });
    // keep only the MAX_TOKENS most frequent words, indexed from 1
    std::unordered_map<std::string, int> vocabulary;
    const size_t limit = std::min(MAX_TOKENS, pairs.size());
    for (size_t i = 0; i < limit; ++i)
    {
        vocabulary[pairs[i].first] = i + 1;
    }
    return vocabulary;
}
Assuming that the data has N different words, this code costs O(N log N) time complexity and O(N) space complexity. We shall name this approach the brute force way.
The point is that brute force works only when N is small. If N is big, the O(N) space complexity can cause infrastructure issues. Even if the system can keep all N words in memory, the time spent sorting the whole list is too long and often impractical.
Fortunately, there is an approach based on heap trees that overcomes this issue, as explained below.
Building vocabularies using heaps
A heap is one of many data structures that use the metaphor of a tree. There are two types of heaps: max-heaps and min-heaps.
In max-heaps, the element with the maximum value is always at the top of the tree. In min-heaps, the top is always the element with the lowest value.
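As a quick illustration (a minimal sketch, not part of the original article), the STL's std::priority_queue, which we will use shortly, is a max-heap by default and becomes a min-heap when given a "greater-than" comparator:

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main()
{
    // max-heap: the largest element is always at the top
    std::priority_queue<int> max_heap;
    // min-heap: the smallest element is always at the top
    std::priority_queue<int, std::vector<int>, std::greater<int>> min_heap;

    for (int value : {5, 1, 9, 3})
    {
        max_heap.push(value);
        min_heap.push(value);
    }

    std::cout << max_heap.top() << '\n'; // 9
    std::cout << min_heap.top() << '\n'; // 1
    return 0;
}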
How does a heap decide if an element is greater or smaller than another? Using a comparison helper function like this:
using ENTRY = std::pair<std::string, int>;

const auto comparator = [](const ENTRY &a, const ENTRY &b)
{
    return a.second > b.second;
};
We could implement a min-heap or max-heap from scratch using a simple array. Fortunately, the STL in modern C++ already has heaps implemented:
#include <queue> // std::priority_queue is defined in the <queue> header
using COMPARATOR = decltype(comparator);
std::priority_queue<ENTRY, std::vector<ENTRY>, COMPARATOR> min_heap(comparator);
This code declares a min-heap. Now let us see how to use it to build a vocabulary. The idea is straightforward:
- First, we count the number of occurrences of each token
- Second, we use a heap to find the K most frequent tokens
These steps are explained below.
Counting the number of occurrences
Finding the frequency of each token can be achieved in linear time. Using std::unordered_map, we can count the occurrences of each word in a single loop:
using MAP = std::unordered_map<std::string, int>;
MAP freq;
for (const std::string &sentence : data)
{
std::stringstream ss(sentence);
std::string word;
while (ss >> word)
{
to_lower_case(word);
freq[word]++;
}
}
Since freq[word]++ takes O(1) time on average, the overall time complexity of this loop is O(N).
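For reference, here is a self-contained sketch of this counting step with some made-up sentences; the data variable and its contents are placeholders, and the lower-casing is inlined instead of calling the to_lower_case helper:

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main()
{
    // placeholder data; a real pipeline would stream sentences from disk
    const std::vector<std::string> data = {
        "The movie is good", "the music is good"
    };

    std::unordered_map<std::string, int> freq;
    for (const std::string &sentence : data)
    {
        std::stringstream ss(sentence);
        std::string word;
        while (ss >> word)
        {
            for (char &c : word)
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            freq[word]++;
        }
    }

    for (const auto &entry : freq)
    {
        std::cout << entry.first << ": " << entry.second << '\n';
    }
    return 0;
}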
Now, we can select the K-most frequent tokens to obtain the vocabulary.
As the name indicates, std::unordered_map does not keep any sense of order. One might suggest using std::map instead. However, std::map does not help in this context because it orders entries by key, not by value. Furthermore, the time complexity of std::map::operator[] is O(log N).
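A small sketch (illustrative only) makes the point: iterating over a std::map yields entries sorted by key, which says nothing about which tokens are the most frequent:

#include <iostream>
#include <map>
#include <string>

int main()
{
    // counts keyed by word: std::map sorts by the word, not by the count
    std::map<std::string, int> counts = {{"zebra", 50}, {"apple", 3}, {"movie", 500}};

    for (const auto &entry : counts)
    {
        // prints apple, movie, zebra - alphabetical order, not frequency order
        std::cout << entry.first << ": " << entry.second << '\n';
    }
    return 0;
}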
At this point, heap trees come to the rescue.
Finding the k-most frequent tokens
The idea behind using heap trees here is very simple: we will use a heap as a bag where we keep no more than K elements.
In the beginning, the bag is empty. We start by putting elements in the bag until it is full. When the bag is full, we remove the lowest element right after inserting a new one:
// index 0 is reserved for 'no word' and index 1 is reserved for unknown words
const int token_limit = _MAX_TOKENS - 2;
for (const auto &it : freq)
{
    min_heap.push(std::make_pair(it.first, it.second));
    // if the bag overflows, discard the least frequent token seen so far
    if (min_heap.size() > token_limit)
    {
        min_heap.pop();
    }
}
// pop the remaining tokens: they come out from the least to the most frequent,
// so the most frequent token ends up with the lowest index (2)
int index = min_heap.size() + 1;
while (!min_heap.empty())
{
    const auto &pair = min_heap.top();
    this->dictionary[pair.first] = index;
    min_heap.pop();
    index--;
}
Note that in a min-heap, the lowest element is always at the top.
This procedure ensures that the heap never holds more than K elements. Since we always remove the lowest element, at the end the heap contains the K most frequent ones.
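Putting the pieces together, a free-function version of this selection step could look like the sketch below. It is an assumption-laden rewrite of the member-function code above (the original relies on class members such as _MAX_TOKENS and this->dictionary), but the heap logic is the same:

#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using ENTRY = std::pair<std::string, int>;

// Select the k most frequent tokens from a frequency map and assign indices
// so that the most frequent token gets index 2 (0 and 1 stay reserved for
// NO_WORD and UNKNOWN_WORD).
std::unordered_map<std::string, int>
k_most_frequent(const std::unordered_map<std::string, int> &freq, size_t k)
{
    const auto comparator = [](const ENTRY &a, const ENTRY &b)
    { return a.second > b.second; };
    std::priority_queue<ENTRY, std::vector<ENTRY>, decltype(comparator)> min_heap(comparator);

    for (const auto &it : freq)
    {
        min_heap.push(std::make_pair(it.first, it.second));
        if (min_heap.size() > k)
        {
            min_heap.pop(); // drop the least frequent token seen so far
        }
    }

    std::unordered_map<std::string, int> vocabulary;
    int index = static_cast<int>(min_heap.size()) + 1;
    while (!min_heap.empty())
    {
        vocabulary[min_heap.top().first] = index;
        min_heap.pop();
        index--;
    }
    return vocabulary;
}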
Building the vocabulary
Now that we have filled the heap, we can finally build the vocabulary:
auto build_vocabulary() const
{
    const auto vocabulary = [map = this->dictionary](const S &sentence) mutable
    {
        // positions not filled remain 0, which stands for NO_WORD (padding)
        std::vector<int> result(_MAX_SEQUENCE_SIZE);
        std::stringstream ss(sentence);
        S word;
        int i = 0;
        while (i < _MAX_SEQUENCE_SIZE && ss >> word)
        {
            to_lower_case(word);
            // words not found in the dictionary map to index 1 (UNKNOWN_WORD)
            int index = map[word];
            if (!index)
            {
                index = _UNKNOW_WORD_INDEX;
            }
            result[i] = index;
            i++;
        }
        return result;
    };
    return vocabulary;
}
Note that we reserve the indices 0 and 1 to identify NO_WORD and UNKNOWN_WORD, respectively.
Using the vocabulary
The central concern in NLP is finding a way to represent text as numbers. We can use the vocabulary to do it very straightforwardly:
Tokenizer<std::string, 10'000, 250> tokenizer;
tokenizer.init(data);
auto vocabulary = tokenizer.build_vocabulary();
std::string sentence = "this movie is excitingdaring and the music is "
                       "very goodthe movie moonwalker was meant to coincide wi";
std::vector<int> token_indices = vocabulary(sentence);
// token_indices is {11, 19, 7, 1, 3, 2, 188, 7, 42, 1}
Usually, the next step is to use the token_indices array to index word embeddings. Word embeddings are a powerful way to represent words, allowing the retrieval of semantics from data. They are a perfect topic for another story.
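Just to sketch the idea (this is not from the original code, and the embedding table is hypothetical), indexing an embedding matrix with token_indices could look like this:

#include <vector>

// Hypothetical embedding table: one row per vocabulary index,
// each row holding an embedding vector of floats.
std::vector<std::vector<float>>
lookup_embeddings(const std::vector<int> &token_indices,
                  const std::vector<std::vector<float>> &embedding_table)
{
    std::vector<std::vector<float>> result;
    result.reserve(token_indices.size());
    for (int index : token_indices)
    {
        // each token index selects the corresponding embedding row
        result.push_back(embedding_table[index]);
    }
    return result;
}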
Validating the results
Comparing the two approaches, it is clear that sorting the whole token list is slower than using heaps.
In this experiment, we used 3 datasets:
- IMDB Large Movie Review Dataset (33.1 MB)
- Wikipedia Talk Pages Corpus (107.5 MB)
- Amazon Arts_Crafts_and_Sewing Review Dataset (35 GB)
You can check out the source code on GitHub.
Conclusion
Building vocabularies for huge datasets can be performed using heap trees to reduce latency. Modern C++ provides a native heap implementation through the std::priority_queue container from the STL.
Source Code
The code used here can be found on GitHub.
Questions, corrections, or concerns
Can you spot something wrong? Drop a message. Have a question? Drop a message. Do you want to talk about the code or its contents? Post your message. It is welcome and free!
How to contribute
Nobody paid me to write this story. If you enjoyed this tutorial and want to contribute, find a charity institution near you and support it with your money and/or time. No citation is needed. You needn't let me know either.
Have a question? Drop a message and I will reply as soon as possible.
Regards,
Luiz
Published via Towards AI