
Building a Vocabulary for NLP in Modern C++
Last Updated on July 4, 2025 by Editorial Team
Author(s): Luiz doleron | Luiz d’Oleron
Originally published on Towards AI.
Building a vocabulary from a mass of text is one of the first steps in every NLP pipeline. This is easy to do when the amount of data is small. When the data is not small, as in a real-world dataset, building a vocabulary can be challenging.
The issue with processing massive real-world datasets is that, very often, they simply don't fit into any computer's memory. So, operations such as sorting the most frequent words cannot be performed as usual. Even if the data fits in memory, sorting the whole array costs too much time.
This story explains how to overcome these challenges, allowing the generation of vocabularies from real datasets of any size through the use of heaps.
Vocabularies are built in the tokenization step. Let us get started by understanding what that is.
Tokenization
Tokenization is the process of breaking a text into smaller units. We can tokenize a text into paragraphs, sentences, words, lexemes, or even into characters.
The most common tokenization is whitespace tokenization, where the text is broken into words. For example, consider the dialogue below:
This text can be broken into words as follows:
As you can see, we removed many language features such as punctuation, non-ASCII characters, and capitalization. This process of reducing the data variance is called data preparation.
The trade-offs of preparing NLP data deserve a proper story.
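The code later in this story calls a small helper named to_lower_case to normalize each token. As a minimal sketch of that data-preparation step, it could look like the snippet below; the strip_non_letters helper is purely illustrative and not part of the original code:

#include <algorithm>
#include <cctype>
#include <string>

// Lower-case a token in place, matching how to_lower_case is called later in this story.
void to_lower_case(std::string &word)
{
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return std::tolower(c); });
}

// Illustrative only: drop punctuation and non-ASCII characters from a sentence
// before whitespace tokenization.
std::string strip_non_letters(const std::string &text)
{
    std::string cleaned;
    for (unsigned char c : text)
    {
        if (std::isalpha(c) || std::isspace(c))
        {
            cleaned += c;
        }
    }
    return cleaned;
}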
After separating the tokens, we can count the occurrences:
Now, we can index the words as follows:
We call this table a vocabulary or dictionary. As you can see, a vocabulary is simply a mapping of each word into the respective index.
In general, the indices are sorted from the most common word to the least frequent one.
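As a purely illustrative example (the dialogue used in the figures above is not reproduced here), a vocabulary for a tiny corpus could look like this, with the most frequent word receiving the lowest index:

#include <string>
#include <unordered_map>

// Hypothetical example: tokens indexed from the most frequent to the least frequent.
const std::unordered_map<std::string, int> toy_vocabulary = {
    {"the", 1}, {"movie", 2}, {"is", 3}, {"good", 4}, {"music", 5}
};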
This is the point where things get hard.
It turns out that sorting all tokens is feasible only if the list is not too big. Check out an example:
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::unordered_map<std::string, int>
load_vocabulary(const std::vector<std::string> &data, const size_t MAX_TOKENS)
{
    // count the occurrences of each word in the dataset
    std::unordered_map<std::string, int> freqs;
    for (const std::string &sentence : data)
    {
        std::stringstream ss(sentence);
        std::string word;
        while (ss >> word)
        {
            to_lower_case(word);
            freqs[word]++;
        }
    }
    // copy the (word, count) pairs and sort them by frequency, descending
    std::vector<std::pair<std::string, int>> pairs(freqs.begin(), freqs.end());
    std::sort(pairs.begin(), pairs.end(), [](const auto &a, const auto &b)
              { return a.second > b.second; });
    // keep only the MAX_TOKENS most frequent words, indexed from 1
    std::unordered_map<std::string, int> vocabulary;
    const size_t limit = std::min(MAX_TOKENS, pairs.size());
    for (size_t i = 0; i < limit; ++i)
    {
        vocabulary[pairs[i].first] = i + 1;
    }
    return vocabulary;
}
Assuming that the data has N different words, this code costs O(N log N) time complexity and O(N) space complexity. We shall name this approach the brute force way.
The point is that brute force works only when N is small. If N is big, the O(N) space complexity can cause infrastructure issues. Even if the system can keep all N words in memory, the time spent sorting the whole list is too long and often impractical.
Fortunately, there is an approach based on heap trees that overcomes this issue, as explained below.
Building vocabularies using heaps
A heap is one of many data structures that use the metaphor of a tree. There are two types of heaps: max-heaps and min-heaps.
In max-heaps, the element with the maximum value is always at the top of the tree. In min-heaps, the top is always the element with the lowest value.
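As a quick illustration (a minimal sketch, not part of the original article), the STL's std::priority_queue, which we will use shortly, is a max-heap by default and becomes a min-heap when given a "greater-than" comparator:

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main()
{
    // max-heap: the largest element is always at the top
    std::priority_queue<int> max_heap;
    // min-heap: the smallest element is always at the top
    std::priority_queue<int, std::vector<int>, std::greater<int>> min_heap;

    for (int value : {5, 1, 9, 3})
    {
        max_heap.push(value);
        min_heap.push(value);
    }

    std::cout << max_heap.top() << '\n'; // 9
    std::cout << min_heap.top() << '\n'; // 1
    return 0;
}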
How does a heap decide if an element is greater or smaller than another? Using a comparison helper function like this:
using ENTRY = std::pair<std::string, int>;

const auto comparator = [](const ENTRY &a, const ENTRY &b)
{
    return a.second > b.second;
};
We could implement a min-heap or max-heap from scratch using a simple array. Fortunately, the STL in modern C++ already has heaps implemented:
#include <queue> // std::priority_queue is defined in the <queue> header
using COMPARATOR = decltype(comparator);
std::priority_queue<ENTRY, std::vector<ENTRY>, COMPARATOR> min_heap(comparator);
This code declares a min-heap. Now let us see how to use it to build a vocabulary. The idea is straightforward:
- First, we count the number of occurrences of each token
- Second, we use a heap to find the K most frequent tokens
These steps are explained below.
Counting the number of occurrences
Finding the frequency of each token can be achieved in linear time. Using std::unordered_map, we can count the occurrences of each word in a single loop:
using MAP = std::unordered_map<std::string, int>;
MAP freq;
for (const std::string &sentence : data)
{
std::stringstream ss(sentence);
std::string word;
while (ss >> word)
{
to_lower_case(word);
freq[word]++;
}
}
Since freq[word]++ takes O(1) time on average, the overall time complexity of this loop is O(N).
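For reference, here is a self-contained sketch of this counting step with some made-up sentences; the data variable and its contents are placeholders, and the lower-casing is inlined instead of calling the to_lower_case helper:

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main()
{
    // placeholder data; a real pipeline would stream sentences from disk
    const std::vector<std::string> data = {
        "The movie is good", "the music is good"
    };

    std::unordered_map<std::string, int> freq;
    for (const std::string &sentence : data)
    {
        std::stringstream ss(sentence);
        std::string word;
        while (ss >> word)
        {
            for (char &c : word)
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            freq[word]++;
        }
    }

    for (const auto &entry : freq)
    {
        std::cout << entry.first << ": " << entry.second << '\n';
    }
    return 0;
}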
Now, we can select the K-most frequent tokens to obtain the vocabulary.
As the name indicates, std::unordered_map does not keep any sense of order. One might suggest using std::map instead. However, std::map does not help in this context because it orders entries by key, not by value. Furthermore, the time complexity of std::map::operator[] is O(log N).
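A small sketch (illustrative only) makes the point: iterating over a std::map yields entries sorted by key, which says nothing about which tokens are the most frequent:

#include <iostream>
#include <map>
#include <string>

int main()
{
    // counts keyed by word: std::map sorts by the word, not by the count
    std::map<std::string, int> counts = {{"zebra", 50}, {"apple", 3}, {"movie", 500}};

    for (const auto &entry : counts)
    {
        // prints apple, movie, zebra - alphabetical order, not frequency order
        std::cout << entry.first << ": " << entry.second << '\n';
    }
    return 0;
}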
At this point, heap trees come to the rescue.
Finding the k-most frequent tokens
The idea behind using heap trees here is very simple: we will use a heap as a bag where we keep no more than K elements.
In the beginning, the bag is empty. We start by putting elements in the bag until it is full. When the bag is full, we remove the lowest element right after inserting a new one:
// index 0 is reserved for 'no word' and index 1 is reserved for unknown words
const int token_limit = _MAX_TOKENS - 2;
for (const auto &it : freq)
{
    min_heap.push(std::make_pair(it.first, it.second));
    // if the bag overflows, discard the least frequent token seen so far
    if (min_heap.size() > token_limit)
    {
        min_heap.pop();
    }
}
// pop the remaining tokens: they come out from the least to the most frequent,
// so the most frequent token ends up with the lowest index (2)
int index = min_heap.size() + 1;
while (!min_heap.empty())
{
    const auto &pair = min_heap.top();
    this->dictionary[pair.first] = index;
    min_heap.pop();
    index--;
}
Note that in a min-heap, the lowest element is always at the top.
This procedure ensures that the heap never holds more than K elements. Since we always remove the lowest element, at the end the heap contains the K most frequent ones.
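Putting the pieces together, a free-function version of this selection step could look like the sketch below. It is an assumption-laden rewrite of the member-function code above (the original relies on class members such as _MAX_TOKENS and this->dictionary), but the heap logic is the same:

#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using ENTRY = std::pair<std::string, int>;

// Select the k most frequent tokens from a frequency map and assign indices
// so that the most frequent token gets index 2 (0 and 1 stay reserved for
// NO_WORD and UNKNOWN_WORD).
std::unordered_map<std::string, int>
k_most_frequent(const std::unordered_map<std::string, int> &freq, size_t k)
{
    const auto comparator = [](const ENTRY &a, const ENTRY &b)
    { return a.second > b.second; };
    std::priority_queue<ENTRY, std::vector<ENTRY>, decltype(comparator)> min_heap(comparator);

    for (const auto &it : freq)
    {
        min_heap.push(std::make_pair(it.first, it.second));
        if (min_heap.size() > k)
        {
            min_heap.pop(); // drop the least frequent token seen so far
        }
    }

    std::unordered_map<std::string, int> vocabulary;
    int index = static_cast<int>(min_heap.size()) + 1;
    while (!min_heap.empty())
    {
        vocabulary[min_heap.top().first] = index;
        min_heap.pop();
        index--;
    }
    return vocabulary;
}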
Building the vocabulary
Now that we have filled the heap, we can finally build the vocabulary:
auto build_vocabulary() const
{
    const auto vocabulary = [map = this->dictionary](const S &sentence) mutable
    {
        // positions not filled remain 0, which stands for NO_WORD (padding)
        std::vector<int> result(_MAX_SEQUENCE_SIZE);
        std::stringstream ss(sentence);
        S word;
        int i = 0;
        while (i < _MAX_SEQUENCE_SIZE && ss >> word)
        {
            to_lower_case(word);
            // words not found in the dictionary map to index 1 (UNKNOWN_WORD)
            int index = map[word];
            if (!index)
            {
                index = _UNKNOW_WORD_INDEX;
            }
            result[i] = index;
            i++;
        }
        return result;
    };
    return vocabulary;
}
Note that we reserve the indices 0 and 1 to identify NO_WORD and UNKNOWN_WORD, respectively.
Using the vocabulary
The central concern in NLP is finding a way to represent text as numbers. We can use the vocabulary to do it very straightforwardly:
Tokenizer<std::string, 10'000, 250> tokenizer;
tokenizer.init(data);
auto vocabulary = tokenizer.build_vocabulary();
std::string sentence = "this movie is excitingdaring and the music is "
                       "very goodthe movie moonwalker was meant to coincide wi";
std::vector<int> token_indices = vocabulary(sentence);
// token_indices is {11, 19, 7, 1, 3, 2, 188, 7, 42, 1}
Usually, the next step is to use the token_indices array to index word embeddings. Word embeddings are a powerful way to represent words, allowing the retrieval of semantics from data. They are a perfect topic for another story.
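Just to sketch the idea (this is not from the original code, and the embedding table is hypothetical), indexing an embedding matrix with token_indices could look like this:

#include <vector>

// Hypothetical embedding table: one row per vocabulary index,
// each row holding an embedding vector of floats.
std::vector<std::vector<float>>
lookup_embeddings(const std::vector<int> &token_indices,
                  const std::vector<std::vector<float>> &embedding_table)
{
    std::vector<std::vector<float>> result;
    result.reserve(token_indices.size());
    for (int index : token_indices)
    {
        // each token index selects the corresponding embedding row
        result.push_back(embedding_table[index]);
    }
    return result;
}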
Validating the results
Comparing the two approaches, it is clear that sorting the whole token list is slower than using heaps.
In this experiment, we used 3 datasets:
- IMDB Large Movie Review Dataset (33.1 MB)
- Wikipedia Talk Pages Corpus (107.5 MB)
- Amazon Arts_Crafts_and_Sewing Review Dataset (35 GB)
You can check out the source code on GitHub.
Conclusion
Building vocabularies for huge datasets can be performed using heap trees to reduce latency. Modern C++ provides a native heap implementation through the std::priority_queue container from the STL.
Source Code
The code used here can be found on GitHub.
Questions, corrections, or concerns
Can you spot something wrong? Drop a message. Have a question? Drop a message. Do you want to talk about the code or its contents? Post your message. It is welcome and free!
How to contribute
Nobody paid me to write this story. If you enjoyed this tutorial and want to contribute, find a charity institution near you and support it with your money and/or time. No citation is needed. You needn't let me know either.
Have a question? Drop a message and I will reply as soon as possible.
Regards,
Luiz
Published via Towards AI