
Building a Vocabulary for NLP in Modern C++

Last Updated on July 4, 2025 by Editorial Team

Author(s): Luiz d’Oleron

Originally published on Towards AI.

Building a vocabulary from a mass of text is one of the first steps in every NLP pipeline. This is easy when the amount of data is small. When the data is not small, as in real-world datasets, building a vocabulary can be challenging.

The issue with processing massive real-world datasets is that, very often, they simply do not fit into the computer's memory. Operations such as sorting the words by frequency cannot be performed as usual. Even when the data does fit into memory, sorting the whole list takes too long.

This story explains how to overcome these challenges and generate vocabularies from real datasets of any size through the use of heaps.

Vocabularies are built in the tokenization step. Let us get started by understanding what that is.

Tokenization

Tokenization is the process of breaking a text into smaller units. We can tokenize a text into paragraphs, sentences, words, lexemes, or even into characters.

The most common tokenization is whitespace tokenization, where the text is broken into words. For example, consider the dialogue below:

Standard talk nowadays

This text can be broken into words as follows:

As you can see, we removed many language features such as punctuation, non-ASCII characters, and capitalization. This process of reducing the data variance is called data preparation.
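As a minimal sketch of whitespace tokenization with this kind of cleanup (the sample sentence and the clean helper are made up for this illustration, not taken from the article's code):

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Lowercase a word and drop punctuation and non-ASCII characters.
std::string clean(const std::string &raw)
{
    std::string out;
    for (unsigned char c : raw)
    {
        if (c < 128 && std::isalnum(c))
        {
            out.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    return out;
}

int main()
{
    const std::string text = "Have you seen Moonwalker? The music is very good!";

    std::vector<std::string> tokens;
    std::stringstream ss(text);
    std::string word;
    while (ss >> word) // whitespace tokenization
    {
        const std::string token = clean(word);
        if (!token.empty())
        {
            tokens.push_back(token);
        }
    }

    for (const auto &token : tokens)
    {
        std::cout << token << "\n";
    }
}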

The trade-offs of preparing NLP data deserve a proper story.

After separating the tokens, we can count the occurrences:

Table of token frequencies

Now, we can index the words as follows:

We call this table a vocabulary or dictionary. As you can see, a vocabulary is simply a mapping from each word to its index.

In general, the indices are assigned from the most frequent word to the least frequent one.
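For illustration only, a vocabulary for a tiny corpus could look like the snippet below; the tokens and counts are made up:

#include <string>
#include <unordered_map>

// hypothetical counts: the -> 4, movie -> 3, is -> 2, good -> 1
// indices run from the most frequent token to the least frequent one
const std::unordered_map<std::string, int> vocabulary = {
    {"the", 1},
    {"movie", 2},
    {"is", 3},
    {"good", 4},
};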

This is the point where things get hard.

It turns out that sorting all tokens is feasible only if the list is not too big. Check out an example:

// to_lower_case is the project's helper that lowercases a word in place
std::unordered_map<std::string, int>
load_vocabulary(const std::vector<std::string> &data, const size_t MAX_TOKENS)
{
    // count the occurrences of each token
    std::unordered_map<std::string, int> freqs;
    for (const std::string &sentence : data)
    {
        std::stringstream ss(sentence);
        std::string word;
        while (ss >> word)
        {
            to_lower_case(word);
            freqs[word]++;
        }
    }

    // copy the (token, count) pairs into a vector and sort it by count, descending
    std::vector<std::pair<std::string, int>> pairs;
    for (const auto &pair : freqs)
    {
        pairs.push_back(std::make_pair(pair.first, pair.second));
    }
    std::sort(pairs.begin(), pairs.end(), [](const auto &a, const auto &b)
              { return a.second > b.second; });

    // keep only the MAX_TOKENS most frequent tokens
    std::unordered_map<std::string, int> vocabulary;
    const size_t limit = std::min(MAX_TOKENS, pairs.size());
    for (size_t i = 0; i < limit; ++i)
    {
        vocabulary[pairs[i].first] = i + 1;
    }
    return vocabulary;
}

Assuming that the data has N different words, this code costs O(N log N) time complexity and O(N) space complexity. We shall name this approach the brute force way.

The point is that brute force is suitable if N is small. If N is big, the O(N) space complexity can result in infrastructure issues. Even if the system can store all the N words in memory, the time taken to sort the whole list is too long and often impractical.

Fortunately, there is an approach based on heap trees that overcomes this issue, as explained below.

Building vocabularies using heaps

A heap is one of many data structures that uses the metaphor of trees. There are two types of heaps: max-heap and min-heap.

Example of min-heap

In max-heaps, the element with the maximum value is always at the top of the tree. In min-heaps, the top is always the element with the lowest value.

How does a heap decide if an element is greater or smaller than another? Using a comparison helper function like this:

using ENTRY = std::pair<std::string, int>;

const auto comparator = [](const ENTRY &a, const ENTRY &b)
{
    return a.second > b.second;
};

We can implement a min-heap or max-heap from scratch using a simple array. Fortunately, the STL in modern C++ already provides heaps for us:

#include <queue>

using COMPARATOR = decltype(comparator);

std::priority_queue<ENTRY, std::vector<ENTRY>, COMPARATOR> min_heap(comparator);

This code declares a min-heap: std::priority_queue is a max-heap by default, and passing a "greater than" comparator turns it into a min-heap. Now, let's see how to use it to build a vocabulary. The idea is straightforward:

  • First, we count the number of occurrences of each token
  • Second, we use a heap to find the K most frequent tokens

These steps are explained below.

Counting the number of occurrences

Finding the frequencies of each token can be achieved in linear time. Using std::unordered_map, we can count the occurrence of each word in a single loop:

using MAP = std::unordered_map<std::string, int>;

MAP freq;
for (const std::string &sentence : data)
{
    std::stringstream ss(sentence);

    std::string word;

    while (ss >> word)
    {
        to_lower_case(word);
        freq[word]++;
    }
}

Since freq[word]++ takes O(1) on average, the overall time complexity of this loop is O(N).

Now, we can select the K-most frequent tokens to obtain the vocabulary.

As the name indicates, std::unordered_map does not keep any notion of order. One could suggest using std::map instead. However, std::map does not help in this context because it orders entries by key, not by value. Furthermore, the time complexity of std::map::operator[] is O(log N).
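A quick illustration of the point (the tokens and counts are made up):

#include <iostream>
#include <map>
#include <string>

int main()
{
    // hypothetical counts: "zebra" is far more frequent than "apple"
    const std::map<std::string, int> freq = {{"zebra", 100}, {"apple", 1}};

    // iteration follows the key order, so "apple" comes first,
    // regardless of how frequent each token is
    for (const auto &entry : freq)
    {
        std::cout << entry.first << " -> " << entry.second << "\n";
    }
}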

At this point, heap trees come to the rescue.

Finding the K most frequent tokens

The idea behind using heap trees here is very simple: we will use a heap as a bag where we keep no more than K elements.

Example of min-heap with K = 7

In the beginning, the bag is empty. We start by putting elements in the bag until it is full. When the bag is full, we remove the lowest element right after inserting a new one:

// index 0 is reserved for 'no word' and 1 is reserved for unknown words
const int token_limit = _MAX_TOKENS - 2;

for (const auto &it : freq)
{
    min_heap.push(std::make_pair(it.first, it.second));

    if (min_heap.size() > token_limit)
    {
        min_heap.pop();
    }
}

int index = min_heap.size() + 1;
while (!min_heap.empty())
{
    const auto &pair = min_heap.top();
    this->dictionary[pair.first] = index;
    min_heap.pop();
    index--;
}

Note that in a min-heap the lowest element is always at the top.

This procedure ensures that the heap has no more than K elements. Since we are always removing the lowest element, in the end, the heap contains the K most frequent ones. Each push and pop costs O(log K), so this selection step runs in O(N log K) time and keeps only K entries in the heap, instead of sorting all N tokens in O(N log N) time.

Building the vocabulary

Now that we have filled the heap, we can finally build the vocabulary:

auto build_vocabulary() const
{
    // the lambda captures a copy of the dictionary; mutable is required
    // because std::unordered_map::operator[] is non-const
    const auto vocabulary = [map = this->dictionary](const S &sentence) mutable
    {
        // entries default to 0, the 'no word' padding index
        std::vector<int> result(_MAX_SEQUENCE_SIZE);

        std::stringstream ss(sentence);
        S word;
        int i = 0;
        while (i < _MAX_SEQUENCE_SIZE && ss >> word)
        {
            to_lower_case(word);
            int index = map[word];
            if (!index)
            {
                // tokens not in the dictionary map to the 'unknown word' index
                index = _UNKNOW_WORD_INDEX;
            }
            result[i] = index;
            i++;
        }

        return result;
    };

    return vocabulary;
}

Note that we reserve the indices 0 and 1 to identify NO_WORD and UNKNOWN_WORD respectively.

Using the vocabulary

The central concern in NLP is finding a way to represent text as numbers. We can use the vocabulary to do it very straightforwardly:

Tokenizer<std::string, 10'000, 250> tokenizer;
tokenizer.init(data);

auto vocabulary = tokenizer.build_vocabulary();

std::string sentence = "this movie is excitingdaring and the music is "
                       "very goodthe movie moonwalker was meant to coincide wi";

std::vector<int> token_indices = vocabulary(sentence);

// token_indices is {11, 19, 7, 1, 3, 2, 188, 7, 42, 1}

Usually, the next step is using the token_indices array to index word embeddings. Word embeddings are a powerful way to represent words, allowing the retrieval of semantics from data.
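As a rough sketch of that next step (the embedding table below is hypothetical, not part of this article's code), the token indices simply select rows from an embedding matrix:

#include <cstddef>
#include <vector>

int main()
{
    // hypothetical embedding table: one 64-dimensional row per vocabulary index
    const std::size_t vocabulary_size = 10'000;
    const std::size_t embedding_dim = 64;
    std::vector<std::vector<float>> embeddings(vocabulary_size,
                                               std::vector<float>(embedding_dim, 0.0f));

    // token indices produced by the vocabulary, as in the example above
    const std::vector<int> token_indices = {11, 19, 7, 1, 3, 2, 188, 7, 42, 1};

    // the embedded sentence is just the corresponding rows, one per token
    std::vector<std::vector<float>> embedded;
    for (const int index : token_indices)
    {
        embedded.push_back(embeddings[index]);
    }
}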

Word embeddings are a perfect topic for another story.

Validating the results

Comparing the two approaches, it is clear that sorting the whole token list is slower than using heaps.

In this experiment, we used 3 datasets.

You can check out the source code on GitHub.
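If you want to reproduce such a comparison, a minimal timing harness could look like the sketch below; the calls in the usage comment are placeholders for the two approaches:

#include <chrono>
#include <iostream>

// Measure the wall-clock time of a callable, in milliseconds.
template <typename F>
long long measure_ms(F &&build)
{
    const auto start = std::chrono::steady_clock::now();
    build();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

// usage sketch (the calls are placeholders for the brute force and heap approaches):
// const auto brute_force_ms = measure_ms([&] { load_vocabulary(data, 10'000); });
// const auto heap_ms        = measure_ms([&] { tokenizer.init(data); });
// std::cout << brute_force_ms << " ms vs " << heap_ms << " ms\n";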

Conclusion

Building vocabularies for huge datasets can be performed using heap trees to reduce latency. Modern C++ provides a native implementation of heaps with the std::priority_queue container adapter from the STL.

Source Code

The code used here can be found on GitHub.

Questions, corrections, or concerns

Can you spot something wrong? Drop a message. Have a question? Drop a message. Do you want to talk about the code or contribute? Post your message. It is welcome and free!

How to contribute

Nobody paid me to write this story. If you enjoyed this tutorial and want to contribute, find a charity near you and support it with your money and/or time. No citation is needed. You needn't let me know either.

Have a question? Drop a message and I will reply as soon as possible.

Regards,
Luiz
