

Demystifying AI for everyone: Part 2-NLP Vectorization

Last Updated on July 17, 2023 by Editorial Team

Author(s): Himanshu Joshi

Originally published on Towards AI.

In the age of ChatGPT, let’s learn the basics

I started this Demystifying AI for everyone series to explain the basic building blocks of NLP in layman's language, so that in the age of ChatGPT, everyone understands what goes into such a complex language model.

In this post, I will share what vectorization means in NLP and cover the vectorization techniques that are most commonly used.

Photo by Amador Loureiro on Unsplash

If you haven't read the first part on NLP Basics, do give it a read.

Part 1-NLP Basics

Vectorization, in simple words, is nothing but converting words into numeric vectors so that computers can understand them.

Vectorization techniques are used in natural language processing (NLP) to convert textual data into numerical representations that can be processed by machine learning algorithms.

Here are some common vectorization techniques used in NLP:

Bag of Words (BoW):

This technique represents text as a matrix of word counts, where each row corresponds to a document, and each column corresponds to a word in the vocabulary. The values in the matrix represent the frequency of each word in each document.

BoW counts the frequency of each word in a document and creates a vector where each element represents the count of a particular word in the document.

For example, consider the three sentences we are working with:

  1. The cat in the hat
  2. The dog is on the street
  3. The bird is in the cage

Before we move ahead, let's define some vocabulary

Every sentence is a document, and all the sentences together make up the corpus:

document = individual sentence

corpus = all the documents put together

For a simplified vocabulary of ["the", "cat", "in", "hat", "dog", "bird"], the sentence (document) "The cat in the hat" would be represented as [2, 1, 1, 1, 0, 0], since "the" appears twice and "dog" and "bird" do not appear at all.

In the simplest (binary) variant of BoW, a word that is present in a document is assigned 1 and a word that is not present is assigned 0; the standard variant, as above, records the actual counts.
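
To see this in practice, here is a minimal sketch of BoW using scikit-learn's CountVectorizer (one common choice of library; the library and parameters are my assumption, not something prescribed by the technique itself). It builds the vocabulary from all three documents, so the vectors are longer than the simplified six-word example above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Our tiny corpus: each sentence is one document
corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

# CountVectorizer lowercases the text, builds the vocabulary,
# and counts how often each word appears in each document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow_matrix.toarray())                # one row of counts per document
```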

Term Frequency-Inverse Document Frequency (TF-IDF):

This technique is similar to BoW, but it weights the word counts by how rare each word is across the corpus.

Words that frequently occur in a document but infrequently in the corpus are considered more important and are given higher weights.

To explain this in simple words: a word that appears in every sentence will not give us as much information as a word that appears in only a few sentences, right?

For example: “the” appears in all sentences. Does it give any information? Most likely not.

At the same time, "cat", "dog", and "bird" each appear in only one sentence, so the probability of them providing useful information is much higher than that of "the".

The formula for Term Frequency-Inverse Document Frequency (TF-IDF) is:

TF-IDF(w, d) = TF(w, d) × IDF(w)

where:

  • TF(w, d) is the frequency of the word w in document d.
  • IDF(w) is the inverse document frequency of the word w, calculated as IDF(w) = log(N / n), where N is the total number of documents in the corpus and n is the number of documents that contain the word w.

So if a word is present in all documents, N / n = 1 and its IDF becomes log(1) = 0. In short, the word is given zero importance.
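
Here is a similar sketch with scikit-learn's TfidfVectorizer (again, the library is my choice for illustration; note that scikit-learn uses a smoothed version of the IDF formula, so common words like "the" get a low weight rather than exactly zero).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

# TfidfVectorizer combines the word counts (TF) with the IDF weighting
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf_matrix.toarray().round(2))     # one row of TF-IDF weights per document
```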

Word Embeddings:

This technique represents words as dense vectors in a continuous vector space (typically a few hundred dimensions), where the distance between vectors reflects the semantic similarity between the corresponding words. Popular word embedding models include Word2Vec and GloVe.

Word embeddings are learned through a process called training, which involves mapping words from a corpus to dense, low-dimensional vectors. One common method for learning word embeddings is the skip-gram model with negative sampling (SGNS), used in Word2Vec.

Example word representations look as follows:

“dog” = [0.1548, 0.4848, …, 1.864]

“cat” = [0.8785, 0.8974, …, 2.794]

The most important feature of word embeddings is that similar words in a semantic sense have a smaller distance (either Euclidean, cosine, or other) between them than words that have no semantic relationship. For example, words like “cat” and “dog” should be closer together than the words “cat” and “street” or “dog” and “hat”.

In Word2Vec, for example, the embeddings are learned using a shallow neural network with one input layer, one hidden layer, and one output layer.

In practice, we normally use pre-trained word embeddings as they are trained on a large corpus of data from Wikipedia or elsewhere.
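
As a rough sketch of how pre-trained embeddings are used, here is an example with the gensim library and the small "glove-wiki-gigaword-50" GloVe vectors (my choice of model for illustration; the first call downloads the vectors, so it needs an internet connection).

```python
import gensim.downloader as api

# Load small pre-trained GloVe vectors (50 dimensions per word)
glove = api.load("glove-wiki-gigaword-50")

print(glove["cat"][:5])                   # first few numbers of the "cat" vector
print(glove.similarity("cat", "dog"))     # semantically related words -> higher score
print(glove.similarity("cat", "street"))  # unrelated words -> lower score
```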

Character Embeddings:

This technique represents words as sequences of characters and learns a vector representation for each character. These character vectors are then combined to form a vector representation of the word.
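
To make the idea concrete, here is a bare-bones sketch using a PyTorch embedding layer over character indices (everything here, from the tiny alphabet to the averaging step, is a simplifying assumption; in a real model these vectors would be learned as part of a larger network).

```python
import torch
import torch.nn as nn

# Map each character to an index (a real system would cover a full alphabet)
chars = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {c: i for i, c in enumerate(chars)}

# One trainable vector per character
char_embedding = nn.Embedding(num_embeddings=len(chars), embedding_dim=8)

word = "cat"
indices = torch.tensor([char_to_idx[c] for c in word])
char_vectors = char_embedding(indices)  # one 8-dimensional vector per character
word_vector = char_vectors.mean(dim=0)  # combine (here, average) into a word vector
print(word_vector)
```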

Contextualized Word Embeddings:

These are word embeddings that take into account the context in which a word appears. Popular models include BERT, GPT-2, and XLNet.
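
For a taste of what this looks like in code, here is a minimal sketch using the Hugging Face transformers library with a pre-trained BERT model (assuming transformers and PyTorch are installed; the sentences are just illustrative).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word ("bank") gets a different vector depending on its context
sentences = ["I deposited cash at the bank", "We sat on the bank of the river"]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state holds one contextual vector per token
        print(text, "->", outputs.last_hidden_state.shape)
```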

Subword Embeddings:

These embeddings represent words as a sequence of subwords, which can capture morphological information and handle out-of-vocabulary words. Popular models include FastText and Byte Pair Encoding (BPE).
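
Here is a small sketch with gensim's FastText implementation (the toy corpus is far too small to learn meaningful vectors, but it shows that FastText can still build a vector for a word it never saw, from its character n-grams).

```python
from gensim.models import FastText

# Tokenized toy corpus (lowercased, split on spaces)
corpus = [
    "the cat in the hat".split(),
    "the dog is on the street".split(),
    "the bird is in the cage".split(),
]

# Train a tiny FastText model; word vectors are built from character n-grams
model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=20)

print(model.wv["cat"][:5])   # vector for a word seen during training
print(model.wv["cats"][:5])  # out-of-vocabulary word, built from its subwords
```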

These vectorization techniques have different strengths and weaknesses and are suitable for different NLP tasks. It’s important to choose the right technique for your specific use case.

For example, if you have a very small corpus, TF-IDF might be the better choice; if you have a huge corpus, word embeddings are usually the better way to go.

Again, all these techniques have dedicated libraries, so we won't have to code anything from scratch (most of the time).

Hope you enjoyed this post; I have tried to explain it in a very simple manner.

I will be sharing more NLP concepts in the upcoming parts of this series.

If you liked this post, do consider following me for similar content, and share your thoughts.

All the best for your journey. Onwards and Upwards people…


