Demystifying AI for Everyone, Part 2: NLP Vectorization
Last Updated on July 17, 2023 by Editorial Team
Author(s): Himanshu Joshi
Originally published on Towards AI.
In the age of ChatGPT, let's learn the basics
I have started this Demystifying AI for Everyone series to explain the basic building blocks of NLP in layman's terms, so that in the age of ChatGPT, everyone understands what goes into such a complex language model.
In this post, I will share what vectorization means in NLP and the vectorization techniques that are most commonly used.
If you haven't read the first part on NLP Basics, do give it a read.
Vectorization, in simple words, is nothing but converting words into vector (numerical) form so that computers can understand them.
Vectorization techniques are used in natural language processing (NLP) to convert textual data into numerical representations that can be processed by machine learning algorithms.
Here are some common vectorization techniques used in NLP:
Bag of Words (BoW):
This technique represents text as a matrix of word counts, where each row corresponds to a document and each column corresponds to a word in the vocabulary. The value in each cell is the number of times that word appears in that document.
For example, consider the following three sentences:
- The cat in the hat
- The dog is on the street
- The bird is in the cage
Before we move ahead, let's define some vocabulary: every sentence is a document, and all the sentences together make up the corpus.
document = an individual sentence
corpus = all the documents put together
The sentence (document) "The cat in the hat" would be represented as [2, 1, 1, 1, 0, 0] for the words "the", "cat", "in", "hat", "dog", and "bird", respectively (only part of the full corpus vocabulary is shown here to keep the example short).
In other words, each word present in a document gets its count, and a word not present gets 0. A simpler binary variant of BoW just assigns 1 to present words and 0 to absent ones.
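To make this concrete, here is a minimal sketch of count-based BoW using scikit-learn's CountVectorizer; the library choice and the tiny three-sentence corpus are just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each sentence is a document; together they form the corpus.
corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

vectorizer = CountVectorizer()              # pass binary=True for the 1/0 variant
bow_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # the learned vocabulary (one column per word)
print(bow_matrix.toarray())                 # one row of word counts per document
```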
Term Frequency-Inverse Document Frequency (TF-IDF):
This technique is similar to BoW, but it weights the word counts by their frequency in the corpus.
Words that frequently occur in a document but infrequently in the corpus are considered more important and are given higher weights.
To explain this in simple words: a word that appears in all sentences will not give as much information as a word that appears in only a few sentences, right?
For example, "the" appears in all sentences. Does it give any information? Most likely not.
At the same time, "cat", "dog", and "bird" each appear in only one sentence, so the probability of them providing useful information is much, much higher than for "the".
The formula for Term Frequency-Inverse Document Frequency (TF-IDF) is:
TF-IDF(w,d) = TF(w,d) x IDF(w)
where:
- TF(w,d) is the frequency of the word w in document d.
- IDF(w) is the inverse document frequency of the word w, calculated as:
- IDF(w) = log(N / n)
- where N is the total number of documents in the corpus and n is the number of documents in the corpus that contain the word w.
So if a word is present in all documents, its IDF value becomes 0 (since log(N/N) = log(1) = 0). In short, the word is given zero importance.
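Here is a minimal sketch of TF-IDF on the same toy corpus, again using scikit-learn; note that scikit-learn applies a smoothed variant of the IDF formula above, so the exact weights will differ slightly from a hand calculation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# IDF weights: "the" (present in every document) gets the lowest weight,
# while rare words like "cat", "dog", and "bird" get the highest.
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_.round(3))))
print(tfidf_matrix.toarray().round(3))      # one row of TF-IDF weights per document
```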
Word Embeddings:
This technique represents words as dense vectors in a continuous vector space (much lower-dimensional than the vocabulary size), where the distance between vectors reflects the semantic similarity between the corresponding words. Popular word embedding models include Word2Vec and GloVe.
Word embeddings are learned through a process called training, which involves mapping words from a corpus to dense, low-dimensional vectors. One common method for learning word embeddings is the skip-gram model with negative sampling (SGNS).
Example word representations are as follows:
"dog" = [0.1548, 0.4848, …, 1.864]
"cat" = [0.8785, 0.8974, …, 2.794]
The most important feature of word embeddings is that similar words in a semantic sense have a smaller distance (either Euclidean, cosine, or other) between them than words that have no semantic relationship. For example, words like "cat" and "dog" should be closer together than the words "cat" and "street" or "dog" and "hat".
Such word embeddings (e.g., Word2Vec) are created using a shallow neural network with one input layer, one hidden layer, and one output layer.
In practice, we normally use pre-trained word embeddings as they are trained on a large corpus of data from Wikipedia or elsewhere.
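As a rough illustration, here is how one could train a tiny skip-gram Word2Vec model with negative sampling using gensim; the corpus is far too small to learn meaningful similarities, and the parameters (vector_size, window, epochs, etc.) are arbitrary choices for this sketch.

```python
from gensim.models import Word2Vec

# Toy corpus: each document tokenized into lowercase words.
sentences = [
    ["the", "cat", "in", "the", "hat"],
    ["the", "dog", "is", "on", "the", "street"],
    ["the", "bird", "is", "in", "the", "cage"],
]

# sg=1 selects the skip-gram architecture; negative=5 enables negative sampling (SGNS).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5, epochs=100)

print(model.wv["cat"][:5])                # first few dimensions of the "cat" vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two word vectors
```

In practice, as noted above, one would load pre-trained vectors rather than train on such a toy corpus.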
Character Embeddings:
This technique represents words as sequences of characters and learns a vector representation for each character. These character vectors are then combined to form a vector representation of the word.
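A very simplified sketch of the idea is shown below; the averaging step is just one possible way to combine character vectors (real models typically run a CNN or RNN over the character sequence), and the random initialisation stands in for vectors that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
# One vector per character; randomly initialised here, learned in a real model.
char_vectors = {c: rng.normal(size=16) for c in "abcdefghijklmnopqrstuvwxyz"}

def word_vector(word: str) -> np.ndarray:
    """Combine character vectors into a word vector (here: a simple average)."""
    return np.mean([char_vectors[c] for c in word.lower() if c in char_vectors], axis=0)

print(word_vector("cat").shape)  # (16,) -- a 16-dimensional vector built from characters
```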
Contextualized Word Embeddings:
These are word embeddings that take into account the context in which a word appears. Popular models include BERT, GPT-2, and XLNet.
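For example, with the Hugging Face transformers library, one can obtain contextual vectors from a pretrained BERT model; unlike Word2Vec, the vector produced for each token depends on the whole sentence it appears in. The model name below is just one commonly used checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat in the hat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token (including the special [CLS] and [SEP] tokens).
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 7, 768])
```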
Subword Embeddings:
These embeddings represent words as a sequence of subwords, which can capture morphological information and handle out-of-vocabulary words. Popular models include FastText and Byte Pair Encoding (BPE).
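Here is a small sketch with gensim's FastText implementation: because the model stores vectors for character n-grams, it can compose a vector even for a word it never saw during training. The toy corpus and parameters are again only for illustration.

```python
from gensim.models import FastText

sentences = [
    ["the", "cat", "in", "the", "hat"],
    ["the", "dog", "is", "on", "the", "street"],
    ["the", "bird", "is", "in", "the", "cage"],
]

model = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# "cats" never appears in the corpus, but FastText builds a vector for it
# from its character n-grams (e.g. "<ca", "cat", "ats", "ts>").
print(model.wv["cats"][:5])
```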
These vectorization techniques have different strengths and weaknesses and are suitable for different NLP tasks. Itβs important to choose the right technique for your specific use case.
For example, if you have a very small corpus, TF-IDF might be the better choice; if you have a huge corpus, word embeddings may be a better way to go.
Again, all the techniques have dedicated libraries, and we won't have to code anything from scratch (most of the time).
Hope you enjoyed this post; I have tried to explain it in a very simple manner.
I will be sharing more NLP concepts in the upcoming parts of this series.
If you liked this post, do consider following me for similar content, and share your thoughts.
All the best for your journey. Onwards and upwards, people…