

Demystifying AI for everyone: Part 2-NLP Vectorization

Last Updated on July 17, 2023 by Editorial Team

Author(s): Himanshu Joshi

Originally published on Towards AI.

In the age of ChatGPT, let’s learn the basics

I started this Demystifying AI for everyone series to explain the basic building blocks of NLP in layman's language, so that in the age of ChatGPT, everyone understands what goes into such a complex language model.

In this post, I will share what vectorization means in NLP and cover the vectorization techniques that are most commonly used.

Photo by Amador Loureiro on Unsplash

If you haven't read the first part on NLP Basics, do give it a read.

Part 1-NLP Basics

Vectorization, in simple words, is nothing but converting words into numeric vectors so that computers can understand them.

Vectorization techniques are used in natural language processing (NLP) to convert textual data into numerical representations that can be processed by machine learning algorithms.

Here are some common vectorization techniques used in NLP:

Bag of Words (BoW):

This technique represents text as a matrix of word counts, where each row corresponds to a document, and each column corresponds to a word in the vocabulary. The values in the matrix represent the frequency of each word in each document.

BoW counts the frequency of each word in a document and creates a vector where each element represents the count of a particular word in the document.

For example, consider the three sentences we are working with:

  1. The cat in the hat
  2. The dog is on the street
  3. The bird is in the cage

Before we move ahead, let's define some vocabulary

Every sentence is a document, and all the sentences together make up the corpus:

document = individual sentence

corpus = all the documents put together

For a simplified vocabulary of ["the", "cat", "in", "hat", "dog", "bird"], the sentence (document) "The cat in the hat" would be represented as [2, 1, 1, 1, 0, 0], since "the" appears twice and "dog" and "bird" do not appear at all.

In the simplest (binary) variant of BoW, a word that is present in a document is assigned 1 and a word that is not present is assigned 0; the standard variant, as above, records the actual counts.
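
To see this in practice, here is a minimal sketch of BoW using scikit-learn's CountVectorizer (one common choice of library; the library and parameters are my assumption, not something prescribed by the technique itself). It builds the vocabulary from all three documents, so the vectors are longer than the simplified six-word example above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Our tiny corpus: each sentence is one document
corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

# CountVectorizer lowercases the text, builds the vocabulary,
# and counts how often each word appears in each document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow_matrix.toarray())                # one row of counts per document
```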

Term Frequency-Inverse Document Frequency (TF-IDF):

This technique is similar to BoW, but it weights the word counts by how rare each word is across the corpus.

Words that frequently occur in a document but infrequently in the corpus are considered more important and are given higher weights.

To explain this in simple words: a word that appears in every sentence will not give us as much information as a word that appears in only a few sentences, right?

For example: “the” appears in all sentences. Does it give any information? Most likely not.

At the same time, "cat", "dog", and "bird" each appear in only one sentence, so the probability of them providing useful information is much higher than that of "the".

The formula for Term Frequency-Inverse Document Frequency (TF-IDF) is:

TF-IDF(w, d) = TF(w, d) × IDF(w)

where:

  • TF(w, d) is the frequency of the word w in document d.
  • IDF(w) is the inverse document frequency of the word w, calculated as IDF(w) = log(N / n), where N is the total number of documents in the corpus and n is the number of documents that contain the word w.

So if a word is present in all documents, N / n = 1 and its IDF becomes log(1) = 0. In short, the word is given zero importance.
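
Here is a similar sketch with scikit-learn's TfidfVectorizer (again, the library is my choice for illustration; note that scikit-learn uses a smoothed version of the IDF formula, so common words like "the" get a low weight rather than exactly zero).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

# TfidfVectorizer combines the word counts (TF) with the IDF weighting
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf_matrix.toarray().round(2))     # one row of TF-IDF weights per document
```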

Word Embeddings:

This technique represents words as dense vectors in a continuous vector space (typically a few hundred dimensions), where the distance between vectors reflects the semantic similarity between the corresponding words. Popular word embedding models include Word2Vec and GloVe.

Word embeddings are learned through a process called training, which involves mapping words from a corpus to dense, low-dimensional vectors. One common method for learning word embeddings is the skip-gram model with negative sampling (SGNS), used in Word2Vec.

Example word representations look as follows:

“dog” = [0.1548, 0.4848, …, 1.864]

“cat” = [0.8785, 0.8974, …, 2.794]

The most important feature of word embeddings is that similar words in a semantic sense have a smaller distance (either Euclidean, cosine, or other) between them than words that have no semantic relationship. For example, words like “cat” and “dog” should be closer together than the words “cat” and “street” or “dog” and “hat”.

In Word2Vec, for example, the embeddings are learned using a shallow neural network with one input layer, one hidden layer, and one output layer.

In practice, we normally use pre-trained word embeddings as they are trained on a large corpus of data from Wikipedia or elsewhere.
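
As a rough sketch of how pre-trained embeddings are used, here is an example with the gensim library and the small "glove-wiki-gigaword-50" GloVe vectors (my choice of model for illustration; the first call downloads the vectors, so it needs an internet connection).

```python
import gensim.downloader as api

# Load small pre-trained GloVe vectors (50 dimensions per word)
glove = api.load("glove-wiki-gigaword-50")

print(glove["cat"][:5])                   # first few numbers of the "cat" vector
print(glove.similarity("cat", "dog"))     # semantically related words -> higher score
print(glove.similarity("cat", "street"))  # unrelated words -> lower score
```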

Character Embeddings:

This technique represents words as sequences of characters and learns a vector representation for each character. These character vectors are then combined to form a vector representation of the word.
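
To make the idea concrete, here is a bare-bones sketch using a PyTorch embedding layer over character indices (everything here, from the tiny alphabet to the averaging step, is a simplifying assumption; in a real model these vectors would be learned as part of a larger network).

```python
import torch
import torch.nn as nn

# Map each character to an index (a real system would cover a full alphabet)
chars = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {c: i for i, c in enumerate(chars)}

# One trainable vector per character
char_embedding = nn.Embedding(num_embeddings=len(chars), embedding_dim=8)

word = "cat"
indices = torch.tensor([char_to_idx[c] for c in word])
char_vectors = char_embedding(indices)  # one 8-dimensional vector per character
word_vector = char_vectors.mean(dim=0)  # combine (here, average) into a word vector
print(word_vector)
```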

Contextualized Word Embeddings:

These are word embeddings that take into account the context in which a word appears. Popular models include BERT, GPT-2, and XLNet.
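
For a taste of what this looks like in code, here is a minimal sketch using the Hugging Face transformers library with a pre-trained BERT model (assuming transformers and PyTorch are installed; the sentences are just illustrative).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word ("bank") gets a different vector depending on its context
sentences = ["I deposited cash at the bank", "We sat on the bank of the river"]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state holds one contextual vector per token
        print(text, "->", outputs.last_hidden_state.shape)
```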

Subword Embeddings:

These embeddings represent words as a sequence of subwords, which can capture morphological information and handle out-of-vocabulary words. Popular models include FastText and Byte Pair Encoding (BPE).
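
Here is a small sketch with gensim's FastText implementation (the toy corpus is far too small to learn meaningful vectors, but it shows that FastText can still build a vector for a word it never saw, from its character n-grams).

```python
from gensim.models import FastText

# Tokenized toy corpus (lowercased, split on spaces)
corpus = [
    "the cat in the hat".split(),
    "the dog is on the street".split(),
    "the bird is in the cage".split(),
]

# Train a tiny FastText model; word vectors are built from character n-grams
model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=20)

print(model.wv["cat"][:5])   # vector for a word seen during training
print(model.wv["cats"][:5])  # out-of-vocabulary word, built from its subwords
```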

These vectorization techniques have different strengths and weaknesses and are suitable for different NLP tasks. It’s important to choose the right technique for your specific use case.

For example, if you have a very small corpus, TF-IDF might be the better choice; if you have a huge corpus, word embeddings are usually the better way to go.

Again, all these techniques have dedicated libraries, so we won't have to code anything from scratch (most of the time).

Hope you enjoyed this post; I have tried to explain it in a very simple manner.

I will be sharing more NLP concepts in the upcoming parts of this series.

If you liked this post, do consider following me for similar content, and share your thoughts.

All the best for your journey. Onwards and Upwards people…


