
Demystifying AI for everyone: Part 2-NLP Vectorization

Last Updated on July 17, 2023 by Editorial Team

Author(s): Himanshu Joshi

Originally published on Towards AI.

In the age of ChatGPT, let’s learn the basics

I have started this Demystifying AI for everyone series to explain the basic building blocks of NLP in layman's language, so that in the age of ChatGPT, everyone understands what goes into such a complex language model.

In this post, I will try to share what vectorization means in NLP and which vectorization techniques are most commonly used.

Photo by Amador Loureiro on Unsplash

If you haven't read the first part on NLP Basics, do give it a read.

Part 1-NLP Basics

Vectorization, in simple words, is nothing but converting words into vectors of numbers so that computers can understand them.

Vectorization techniques are used in natural language processing (NLP) to convert textual data into numerical representations that can be processed by machine learning algorithms.

Here are some common vectorization techniques used in NLP:

Bag of Words (BoW):

This technique represents text as a matrix of word counts, where each row corresponds to a document, and each column corresponds to a word in the vocabulary. The values in the matrix represent the frequency of each word in each document.

BoW counts the frequency of each word in a document and creates a vector where each element represents the count of a particular word in the document.

For example, consider the three sentences we are working with:

  1. The cat in the hat
  2. The dog is on the street
  3. The bird is in the cage

Before we move ahead, let's define some terminology:

every sentence is a document, and all the sentences together make up the corpus

document = individual sentence

corpus = all the documents put together

Using a simplified vocabulary of ["the", "cat", "in", "hat", "dog", "bird"], the sentence (document) "The cat in the hat" would be represented as [2, 1, 1, 1, 0, 0]: "the" appears twice, "cat", "in", and "hat" appear once each, and "dog" and "bird" do not appear.

In the simplest (binary) variant of BoW, a word present in a document is assigned 1 and a word not present is assigned 0; the count-based variant shown above uses each word's frequency instead.
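As a minimal sketch (my choice of library, not the author's), here is how the same three sentences could be turned into a BoW matrix with scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

# CountVectorizer lowercases the text, builds the vocabulary,
# and counts how often each word appears in each document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one row of word counts per document
```

Each row of the printed matrix is the BoW vector for one document, over the full vocabulary of the corpus.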

Term Frequency-Inverse Document Frequency (TF-IDF):

This technique is similar to BoW, but it weights the word counts by their frequency in the corpus.

Words that frequently occur in a document but infrequently in the corpus are considered more important and are given higher weights.

To put it simply, a word that appears in every sentence will not give you as much information as a word that appears in only a few sentences, right?

For example, "the" appears in all three sentences. Does it give any information? Most likely not.

At the same time, "cat", "dog", and "bird" each appear in only one sentence, so they are much more likely to carry useful information than "the".

The formula for Term Frequency-Inverse Document Frequency (TF-IDF) is:

TF-IDF(w,d) = TF(w,d) x IDF(w)

where:

  • TF(w,d) is the frequency of the word w in document d.
  • IDF(w) is the inverse document frequency of the word w, calculated as IDF(w) = log(N / n), where N is the total number of documents in the corpus and n is the number of documents that contain the word w.

So if a word is present in all the documents, its IDF value becomes 0 (since log(1) = 0). In short, such a word is given zero importance.
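Here is a quick sketch of TF-IDF on the same three sentences using scikit-learn's TfidfVectorizer (again my choice of library; note that scikit-learn uses a smoothed variant of the IDF formula above, so the exact numbers differ slightly):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat in the hat",
    "The dog is on the street",
    "The bird is in the cage",
]

# scikit-learn computes idf = ln((1 + N) / (1 + n)) + 1 by default,
# a smoothed version of the log(N / n) formula described above.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# "the" appears in every document, so it gets the lowest IDF weight.
for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{word}: {idf:.2f}")

print(X.toarray().round(2))  # one TF-IDF vector per document
```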

Word Embeddings:

This technique represents words as dense vectors in a continuous, relatively low-dimensional space, where the distance between vectors reflects the semantic similarity between the corresponding words. Popular word embedding models include Word2Vec and GloVe.

Word embeddings are learned through a process called training, which involves mapping words from a corpus to dense, low-dimensional vectors. One common method for learning word embeddings is the skip-gram model with negative sampling (SGNS).

Example word representations look as follows:

"dog" = [0.1548, 0.4848, …, 1.864]

"cat" = [0.8785, 0.8974, …, 2.794]

The most important feature of word embeddings is that similar words in a semantic sense have a smaller distance (either Euclidean, cosine, or other) between them than words that have no semantic relationship. For example, words like "cat" and "dog" should be closer together than the words "cat" and "street" or "dog" and "hat".

Classic word embeddings like Word2Vec are created using a shallow neural network with one input layer, one hidden layer, and one output layer.

In practice, we normally use pre-trained word embeddings as they are trained on a large corpus of data from Wikipedia or elsewhere.
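As a hedged sketch (library choice mine, not the author's), here is how one might train a tiny skip-gram Word2Vec model with gensim; on a toy corpus like this the vectors are illustrative only, which is exactly why pre-trained embeddings are usually preferred:

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "in", "the", "hat"],
    ["the", "dog", "is", "on", "the", "street"],
    ["the", "bird", "is", "in", "the", "cage"],
]

# sg=1 selects the skip-gram architecture mentioned above;
# min_count=1 keeps every word since the corpus is tiny.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["cat"][:5])                # first 5 dimensions of the "cat" vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two words
```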

Character Embeddings:

This technique represents words as sequences of characters and learns a vector representation for each character. These character vectors are then combined to form a vector representation of the word.
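To make the idea concrete, here is a toy sketch of my own (random vectors stand in for learned ones) showing how character vectors can be combined, here by simple averaging, into a word vector:

```python
import numpy as np

# A character "vocabulary" and one 8-dimensional vector per character.
# In a real model these vectors would be learned during training, not random.
chars = list("abcdefghijklmnopqrstuvwxyz")
char_to_idx = {c: i for i, c in enumerate(chars)}
rng = np.random.default_rng(42)
char_embeddings = rng.normal(size=(len(chars), 8))

def word_vector(word: str) -> np.ndarray:
    """Combine character vectors (here by averaging) into a word vector."""
    idxs = [char_to_idx[c] for c in word.lower() if c in char_to_idx]
    return char_embeddings[idxs].mean(axis=0)

print(word_vector("cat"))   # a vector for a known word
print(word_vector("catz"))  # even unseen or misspelled words get a vector
```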

Contextualized Word Embeddings:

These are word embeddings that take into account the context in which a word appears. Popular models include BERT, GPT-2, and XLNet.
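For illustration, a minimal sketch with the Hugging Face transformers library (my own addition; it requires transformers and torch installed and downloads pretrained weights on first use) shows that each token gets a vector that depends on its sentence:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word "bank" gets a different vector in each sentence,
# because the surrounding context differs.
sentences = ["The bank raised interest rates.", "They sat on the river bank."]
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(text, outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```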

Subword Embeddings:

These embeddings represent words as a sequence of subwords, which can capture morphological information and handle out-of-vocabulary words. Popular models include FastText and Byte Pair Encoding (BPE).
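As one hedged example (gensim's FastText, my choice of library), subword n-grams let the model produce a vector even for words it never saw during training:

```python
from gensim.models import FastText

corpus = [
    ["the", "cat", "in", "the", "hat"],
    ["the", "dog", "is", "on", "the", "street"],
    ["the", "bird", "is", "in", "the", "cage"],
]

# FastText builds word vectors from character n-grams (subwords),
# so out-of-vocabulary words can still be embedded.
model = FastText(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"][:5])   # an in-vocabulary word
print(model.wv["catz"][:5])  # an out-of-vocabulary word still gets a vector
```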

These vectorization techniques have different strengths and weaknesses and are suitable for different NLP tasks. It’s important to choose the right technique for your specific use case.

For example, if you have a very small corpus, TF-IDF might be the better choice; if you have a huge corpus, word embeddings are often the better way to go.

Again, all of these techniques have dedicated libraries, so we won't have to code anything from scratch (most of the time).

Hope you enjoyed this post; I have tried to explain it in a very simple manner.

I will be sharing more NLP concepts in the upcoming parts of this series.

If you liked this post, do consider following me for similar content, and share your thoughts.

All the best for your journey. Onwards and Upwards people…


Published via Towards AI
