Bag-of-Words(BOW)
Last Updated on January 27, 2022 by Editorial Team
Author(s): vivek
Originally published on Towards AI.
Natural Language Processing
In the previous blog, we discussed at length why text must be converted to vectors before machine learning algorithms can be applied to it, so that meaningful insights can be drawn from text data. Text is one of the major forms of data available today in many industries, such as healthcare, travel, and e-commerce.
So, it is highly desirable to process and analyze text data: in the healthcare industry, the insights can speed up or assist decision-making, and in the travel industry, passenger reviews can help make journeys more comfortable.
Bag-of-Words is a very simple strategy for converting text to a numeric vector, which lets machine learning algorithms work on text data.
The main objective of converting text to numeric vectors is that semantically similar (SS) texts should be mapped to vectors that are close to each other.
Before getting into the blog, let's have a look at the basic terminology used in NLP.
Corpus: the collection (often a file) containing all the documents.
Document: a single piece of text in the corpus, generally a review given by a user for a specific product or service.
BOW:
Let's discuss the concept with an example. A publisher asked readers to post a review after finishing the book. The readers posted their reviews, which help the publisher make necessary changes in the next edition. The following is the corpus containing around n documents (reviews) posted by readers. Let's have a look at some of the documents in the corpus. This is considered step one.
R1: This book is very informative and expensive.
R2: This book is not informative and expensive.
R3: This book is informative and costly.
R4: Book is informative and the book is expensive.
The second step in Bag-of-Words is to build a list of all the unique words (a dictionary), i.e., to construct a set of all the unique words.
From the above documents (reviews), the set of all the unique words, ignoring case so that "Book" and "book" count as the same word, is as follows:
this, book, is, very, informative, and, expensive, not, costly, the.
Let us assume we have d unique words across all the documents.
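The dictionary-building step can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the `tokenize` and `build_vocabulary` helpers are hypothetical names, and lowercasing plus splitting on alphabetic characters is an assumed tokenization rule. With that rule, the four sample reviews give d = 10, because R4 also contributes the word "the".

```python
import re

# The four sample reviews R1..R4 from the corpus above.
reviews = [
    "This book is very informative and expensive.",
    "This book is not informative and expensive.",
    "This book is informative and costly.",
    "Book is informative and the book is expensive.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Return the unique words of the corpus in first-seen order."""
    vocabulary = []
    seen = set()
    for document in documents:
        for word in tokenize(document):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

vocabulary = build_vocabulary(reviews)
d = len(vocabulary)  # number of unique words, i.e. the vector dimension

print(vocabulary)
print(d)
```

Keeping the words in first-seen order is a design choice for readability; any fixed order works, as long as the same order is used for every vector.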
After constructing the set of all the unique words, step three is the construction of the vectors with the help of the unique words present in the corpus.
V1, V2, V3, and V4 are the vectors for the reviews R1, R2, R3, and R4 respectively.
Constructing vector V1 for the review R1.
R1: This book is very informative and expensive.
Now we need to construct a vector V1 of size d, in which each unique word corresponds to a different dimension. V1 belongs to R^d (the real coordinate space of d dimensions), and each word corresponds to one index in the dictionary.
The vector V1 is d-dimensional, i.e., it has one cell for each of the d unique words in the corpus. The number of times a word appears in R1 is its count, which is filled in the corresponding cell of V1. In the above figure, "This" appears only once, so its count (the number of times the word occurs) is 1; the word "costly" is not present in R1, so the cell corresponding to it is filled with 0. The whole vector V1 is filled this way.
As d is typically very large, most of the cells of V1 are filled with zero, which makes V1 a sparse vector. A sparse vector is a vector in which most of the cells (elements) are zero.
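As a rough sketch of the vector-filling step, the snippet below counts word occurrences against a fixed dictionary. The word order in `vocabulary`, the `count_vector` and `to_sparse` helpers, and the tokenization rule are all assumptions made for illustration. The `to_sparse` helper shows one simple way to store only the non-zero cells.

```python
import re
from collections import Counter

# Assumed dictionary order for the four sample reviews (d = 10).
vocabulary = ["this", "book", "is", "very", "informative",
              "and", "expensive", "not", "costly", "the"]

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def count_vector(text, vocabulary):
    """Return a d-dimensional list of word counts, one cell per dictionary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def to_sparse(vector):
    """Keep only the non-zero cells as an {index: count} mapping."""
    return {index: value for index, value in enumerate(vector) if value != 0}

v1 = count_vector("This book is very informative and expensive.", vocabulary)
v4 = count_vector("Book is informative and the book is expensive.", vocabulary)

print(v1)             # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(v4)             # [0, 2, 2, 0, 1, 1, 1, 0, 0, 1]
print(to_sparse(v4))  # {1: 2, 2: 2, 4: 1, 5: 1, 6: 1, 9: 1}
```

In R4 the words "book" and "is" each occur twice, so their cells hold 2 rather than 1.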
This strategy of converting text data to a numeric vector is called Bag-of-Words.
Working of Bag-of-Words:
Let's take R1 and R2,
R1: This book is very informative and expensive.
R2: This book is not informative and expensive.
Length(V1-V2) = ||V1-V2||
||V1-V2|| is the norm (length) of the difference vector, i.e., the Euclidean distance between V1 and V2.
So, from the above Fig 2,
The two vectors differ in only two positions, those for the words "very" and "not". At each of these positions the absolute difference is |1 - 0| = 1, and at every other position it is 0.
||V1-V2|| = sqrt(1^2 + 1^2) = sqrt(2).
From the above, we can observe that the distance between the two vectors is sqrt(2), which is very small.
But the two reviews are opposite in meaning: both say the book is expensive, but one says it is very informative and the other says it is not informative.
So, we can say that Bag-of-Words doesn't perform well when a small change in the words used changes the meaning of the document.
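The distance calculation can be checked with a short sketch. The vectors below are written out by hand under an assumed dictionary order (this, book, is, very, informative, and, expensive, not, costly, the), and `euclidean_distance` is a hypothetical helper name. For comparison it also measures R1 against R3, which reads as closer in meaning to R1 than R2 does.

```python
import math

# Count vectors under the assumed dictionary order:
# this, book, is, very, informative, and, expensive, not, costly, the
v1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # R1: very informative and expensive
v2 = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0]  # R2: not informative and expensive
v3 = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]  # R3: informative and costly

def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d12 = euclidean_distance(v1, v2)
d13 = euclidean_distance(v1, v3)

print(d12)  # sqrt(2), about 1.414
print(d13)  # sqrt(3), about 1.732
```

Under these assumptions, R2 ends up nearer to R1 than R3 does, even though R2 contradicts R1 and R3 largely agrees with it.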
Binary Bag-of-Words
There is another strategy with a slight variation called Binary Bag-of-Words, also called Boolean Bag-of-Words. The only difference from Bag-of-Words is that, instead of counting the number of occurrences of a word, it marks one in the vector if the word occurs and zero if it does not. So, with binary BoW, the squared distance ||V1-V2||^2 is nothing but the number of words that occur in one document but not in the other (V1 and V2 for the above example), which is 2 here.
||V1-V2|| = sqrt(2).
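A minimal sketch of the binary variant follows. The dictionary order, the tokenization rule, and the `binary_vector` helper name are assumptions for illustration. The example also shows R4, where "book" occurs twice but is still marked with a single 1.

```python
import math
import re

# Assumed dictionary order for the four sample reviews.
vocabulary = ["this", "book", "is", "very", "informative",
              "and", "expensive", "not", "costly", "the"]

def binary_vector(text, vocabulary):
    """Mark 1 if the dictionary word occurs in the text, else 0."""
    present = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in present else 0 for word in vocabulary]

b1 = binary_vector("This book is very informative and expensive.", vocabulary)
b2 = binary_vector("This book is not informative and expensive.", vocabulary)
b4 = binary_vector("Book is informative and the book is expensive.", vocabulary)

# Number of dictionary words that occur in exactly one of R1 and R2.
differing_words = sum(x != y for x, y in zip(b1, b2))
distance = math.sqrt(differing_words)

print(b4)               # [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
print(differing_words)  # 2
print(distance)         # sqrt(2), about 1.414
```

Because every cell is 0 or 1, each differing position contributes exactly 1 to the squared distance, which is why the square root of the count gives the distance.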
Limitations of Bag-of-Words:
Although the BoW model is very simple to grasp and has been used with great success on documents, it has its own limitations.
1) Constructing the dictionary needs careful attention because of the large size of the corpora available in real-world scenarios.
2) Sparsity is very common because the vectors are d-dimensional (very large). Representing a small amount of information in large sparse vectors is a challenge for reasons such as space and time complexity during computation.
3) The model is based solely on the words: it ignores word order and the context of each word, so it cannot capture the meaning of the same word in another context.
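The third limitation is easy to see in a small sketch. The two sentences below are hypothetical examples, not taken from the corpus above, and `bag_of_words` is an assumed helper name. They say opposite things about the book, yet they contain exactly the same words, so their bags of words are identical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return the word counts of the text, ignoring case and word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical reviews: same words, different order, opposite meaning.
review_a = "The book is informative and not expensive."
review_b = "The book is expensive and not informative."

identical = bag_of_words(review_a) == bag_of_words(review_b)
print(identical)  # True
```

Any distance computed on these two BoW vectors would be zero, so a model that relies only on the vectors cannot tell the two reviews apart.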
Thank you. Happy learning, people.
Bag-of-Words(BOW) was originally published in Towards AI on Medium.