
Last Updated on January 6, 2023 by Editorial Team

Author(s): Francesco Fumagalli

Natural Language Processing

Natural Language Clustering — Part 1

Help chatbots deal with FAQs, with Python code for Tokenization, GloVe, and TF-IDF

Classifying things comes quite naturally to us: our books, movies, and music all have genres; the things we study are split between different subjects; even the food we eat belongs to different cuisines!

In recent years we’ve been able to develop better and better algorithms to classify text: models like BERT-ITPT-FiT (BERT + withIn-Task Pre-Training + Fine-Tuning) or XLNet seem to be the reigning champions in this category, at least on the 29 benchmark datasets available on PapersWithCode.

image by author

But what if we don’t know the available categories for the texts we want to analyze? Take, for example, a corpus of conversations, or a collection of books or articles that all belong to different specializations within the same subject. Labels aren’t always as clear-cut as spam / not spam: we may have no idea how many or what kind of labels to expect, ordinary pre-trained classification methods may lack the in-depth domain knowledge required to tell the texts apart, and there may not be enough material, time, or computational power available to fine-tune a Transformer model.

Perhaps one of the most obvious examples is categorizing incoming messages into FAQs: not every message can be answered via an FAQ, but we’d like to associate as many as possible with the right answer, to minimize the number we leave unanswered or have to address manually, and to dynamically improve our FAQs by adding new questions as they become relevant.

Let’s see how!

From words to numerical values: Tokenization and Embedding

The first thing we have to do, as always when dealing with natural language, is transform the text into something our algorithms can understand. Paragraphs and sentences are split into their basic components, words, in a process called tokenization. But words themselves can’t be fed to a computational model, since mathematical operations can’t be applied to them, which explains the need to map them onto vectors of real numbers: basically, to transform them into sequences of numbers that represent them in a mathematical space while preserving the relationships between them (a process known as word embedding, a term borrowed from mathematics, where it denotes an injective, structure-preserving map).

It sounds complex, but there are simple ways to do this, such as the one shown below: starting from three sentences (on the left), their tokens are extracted and a very simple embedding is performed (on the right) by counting the occurrences of each token (word) in the original texts.

Tokenization and a simple Count Vector Embedding from three short texts.

In the example above, the word “startup” is embedded as [1,0,1] because it appears once in the first and third texts but not in the second, whereas the word “in” is embedded as [0,2,0] because it appears only in the second sentence, twice.

Note that it’s not a very practical example: given the small number of inputs, different words end up having the same embedding (for example, “specialized” and “in” both appear only in the second text, so they share the same embedding [0,1,0]), but we hope it helps you get a better understanding of how the process works.

Also note that the tokens are all lowercase and the punctuation has been removed: that’s common practice, although it depends on the kind of text we’re dealing with.

Another common practice, not shown in the example above, is the removal of stopwords: words so common that they aren’t relevant for our purpose and are very likely to appear many times in most texts we come across (such as “a”, “the” or “is”).
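As a quick illustration (a small addition, not part of the original example), nltk’s built-in stopword lists can be used to filter a token list:

# !pip install nltk
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # download the stopword lists once
tokens = ['digitiamo', 'is', 'a', 'startup', 'from', 'italy']
stop_words = set(stopwords.words('english'))
filtered = [t for t in tokens if t not in stop_words]  # drops 'is', 'a' and 'from'
print(filtered)
# OUTPUT:
# ['digitiamo', 'startup', 'italy']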

The count-vector approach, however, provides the advantage of embedding the documents themselves at the same time as the words: reading the table vertically, Text 1 can be read as [1,1,1,1,1,1,0,0,0,0]. This makes it easy to compare different sources by introducing a distance metric, as sketched below.
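The count-vector table above can be reproduced with scikit-learn’s CountVectorizer, and cosine similarity is one such distance-related metric. The snippet below is a minimal sketch using the three example sentences from the figure (note that the default tokenizer drops single-character tokens such as “a”):

# !pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = [
    'Digitiamo is a Startup from Italy',
    'Digitiamo is specialized in NLP in Italy',
    'Italy is not a Startup'
]
vectorizer = CountVectorizer()           # lowercases and tokenizes by default
X = vectorizer.fit_transform(corpus)     # one count vector per document
print(list(vectorizer.get_feature_names_out()))  # get_feature_names() on older sklearn versions
print(X.toarray())                       # the table above, read as document vectors
print(cosine_similarity(X))              # 3x3 matrix of pairwise document similarities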

Tokenization Code Example in Python using nltk:

# !pip install nltk
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models required by word_tokenize
text = "Digitiamo is a Startup from Italy"
tokenized_text = word_tokenize(text)
print(tokenized_text)
# OUTPUT:
# ['Digitiamo', 'is', 'a', 'Startup', 'from', 'Italy']

Embedding Code Example in Python using gensim’s Word2Vec

# !pip install --upgrade gensim
from gensim.models import Word2Vec
model = Word2Vec([tokenized_text], min_count=1)
"""
min_count (int, optional) – Ignores all words with total frequency lower than this.
Word2Vec is optimized to work with multiple texts at the same time, in the form of a list of texts. When working with a single text, it should be put into a list, which explains the square brackets around tokenized_text.
"""
print(list(model.wv.index_to_key))  # model.wv.vocab in gensim versions before 4.0
# OUTPUT:
# ['Digitiamo', 'is', 'a', 'Startup', 'from', 'Italy']
Word2Vec’s default parameters produce a 100-dimensional embedding for each word, which can be inspected as shown below.
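For instance (a small addition to the snippet above), the vector learned for a single word can be printed directly:

print(model.wv['Digitiamo'])        # the 100-dimensional vector for this word
print(model.wv['Digitiamo'].shape)  # (100,)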

Commonly used Embeddings: TF-IDF and GloVe

  • TF-IDF stands for Term Frequency–Inverse Document Frequency: it weighs each term by how often it appears in a document relative to how many other documents in the corpus also contain it. It can be very useful for embedding texts within a specialized domain for the purpose of clustering: words that appear in every document, no matter how infrequent they are in everyday usage, are given very little weight, whereas words that only appear in some documents are given more weight, making it easier to recognize possible clusters (a small worked example follows this list).
    For example, if we were trying to embed a bunch of papers about endocrinology, the word “endocrinology” itself would likely be mentioned multiple times in each paper, making its presence not very helpful for the purpose of distinguishing them.
  • GloVe (Global Vectors for word representation) is an unsupervised algorithm that links words based on how often they occur together, trying to map the underlying meaning. You can see some examples of relationships between words obtained through GloVe in the image below: “man” is approximately as distant from “woman” as “king” is from “queen”, moving in the same direction, since the two pairs differ by the same attribute (gender). The same happens for comparatives and superlatives in the fourth panel: the vector that takes you from “clear” to “clearer”, if applied to “soft”, brings you to “softer”.
    Depending on the size of your corpus, the Mittens library can be a good complement to GloVe: according to its GitHub page, it’s “useful for domains that require specialized representations but lack sufficient data to train them from scratch. Mittens starts with the general-purpose pretrained representations and tunes them to a specialized domain”. It’s also compatible with NumPy and TensorFlow.
(image from the official GloVe source: https://nlp.stanford.edu/projects/glove/ )
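As promised, here is a small worked example of the idea behind TF-IDF weighting, using the classic textbook definition idf(t) = log(N / df(t)) and the same three sentences that appear in the scikit-learn example later in this article (scikit-learn’s TfidfVectorizer uses a smoothed variant of this formula, so its numbers will differ slightly):

import math
corpus = [
    'digitiamo is a startup from italy',
    'digitiamo is specialized in nlp in italy',
    'italy is not a startup'
]
docs = [set(doc.split()) for doc in corpus]  # unique words per document
N = len(docs)

def idf(term):
    df = sum(term in doc for doc in docs)    # number of documents containing the term
    return math.log(N / df)

for term in ['italy', 'startup', 'nlp']:
    print(term, round(idf(term), 3))
# OUTPUT:
# italy 0.0      appears in every document, so it carries no discriminating weight
# startup 0.405
# nlp 1.099      appears in a single document, so it gets the highest weight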

Excellent! Now that we’ve transformed our words into numerical vectors and mapped them onto a space, if we’ve used an embedding that only works for words and not for full texts, it’s time to go back to sentences. Remember our original goal? We don’t just want to map words, but to cluster the texts or sentences they’re in.

There are many ways to do this, and some may work better than others depending on the context. Perhaps the easiest one, which is also quite effective for short texts, is to average the embeddings of the words in a text to generate the embedding of the whole text.

For longer material, another approach is to extract the keywords first and then average only their embeddings, to make sure the average isn’t diluted by a myriad of common words; that, however, requires a keyword extraction algorithm, which may not always be available.
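A middle ground, not covered in the original example but worth sketching, is to weight each word’s vector by its TF-IDF score instead of extracting keywords explicitly, so that common words contribute less to the average. The helper below is hypothetical and assumes a trained gensim Word2Vec model and a fitted scikit-learn TfidfVectorizer, both of which appear in the code examples in this article:

import numpy as np

def tfidf_weighted_embedding(text, w2v_model, vectorizer):
    # idf weight for every word in the vectorizer's vocabulary
    idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
    vectors, weights = [], []
    for word in text.lower().split():
        if word in w2v_model.wv and word in idf:
            vectors.append(w2v_model.wv[word] * idf[word])
            weights.append(idf[word])
    if not vectors:
        return None
    return np.sum(vectors, axis=0) / np.sum(weights)  # idf-weighted average vector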

TF-IDF Code Example in Python using sklearn

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'Digitiamo is a Startup from Italy',
    'Digitiamo is specialized in NLP in Italy',
    'Italy is not a Startup'
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(list(vectorizer.get_feature_names_out()))  # get_feature_names() on older sklearn versions
print(X.shape)
# OUTPUT:
# ['digitiamo', 'from', 'in', 'is', 'italy', 'nlp', 'not', 'specialized', 'startup']
# (3, 9)
# get_feature_names_out returns the vocabulary, whereas X.shape is 3x9: 3 documents,
# 9 distinct words. The word 'a' is missing because the default tokenizer ignores
# single-character tokens.
Printing X itself shows the sparse matrix entries as (document, word_index) pairs together with their TF-IDF weights. The word indices refer to the vocabulary list above, so the entry (0, 4) means the fifth word of the vocabulary (4 + 1, since we count from zero), “italy”, in the first text; (0, 1) is “from”, (0, 8) is “startup”, and so on.
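If the sparse notation is hard to read, the same matrix can also be inspected in dense form (a small addition to the snippet above):

import pandas as pd
print(pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).round(2))
# Each row is the TF-IDF vector of one document; within a row, ubiquitous words
# such as 'is' and 'italy' get lower weights than document-specific ones like 'nlp'.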

Using GloVe in Python (source)

Once you’ve downloaded a GloVe embedding from the official source, you can use it as shown below (credits to Karishma Malkan)

import numpy as np

def loadGloveModel(File):
    print("Loading Glove Model")
    gloveModel = {}
    with open(File, 'r', encoding='utf8') as f:   # GloVe files are plain UTF-8 text
        for line in f:
            splitLines = line.split()
            word = splitLines[0]                  # first token is the word itself
            wordEmbedding = np.array([float(value) for value in splitLines[1:]])
            gloveModel[word] = wordEmbedding
    print(len(gloveModel), " words loaded!")
    return gloveModel

You can then access the word vectors by simply using the gloveModel variable.

print(gloveModel['hello'])

Alternatively, you can load the file using Pandas as shown below (credit to Petter)

import pandas as pd
import csv
# glove_data_file is the path to the downloaded GloVe .txt file
words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

Then to get the vector for a word:

def vec(w):
    return words.loc[w].to_numpy()  # .as_matrix() in older pandas versions

And to find the closest word to a vector:

words_matrix = words.to_numpy()  # .as_matrix() in older pandas versions

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)  # squared Euclidean distance to every word
    i = np.argmin(delta)                 # index of the closest word
    return words.iloc[i].name

The file can also be opened using gensim as shown below (credits to Ben); this is perhaps the best method, as it’s consistent with the other gensim-based methods we’ve seen earlier in this article and allows you to use gensim’s word2vec methods (for example, similarity).

Use glove2word2vec to convert GloVe vectors in text format into the word2vec text format:

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")

Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
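With the vectors loaded into gensim, the word analogies mentioned earlier (king − man + woman ≈ queen) can be checked directly; a quick sketch, assuming the pre-trained GloVe vocabulary contains these words:

result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)
# Expected to return 'queen' (or a close relative) as the nearest word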

Averaging word embeddings to get text embeddings

import re
import nltk
from gensim.models import Word2Vec
text = 'Digitiamo is a Startup from Italy'
tokenized_text = nltk.word_tokenize(text)
model = Word2Vec([tokenized_text], min_count=1)
modelledText = []
for word in text.split():
    word = re.sub(r'[^\w\s]', '', word)   # remove punctuation
    modelledText.append(model.wv[word])   # model[word] in gensim versions before 4.0
embedText = sum(modelledText) / len(modelledText)
# The embeddings are vectors, so they can be added and divided
print(embedText)
The resulting (averaged) embedded vector for the whole sentence

Now that we’ve finally gotten the embedding of our texts, we can cluster them as if they were normal numerical data entries!

To find out more about some of the available clustering techniques, read the second part of this article, available soon! If you want to be notified when it gets published, subscribe to our newsletter below.

By leveraging and refining the techniques explained in this article, we’ve developed AiKnowYou, a product that analyzes and improves the performance of chatbots over time. Feel free to contact us if you’d like to get a better understanding of how it works, or if you’d like to improve the performance of your own chatbots today.

Thank you for reading!


About Digitiamo

Digitiamo is a start-up from Italy focused on using AI to help companies manage and leverage their knowledge. To find out more, visit us.

About the Authors

Fabio Chiusano is the Head of Data Science at Digitiamo; Francesco Fumagalli is an aspiring data scientist doing an R&D internship there.


Natural Language Clustering — Part 1 was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
