


Unlocking the Potential of Text: A Closer Look at Pre-Embedding Text Cleaning Methods


Author(s): Shivamshinde

Originally published on Towards AI.

This article discusses the cleaning techniques that are essential for getting the most out of textual data before it is embedded.


To demonstrate the text cleaning methods, we will use the text dataset named ‘metamorphosis’ from Kaggle.

Let’s start with importing the required Python libraries for the cleaning process.

import nltk, re, string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

Now let’s load the dataset.

file_directory = 'link-to-the-dataset-local-directory'

# read the whole file into a single string
with open(file_directory, 'rt', encoding='utf-8') as file:
    text = file.read()

Note that for the above code cell to work, you need to replace the placeholder with the local path to the data file.

Splitting the text data into words by whitespace

words = text.split()
print(words[:120])

Here we can see that intra-word punctuation is preserved (e.g., ‘armour-like’ and ‘wasn’t’), which is nice. However, end-of-sentence punctuation stays attached to the last word of each sentence (e.g., ‘thought.’), which is not ideal.

So, this time, let’s try splitting the text on non-word characters.

words = re.split(r'\W+', text)
print(words[:120])

Here we see that words like ‘thought.’ have become ‘thought’. The problem, though, is that contractions like ‘wasn’t’ are split into two tokens, ‘wasn’ and ‘t’. We need to fix that.

In Python, string.punctuation gives us the full set of ASCII punctuation characters at once. We will use it to remove punctuation from our text.

print(string.punctuation)
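
This prints the 32 ASCII punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~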

So now we will split the text on whitespace and then strip all of these punctuation characters from each word. We build a regular expression character class from string.punctuation (using re.escape so that regex metacharacters are treated literally) and substitute every match with the empty string.

words = text.split()
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punc.sub('', word) for word in words]
print(stripped[:120])

Here we can see that tokens like ‘thought.’ are gone, while contractions such as ‘wasn’t’ survive as a single token instead of being split in two.

Sometimes the text also contains characters that are not printable, and we need to filter those out too. Python’s string.printable gives us the set of characters that can be printed, so we will remove every character that is not in this set.

re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', word) for word in stripped]
print(result[:120])

Now let’s convert all the words to lowercase. This reduces the size of our vocabulary, but it has a drawback: after this step, two words such as ‘Apple’ (the company) and ‘apple’ (the fruit) are treated as the same entity.

result = [word.lower() for word in result]
print(result[:120])

Also, single-character words rarely contribute anything to NLP tasks, so we will remove those too.

result = [word for word in result if len(word) > 1]
print(result[:120])

In NLP, very frequent words such as ‘is’, ‘as’, and ‘the’ contribute little to model training. Such words are known as stop words, and removing them during text cleaning is recommended.

nltk.download('stopwords')  # download the stop word list if it is not already present

stop_words = set(stopwords.words('english'))
print(stop_words)

result = [word for word in result if word not in stop_words]
print(result[:110])

Now, at this point, we will reduce words with the same intent to a single form. For example, ‘running’, ‘run’, and ‘runs’ will all be reduced to ‘run’, since all three carry the same meaning for the model during training. This can be done using the PorterStemmer class in the nltk library.

ps = PorterStemmer()

result = [ps.stem(word) for word in result]
print(result[:110])

Stemmed words may or may not be valid words in the language. If you need the output to consist of real words, use the lemmatization technique instead of stemming; it guarantees that words remain meaningful after the transformation.
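
As a quick illustration, here is a minimal sketch of lemmatization using nltk’s WordNetLemmatizer (it needs the WordNet corpus downloaded, and the part-of-speech hint you pass affects the result):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer relies on the WordNet corpus

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('studies', pos='n'))  # study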

Now let’s remove the tokens that are not made up of letters alone.

result = [word for word in result if word.isalpha()]
print(result[:110])

At this stage, the textual data looks clean enough to be used with word embedding techniques. Note, though, that some special kinds of data (for example, text containing HTML code) may need additional steps.
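
As one example of such an extra step, here is a minimal sketch that strips HTML tags with a regular expression before applying the cleaning above (the sample string is hypothetical; for real pages, a proper parser such as BeautifulSoup is more robust):

import re

html = '<p>One morning, <b>Gregor Samsa</b> woke from troubled dreams.</p>'  # hypothetical sample
plain = re.sub(r'<[^>]+>', '', html)  # drop anything between angle brackets
print(plain)  # One morning, Gregor Samsa woke from troubled dreams.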

I hope you liked the article. If you have any thoughts on it, please let me know; any constructive feedback is highly appreciated.

Connect with me on LinkedIn.

Mail me at [email protected]

Have a great day!


