Unlocking the Potential of Text: A Closer Look at Pre-Embedding Text Cleaning Methods

Last Updated on August 1, 2023 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

This article walks through text cleaning techniques that help you get the most out of textual data before it is passed to a word embedding model.

For the demonstration of the text cleaning methods, we will use the text dataset named ‘metamorphosis’ from Kaggle.

Let’s start with importing the required Python libraries for the cleaning process.

import nltk, re, string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

Now let’s load the dataset.

file_directory = 'link-to-the-dataset-local-directory'
# The with-statement closes the file automatically, even if reading fails
with open(file_directory, 'rt', encoding='utf-8') as file:
    text = file.read()

Note that for the above code cell to work, you need to replace the placeholder with the local path of the data file.

Splitting the text data into words by whitespace

words = text.split()
print(words[:120])

Here we can see that punctuation inside words is preserved (e.g., ‘armour-like’ and ‘wasn’t’), which is good. However, end-of-sentence punctuation stays attached to the last word (e.g., ‘thought.’), which is not ideal.
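
To see this behaviour without loading the dataset, here is a minimal sketch on a made-up sample sentence (the sentence is an assumption for illustration, not a quote from the dataset):

```python
# Hypothetical sample sentence, used only to illustrate whitespace splitting
sample = "His armour-like back. It wasn't a dream, he thought."

tokens = sample.split()
print(tokens)
# ['His', 'armour-like', 'back.', 'It', "wasn't", 'a', 'dream,', 'he', 'thought.']
```

The hyphenated word and the contraction stay intact, while ‘back.’, ‘dream,’ and ‘thought.’ keep their trailing punctuation.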

So, this time let’s try splitting the text on runs of non-word characters (anything other than letters, digits, and underscores).

words = re.split(r'\W+', text)
print(words[:120])

Here we see that words like ‘thought.’ have been converted into ‘thought’. The problem is that contractions like ‘wasn’t’ are split into two tokens, ‘wasn’ and ‘t’. We need to fix that.
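
The same effect can be reproduced on a small, made-up sentence (an assumption for illustration). The sketch also shows a side effect of re.split: when the text ends with a non-word character, the result ends with an empty string. Using re.findall with the complementary pattern is one possible way to avoid that.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "It wasn't a dream, he thought."

split_tokens = re.split(r'\W+', sample)
print(split_tokens)
# ['It', 'wasn', 't', 'a', 'dream', 'he', 'thought', '']

# findall collects the word runs directly, so no empty string is produced
found_tokens = re.findall(r'\w+', sample)
print(found_tokens)
# ['It', 'wasn', 't', 'a', 'dream', 'he', 'thought']
```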

In Python, string.punctuation provides the ASCII punctuation characters as a single string. We will use it to remove punctuation from our text.

print(string.punctuation)

So now we will split the text by whitespace and then remove all punctuation characters from each word.

words = text.split()
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punc.sub('', word) for word in words]
print(stripped[:120])

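Here we can see that words like ‘thought.’ have become ‘thought’, and contractions such as ‘wasn’t’ are no longer split into two tokens. Note, however, that the hyphen and the straight (ASCII) apostrophe are part of string.punctuation, so ‘armour-like’ becomes ‘armourlike’ and a contraction written with a straight apostrophe loses it (‘wasnt’). A typographic apostrophe (’) is not in string.punctuation and survives this step.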
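
As an alternative to the regex substitution above, punctuation can also be removed with str.translate. This is a minimal sketch on a few made-up words (the word list is an assumption for illustration); it also shows how straight and typographic apostrophes are treated differently.

```python
import string

# Translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

# Hypothetical sample words; the last one uses a typographic apostrophe (U+2019)
sample_words = ['thought.', "wasn't", 'armour-like', 'dream,', 'wasn\u2019t']

stripped_sample = [word.translate(table) for word in sample_words]
print(stripped_sample)
# ['thought', 'wasnt', 'armourlike', 'dream', 'wasn’t']
```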

Sometimes the text also contains non-printable characters, and we need to filter those out too. Python’s string.printable holds the ASCII characters that are considered printable (digits, letters, punctuation, and whitespace), so we remove every character that is not in it. Note that this also drops non-ASCII characters such as accented letters and typographic quotes.

re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', word) for word in stripped]
print(result[:120])
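
To make the effect of this filter concrete, here is a small sketch on made-up words (the word list is an assumption for illustration). It shows that control characters are removed, and that non-ASCII characters are removed along with them because string.printable is ASCII-only.

```python
import re
import string

re_print = re.compile('[^%s]' % re.escape(string.printable))

# Hypothetical sample words: a NUL byte, a typographic apostrophe, an accented letter
sample_words = ['dream\x00', 'wasn\u2019t', 'caf\u00e9', 'thought']

printable_only = [re_print.sub('', word) for word in sample_words]
print(printable_only)
# ['dream', 'wasnt', 'caf', 'thought']
```

If the text contains many accented or non-Latin characters, this step may be too aggressive and is worth reconsidering.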

Now let’s convert all the words to lowercase. This reduces the vocabulary size, but it has a drawback: words such as ‘Apple’ (the company) and ‘apple’ (the fruit) will be treated as the same token.

result = [word.lower() for word in result]
print(result[:120])
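
The vocabulary reduction can be checked by counting distinct tokens before and after lowercasing. A minimal sketch on a made-up token list (an assumption for illustration):

```python
# Hypothetical tokens, used only to illustrate the vocabulary size change
sample_tokens = ['Apple', 'apple', 'The', 'the', 'Gregor']

vocab_before = set(sample_tokens)
vocab_after = {token.lower() for token in sample_tokens}

print(len(vocab_before), len(vocab_after))
# 5 3
```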

Single-character tokens rarely contribute to NLP tasks, so we will remove them as well.

result = [word for word in result if len(word) > 1]
print(result[:120])

In NLP, very frequent words such as ‘is’, ‘as’, and ‘the’ usually contribute little to model training. Such words are known as stop words, and removing them during text cleaning is commonly recommended.

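import nltk
from nltk.corpus import stopwords

# Download the stop word list if it is not already available locally
nltk.download('stopwords')

# Build the set once so that each membership test is fast
stop_words = set(stopwords.words('english'))
print(stop_words)

result = [word for word in result if word not in stop_words]
print(result[:110])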
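
The filtering logic itself does not depend on NLTK. The sketch below uses a tiny hand-picked stop word set (an assumption standing in for NLTK’s much longer English list) to show the idea, and builds the set once outside the list comprehension so that it is not recreated for every word.

```python
# Hypothetical, hand-picked stop words standing in for NLTK's English list
sample_stop_words = frozenset({'is', 'as', 'the', 'a', 'of'})

# Hypothetical tokens, already lowercased
sample_tokens = ['the', 'room', 'is', 'as', 'quiet', 'as', 'the', 'street']

filtered = [token for token in sample_tokens if token not in sample_stop_words]
print(filtered)
# ['room', 'quiet', 'street']
```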

At this point, we will reduce words that share the same root to a single form. For example, ‘running’, ‘run’, and ‘runs’ are all reduced to ‘run’, since for most models the three words carry essentially the same meaning. This is called stemming, and it can be performed with the PorterStemmer class from the nltk library.

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Replace every word with its stem
result = [ps.stem(word) for word in result]
print(result[:110])

Stemmed words are not always real words. If you need the output to consist of valid words, you can use lemmatization instead of stemming (for example, NLTK’s WordNetLemmatizer), which aims to map each word to its dictionary form.
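
A quick way to get a feel for the stemmer is to run it on a handful of words. The sketch below assumes nltk is installed; the word lists are made up for illustration. The first list reproduces the ‘run’ example, and the second prints a couple of stems so you can check for yourself whether they are still dictionary words.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# The three forms from the example above collapse to the same stem
stems = [ps.stem(word) for word in ['running', 'run', 'runs']]
print(stems)
# ['run', 'run', 'run']

# Hypothetical extra words: inspect whether the stems are real words
for word in ['studies', 'happily']:
    print(word, '->', ps.stem(word))
```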

Now let’s remove the tokens that are not made up of alphabetic characters only.

result = [word for word in result if word.isalpha()]
print(result[:110])
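
The steps covered so far can be collected into a single helper. This is only a sketch under stated assumptions: the function name clean_tokens, its parameters, and the sample sentence are hypothetical, and the stop words and stemmer are passed in by the caller so that the function itself needs only the standard library.

```python
import re
import string

RE_PUNC = re.compile('[%s]' % re.escape(string.punctuation))
RE_NONPRINT = re.compile('[^%s]' % re.escape(string.printable))

def clean_tokens(text, stop_words=frozenset(), stem=None):
    """Apply the cleaning steps in the same order as above and return a token list."""
    tokens = text.split()                                   # split on whitespace
    tokens = [RE_PUNC.sub('', t) for t in tokens]           # strip ASCII punctuation
    tokens = [RE_NONPRINT.sub('', t) for t in tokens]       # strip non-printable characters
    tokens = [t.lower() for t in tokens]                    # normalise case
    tokens = [t for t in tokens if len(t) > 1]              # drop single-character tokens
    tokens = [t for t in tokens if t not in stop_words]     # drop stop words
    if stem is not None:
        tokens = [stem(t) for t in tokens]                  # optional stemming
    return [t for t in tokens if t.isalpha()]               # keep alphabetic tokens only

# Hypothetical sample sentence and stop words, used only for illustration
sample = "One morning, Gregor Samsa woke from troubled dreams; it wasn't a dream."
cleaned = clean_tokens(sample, stop_words={'it', 'from'})
print(cleaned)
# ['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'wasnt', 'dream']
```

With NLTK available, the same function could be called with stop_words=set(stopwords.words('english')) and stem=PorterStemmer().stem.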

At this stage, the textual data seems decent enough to be used for the word embedding techniques. But also note that there might be some additional steps to this process for some special kinds of data (for example, HTML code).
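
For HTML input, one possible extra step is to strip the markup before tokenizing. The sketch below uses the standard library’s html.parser; the class name TextExtractor, the function strip_html, and the sample markup are hypothetical. It is a simplification: for instance, it would also keep the contents of script and style elements, so real pages may need more careful handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    """Return the text content of markup with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ' '.join(' '.join(parser.parts).split())

# Hypothetical sample markup, used only for illustration
html_sample = "<h1>Metamorphosis</h1><p>It wasn't a dream.</p>"
plain = strip_html(html_sample)
print(plain)
# Metamorphosis It wasn't a dream.
```

The resulting plain text can then go through the same cleaning steps as any other text.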

I hope you like the article. If you have any thoughts on the article then please let me know. Any constructive feedback is highly appreciated.

Connect with me on LinkedIn.

Mail me at shivamshinde92722@gmail.com

Have a great day!

Published via Towards AI