
Unlocking the Potential of Text: A Closer Look at Pre-Embedding Text Cleaning Methods

Last Updated on August 1, 2023 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

This article will discuss different cleaning techniques that are essential to obtain maximum performance from textual data.

Photo by Amador Loureiro on Unsplash

To demonstrate these text cleaning methods, we will use the 'Metamorphosis' text dataset from Kaggle.

Let’s start with importing the required Python libraries for the cleaning process.

import nltk, re, string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

Now let’s load the dataset.

file_directory = 'link-to-the-dataset-local-directory'
with open(file_directory, 'rt', encoding='utf-8') as file:
    text = file.read()

Note that for the above code cell to work, you need to replace the placeholder with the local path of the data file.

Splitting the text data into words by whitespace

words = text.split()
print(words[:120])

Here we see that intra-word punctuation is preserved (e.g., armour-like and wasn't), which is nice. We can also see that end-of-sentence punctuation is kept with the last word (e.g., thought.), which is not great.

So, this time let’s try splitting the data using non-word characters.

words = re.split(r'\W+', text)
print(words[:120])

Here we see that words like 'thought.' have been converted into 'thought'. However, contractions like 'wasn't' are now split into two tokens, 'wasn' and 't'. We need to fix that.

In Python, string.punctuation gives us all the common ASCII punctuation characters at once. We will use it to remove punctuation from our text.

print(string.punctuation)

So now we will split the text on whitespace and then strip every punctuation character from each word.

words = text.split()
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punc.sub('', word) for word in words]
print(stripped[:120])

Here we can see that words like 'thought.' have lost their trailing punctuation while contractions such as 'wasn't' stay as single words. (Note that if the text uses plain ASCII apostrophes, the apostrophe itself is also stripped, leaving 'wasnt'.)

Sometimes the text also contains characters that are not printable. We need to filter those out too. Python's string.printable gives us the set of characters that can be printed, so we will remove any character not in that set.

re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', word) for word in stripped]
print(result[:120])

Now let's lowercase all the words. This reduces the vocabulary size, but it has a downside: after this step, 'Apple' the company and 'apple' the fruit are treated as the same token.

result = [word.lower() for word in result]
print(result[:120])

Also, single-character words rarely contribute to NLP tasks, so we will remove those too.

result = [word for word in result if len(word) > 1]
print(result[:120])

In NLP, very frequent words such as 'is', 'as', and 'the' contribute little to model training. These are known as stop words, and removing them is a standard text-cleaning step.

from nltk.corpus import stopwords

# nltk.download('stopwords')  # uncomment on first run
stop_words = set(stopwords.words('english'))
print(stop_words)
result = [word for word in result if word not in stop_words]
print(result[:110])

Now, at this point, we will reduce words with the same root to a single form. For example, we will reduce 'running', 'run', and 'runs' to 'run', since all three convey the same meaning to the model during training. This can be done with the PorterStemmer class in the nltk library.

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

result = [ps.stem(word) for word in result]
print(result[:110])

Stemmed words may or may not be real words. If you need the output to consist of valid dictionary words, use a lemmatization technique instead of stemming; lemmatization guarantees that the transformed words have meaning.

Now let's remove the tokens that are not made of letters alone.

result = [word for word in result if word.isalpha()]
print(result[:110])
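The individual steps above can be collected into a single helper. Here is a minimal sketch using only the standard library; stemming is omitted to keep it self-contained, and the tiny stop-word set is only an illustrative stand-in for nltk's full list:

```python
import re
import string

def clean_text(text, stop_words=frozenset({'is', 'as', 'the'})):
    """Apply the whitespace-split, punctuation-strip, lowercase,
    short-word, stop-word, and alphabetic filters in one pass."""
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    words = [re_punc.sub('', word).lower() for word in text.split()]
    return [word for word in words
            if len(word) > 1 and word not in stop_words and word.isalpha()]

print(clean_text("The door wasn't locked, as Gregor thought."))
# ['door', 'wasnt', 'locked', 'gregor', 'thought']
```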

At this stage, the textual data seems decent enough to be used for word embedding techniques. But note that some special kinds of data (for example, HTML) may require additional steps.
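For instance, a minimal sketch of one such extra step, removing HTML tags before the cleaning steps above, using only Python's standard-library html.parser module:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes, discarding tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return ''.join(stripper.chunks)

print(strip_tags('<p>One morning, <b>Gregor Samsa</b> woke.</p>'))
# One morning, Gregor Samsa woke.
```

The resulting plain text can then be fed through the same split/strip/lowercase pipeline as before.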

I hope you like the article. If you have any thoughts on the article then please let me know. Any constructive feedback is highly appreciated.

Connect with me on LinkedIn.

Mail me at [email protected]

Have a great day!


Published via Towards AI
