Unlocking the Potential of Text: A Closer Look at Pre-Embedding Text Cleaning Methods
Last Updated on August 1, 2023 by Editorial Team
Author(s): Shivamshinde
Originally published on Towards AI.
This article discusses text-cleaning techniques that help get the best performance out of textual data before it is passed to an embedding method.
To demonstrate the cleaning methods, we will use the text dataset named "metamorphosis" from Kaggle.
Let's start by importing the required Python libraries for the cleaning process.
import nltk, re, string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
Now let's load the dataset.
file_directory = 'link-to-the-dataset-local-directory'
# the with-statement closes the file automatically, even if reading fails
with open(file_directory, 'rt', encoding='utf-8') as file:
    text = file.read()
Note that for the above code cell to work, you need to replace the placeholder string with the local path of the data file.
Splitting the text data into words by whitespace
words = text.split()
print(words[:120])
Here we see that punctuation inside words is preserved (e.g., "armour-like" and "wasn't"), which is nice. We can also see that end-of-sentence punctuation stays attached to the last word (e.g., "thought."), which is not great.
So, this time let's try splitting the data on non-word characters.
words = re.split(r'\W+', text)
print(words[:120])
Here we see that words like "thought." have been converted into "thought". But the problem is that words like "wasn't" are now split into two tokens, "wasn" and "t". We need to fix that.
In Python, string.punctuation gives us the ASCII punctuation characters in a single string. We will use it to remove punctuation from our text.
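To make the difference between the two splitting strategies concrete, here is a small sketch on a made-up sentence. The sample string is an assumption for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample sentence; not taken from the dataset.
sample = "He wasn't hurt, he thought."

# Whitespace split keeps punctuation attached to the neighbouring word.
by_whitespace = sample.split()

# Splitting on runs of non-word characters drops the punctuation, but it
# also breaks contractions apart and leaves an empty string at the end
# when the text finishes with punctuation.
by_non_word = re.split(r'\W+', sample)

print(by_whitespace)
print(by_non_word)
```

The trailing empty string in the second result is one more reason to prefer the whitespace split followed by a separate punctuation-removal step.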
print(string.punctuation)
So now we will split the text by whitespace and then remove all punctuation characters from each word.
words = text.split()
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punc.sub('', word) for word in words]
print(stripped[:120])
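Here we can see that words like "thought." no longer carry trailing punctuation, and contractions like "wasn't" stay as a single token instead of being split in two. Keep in mind that the straight apostrophe and the hyphen are themselves part of string.punctuation, so they are removed from inside words as well (for example, "armour-like" becomes "armourlike").
Sometimes the text also contains characters that are not printable. We need to filter those out too. To do this, we can use Python's string.printable, a string of the ASCII characters that are considered printable, and remove every character that does not appear in it. Note that this also removes non-ASCII characters such as accented letters.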
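As an alternative to the compiled regular expression, the same stripping can be sketched with str.translate. The token list below is made up for illustration. The last token uses a typographic (curly) apostrophe to show that characters outside string.punctuation are left untouched.

```python
import string

# Hypothetical tokens for illustration; not taken from the dataset.
words = ["thought.", "armour-like", "wasn't", "wasn\u2019t"]

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)
stripped = [word.translate(table) for word in words]

print(stripped)
```

Whether a contraction keeps its apostrophe therefore depends on which apostrophe character the source text uses.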
re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', word) for word in stripped]
print(result[:120])
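A short sketch of what this filter does to non-ASCII input may help when deciding whether to apply it. The tokens are made up for illustration: two contain accented letters and one ends with a zero-width space.

```python
import re
import string

# Hypothetical tokens for illustration; not taken from the dataset.
tokens = ["caf\u00e9", "na\u00efve", "data\u200b", "plain"]

# Same pattern as above: match anything that is not in string.printable.
re_print = re.compile('[^%s]' % re.escape(string.printable))
cleaned = [re_print.sub('', token) for token in tokens]

print(cleaned)
```

The zero-width space is removed as intended, but the accented letters are dropped too, so this step is best suited to text that is expected to be plain English.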
Now let's convert all the words to lowercase. This reduces the size of our vocabulary, but it also has a drawback: after lowercasing, "Apple" the company and "apple" the fruit are treated as the same token.
result = [word.lower() for word in result]
print(result[:120])
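The effect on vocabulary size can be checked by counting distinct tokens before and after lowercasing. The token list here is a made-up example, not data from the article.

```python
# Hypothetical tokens for illustration; not taken from the dataset.
tokens = ["The", "the", "Apple", "apple", "door", "Door", "door"]

# Distinct tokens before and after case normalization.
vocab_before = set(tokens)
vocab_after = {token.lower() for token in tokens}

print(len(vocab_before), len(vocab_after))
```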
Also, one-character words contribute little to most NLP tasks, so we will remove those too.
result = [word for word in result if len(word) > 1]
print(result[:120])
In NLP, very frequent words such as "is", "as", and "the" do not contribute much to model training. Such words are known as stop words, and removing them during text cleaning is often recommended.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stop-word corpus
stop_words = stopwords.words('english')
print(stop_words)
stop_word_set = set(stop_words)  # build the set once for fast membership tests
result = [word for word in result if word not in stop_word_set]
print(result[:110])
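One detail worth checking at this step: NLTK's English list contains contracted forms such as "wasn't", while the tokens at this point have already had their punctuation stripped. The sketch below uses a tiny hand-written stop-word set as a stand-in for the NLTK list (an assumption, so that it runs without downloading anything) and applies the same punctuation stripping to the stop words before filtering.

```python
import string

# Tiny stand-in for a stop-word list (assumption; the article uses NLTK's list).
stop_words = {"the", "is", "as", "wasn't", "don't"}

# Hypothetical tokens as they look after punctuation stripping and lowercasing.
tokens = ["the", "door", "wasnt", "open", "dont", "move"]

# Strip punctuation from the stop words as well, so that "wasn't" becomes
# "wasnt" and matches the cleaned tokens.
table = str.maketrans('', '', string.punctuation)
normalized_stop_words = {word.translate(table) for word in stop_words}

naive = [t for t in tokens if t not in stop_words]
filtered = [t for t in tokens if t not in normalized_stop_words]

print(naive)
print(filtered)
```

Without the normalization, stripped contractions such as "wasnt" slip through the filter.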
Now, at this point, we will reduce words that share a root to a single form. For example, "running", "run", and "runs" are all reduced to "run", since the three carry essentially the same meaning for the model during training. This can be done with the PorterStemmer class from the nltk library.
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
result = [ps.stem(word) for word in result]
print(result[:110])
Stemmed words are not guaranteed to be real words. If you need real words, you can use lemmatization instead of stemming, which maps each word to its dictionary form (lemma).
Now let's remove the tokens that are not made up of alphabetic characters only.
result = [word for word in result if word.isalpha()]
print(result[:110])
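The steps above can be collected into a single function. This is a minimal sketch, not a definitive implementation: the function name clean_text and its parameters stop_words and stem are hypothetical, and the stop-word set and stemmer are passed in so that the snippet runs with the standard library alone.

```python
import re
import string


def clean_text(text, stop_words=frozenset(), stem=None):
    """Apply the cleaning steps from this article, in the same order."""
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))

    tokens = text.split()                                # split on whitespace
    tokens = [re_punc.sub('', t) for t in tokens]        # strip punctuation
    tokens = [re_print.sub('', t) for t in tokens]       # drop non-printable characters
    tokens = [t.lower() for t in tokens]                 # normalize case
    tokens = [t for t in tokens if len(t) > 1]           # drop one-character tokens
    tokens = [t for t in tokens if t not in stop_words]  # drop stop words
    if stem is not None:
        tokens = [stem(t) for t in tokens]               # optional stemming
    return [t for t in tokens if t.isalpha()]            # keep alphabetic tokens only


# Hypothetical sample text and stop-word set, for illustration only.
sample = "The door was open, and 3 men ran in."
print(clean_text(sample, stop_words={"the", "was", "and", "in"}))
```

To reproduce the article's pipeline, one could pass stop_words=set(stopwords.words('english')) and stem=PorterStemmer().stem.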
At this stage, the textual data seems decent enough to be used for the word embedding techniques. But also note that there might be some additional steps to this process for some special kinds of data (for example, HTML code).
I hope you like the article. If you have any thoughts on the article then please let me know. Any constructive feedback is highly appreciated.
Connect with me on LinkedIn.
Have a great day!
Published via Towards AI