How and Why to Implement Stemming and Lemmatization from NLTK
Last Updated on July 24, 2023 by Editorial Team
Author(s): Manmohan Singh
Originally published on Towards AI.
In this article, we try to solve one of NLP's problems by implementing Stemming and Lemmatization
By some estimates, the English language has more than a million words in its vocabulary, of which around 170k are in current use. These words are combined into sentences according to grammatical rules. For grammatical reasons, a sentence may use different forms of the same word, such as plays, played, and playing.
In Natural Language Processing (NLP) models, these inflected forms add little information: they enlarge the vocabulary without adding much meaning. A common goal is therefore to get the same result from fewer distinct words, which saves processing time and disk space.
In this article, we try to solve this NLP problem by implementing Stemming and Lemmatization. Both methods convert derived words to their base words.
However, the two methods use different algorithms and do not produce the same results. This article goes over those differences and the Natural Language Toolkit (NLTK) implementation of each.
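To make the motivation concrete, the sketch below counts distinct tokens before and after mapping inflected forms to a base form. The `BASE_FORMS` mapping and the sample sentence are made up for this example; in a real project a stemmer or lemmatizer would supply the mapping.

```python
from collections import Counter

# Hypothetical hand-written mapping from inflected forms to a base form.
# A real project would obtain this from a stemmer or lemmatizer.
BASE_FORMS = {"plays": "play", "played": "play", "playing": "play"}

tokens = "she plays he played they are playing we play".split()

# Count every distinct surface form as-is.
raw_counts = Counter(tokens)

# Count again after replacing known inflected forms with their base form.
normalized_counts = Counter(BASE_FORMS.get(token, token) for token in tokens)

print("Distinct words before:", len(raw_counts))
print("Distinct words after :", len(normalized_counts))
print("Occurrences of 'play':", normalized_counts["play"])
```

The vocabulary shrinks because four surface forms collapse into one entry, which is the effect both methods in this article aim for.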
Stemming
Stemming derives a base form by cutting letters off the end of a word. The resulting base forms are known as stems. A stem is not always a real dictionary word, however, so stemmed text can become hard to read or even meaningless, and stemming can reduce the accuracy of a model.
There are different types of stemming algorithms. This article uses only Porter's algorithm and the Snowball algorithm, two of the most commonly used stemmers.
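As an illustration of suffix stripping, here is a deliberately naive sketch. The suffix list and the `naive_stem` helper are invented for this example and are not how Porter or Snowball work internally; those algorithms apply ordered rule sets with additional conditions on the remaining stem.

```python
# Illustrative suffix list only (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "es", "s", "y")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["plays", "played", "playing", "monastery", "his"]:
    print(word, "->", naive_stem(word))
```

The three forms of "play" collapse to the same stem, while "monastery" loses its final letter and is no longer a dictionary word. That is the trade-off described above.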
NLTK implementation of Porter Stemmer.
import string

import nltk

porter_stemmer = nltk.PorterStemmer()

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Strip surrounding punctuation so that tokens such as "this." are stemmed as "this".
words = [word.strip(string.punctuation) for word in text.split()]
stemmed_words = [porter_stemmer.stem(word) for word in words]

print(f"Original text: {text}\n")
print(f"Stemmed text : {' '.join(stemmed_words)}")
With this method, the words "ready" and "this" become "readi" and "thi", which makes the sentence hard to read. The conversion of "his" to "hi" even changes the meaning of the sentence. I do not recommend this method for any critical project; use it for study purposes only.
The Snowball stemmer, whose English variant is also known as Porter2, is an improved version of the Porter stemmer. It tends to be more precise than the original Porter stemmer on large datasets.
NLTK implementation of Snowball Stemmer.
import string

import nltk

snowball_stemmer = nltk.SnowballStemmer("english")

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Strip surrounding punctuation so that tokens such as "once." are stemmed as "once".
words = [word.strip(string.punctuation) for word in text.split()]
stemmed_words = [snowball_stemmer.stem(word) for word in words]

print(f"Original text: {text}\n")
print(f"Stemmed text : {' '.join(stemmed_words)}")
This method does not convert "his" to "hi". Letters are chopped off properly from the words "cutting", "claims", and "rights", so there is an improvement. However, the conversion of "once" and "monastery" to "onc" and "monasteri" still leaves non-dictionary words in the sentence.
Lemmatization
Lemmatization analyzes the structure of a word, the relationships between words, and the parts of a word to identify its base form accurately. A part-of-speech tagger and a vocabulary (dictionary) help return the dictionary form of the word, which is known as a lemma. This requires more processing time and disk space than stemming, but the accuracy of the NLP model is comparatively high with this method.
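The role of the dictionary and the part of speech can be sketched with a toy lookup table. The `LEMMA_TABLE` entries and the `toy_lemmatize` helper are hypothetical and exist only for this illustration; in NLTK, the WordNet database plays the role of the dictionary.

```python
# Tiny hand-written lemma dictionary keyed by (word, part of speech).
# Hypothetical data for illustration only.
LEMMA_TABLE = {
    ("claims", "noun"): "claim",
    ("rights", "noun"): "right",
    ("was", "verb"): "be",
    ("cutting", "verb"): "cut",
    ("cutting", "noun"): "cutting",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("cutting", "verb"))    # looked up as a verb
print(toy_lemmatize("cutting", "noun"))    # looked up as a noun
print(toy_lemmatize("monastery", "noun"))  # unknown words pass through
```

The same surface form can have different lemmas depending on its part of speech, which is why a tagger helps. NLTK's `WordNetLemmatizer.lemmatize` takes an optional `pos` argument in the same spirit, and it defaults to noun.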
NLTK implementation of Lemmatization.
import string

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Strip surrounding punctuation so each token is a bare word.
words = [word.strip(string.punctuation) for word in text.split()]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(f"Original text: {text}\n")
print(f"Lemmatized text : {' '.join(lemmatized_words)}")
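The lemmatization method converts the words "claims" and "rights" to "claim" and "right". Most other words are unaffected, so the meaning of the sentence is largely intact. Note that, without part-of-speech information, the WordNet lemmatizer treats every word as a noun, so a few short words can still be altered (for example, "was" may come back as "wa").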
Code to distinguish between Lemmatization and Stemming
import string

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
ps = nltk.PorterStemmer()
stemmer = nltk.SnowballStemmer("english")

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Strip surrounding punctuation so all three methods see the same bare words.
words = [word.strip(string.punctuation) for word in text.split()]

porter_stem_text = [ps.stem(word) for word in words]
snowball_stem_text = [stemmer.stem(word) for word in words]
lemmatized_text = [lemmatizer.lemmatize(word) for word in words]

print(f"Original text: {text}\n")
print(f"Porter stemmed text   : {' '.join(porter_stem_text)}\n")
print(f"Snowball stemmed text : {' '.join(snowball_stem_text)}\n")
print(f"Lemmatized text       : {' '.join(lemmatized_text)}\n")
The Porter and Snowball stemmers convert some words to non-dictionary words. This restricts the use of stemming to tasks where the output does not need to be readable, such as search engines, n-gram contexts, and text classification.
Lemmatization can be used in paragraph/document summarization, word/sentence prediction, sentiment analysis, and others.
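The search-engine use case can be sketched with a small inverted index. Everything here is an assumption made for illustration: `crude_stem` is a placeholder stemmer (any stemmer with the same signature, such as an NLTK one, could be passed instead), and the three documents are invented.

```python
from collections import defaultdict
from typing import Callable


def crude_stem(word: str) -> str:
    """Placeholder stemmer (assumption): drop a trailing 'ing', 'ed' or 's'."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[int, str], stem: Callable[[str], str]) -> dict[str, set[int]]:
    """Map each stemmed token to the ids of the documents that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


docs = {
    1: "the monks claimed fishing rights",
    2: "he drops the claim",
    3: "a quiet monastery",
}
index = build_index(docs, crude_stem)

# The query is stemmed the same way, so "claims" matches "claimed" and "claim".
print(sorted(index[crude_stem("claims")]))
```

The stems never need to be shown to the user, so it does not matter here that some of them are not dictionary words.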
Conclusion
The choice between stemming and lemmatization depends on the project requirements. Lemmatization is the better fit for critical projects and for projects where sentence structure matters, such as language applications. Both methods affect precision and recall: stemming tends to reduce precision and increase recall.
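The effect on precision and recall can be shown with a small worked example. The document ids and the relevance judgments below are entirely hypothetical; they only show the direction of the trade-off, not measured results.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the documents a human judged relevant.
relevant = {"doc_a", "doc_b", "doc_c"}

# Exact matching finds only the document containing the literal query word.
exact_match = {"doc_a"}

# Stemmed matching also finds an inflected form (doc_b), but an aggressive
# stem shared with unrelated words pulls in doc_x and doc_y as well.
stemmed_match = {"doc_a", "doc_b", "doc_x", "doc_y"}

for name, retrieved in [("exact", exact_match), ("stemmed", stemmed_match)]:
    precision, recall = precision_recall(retrieved, relevant)
    print(f"{name:8s} precision={precision:.2f} recall={recall:.2f}")
```

In this made-up scenario, stemming finds more of the relevant documents (higher recall) but also returns unrelated ones (lower precision).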
Hopefully, this article helps you with NLP models and problems.