How and Why to Implement Stemming and Lemmatization from NLTK

Last Updated on July 24, 2023 by Editorial Team

Author(s): Manmohan Singh

Originally published on Towards AI.

In this article, we try to solve one of NLP’s problems by implementing Stemming and Lemmatization


The English language has more than a million words in its vocabulary, of which around 170,000 are in current use. These words are grouped into sentences according to grammatical rules. For grammatical reasons, sentences use different forms of a word derived from the same base, such as plays, played, and playing.

When working on Natural Language Processing (NLP) models and problems, these variant forms do not help much. A main goal in many NLP problems is to achieve the result with fewer distinct words, which saves processing time and disk space.

In this article, we try to solve this NLP problem by implementing Stemming and Lemmatization. Both methods convert derived words to their base words.
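As a minimal sketch of what converting derived words to a base word looks like, the snippet below runs NLTK’s Snowball stemmer over the three forms mentioned earlier. The variable names are illustrative and are not part of the original examples.

```python
from nltk.stem import SnowballStemmer

# Three inflected forms of the same verb.
words = ["plays", "played", "playing"]

stemmer = SnowballStemmer("english")
stems = [stemmer.stem(word) for word in words]

# The three surface forms collapse to a single vocabulary entry.
print(stems)
print(f"unique words before: {len(set(words))}, after: {len(set(stems))}")
```

Fewer unique entries means a smaller vocabulary for any model or index built on top of the text.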

However, the two methods use different algorithms and are not the same. In this article, we go over these differences and the Natural Language Toolkit (NLTK) implementation of each.

Stemming

Stemming derives the root of a word by cutting letters off its end. The result is known as a stem. However, a stem is not always a valid word, which can make the stemmed sentence meaningless. Stemming can also reduce the accuracy of a model.
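To make the idea of cutting letters off the end concrete, here is a deliberately naive suffix stripper. It is a toy for illustration only and is not Porter’s or Snowball’s algorithm; the suffix list and the `naive_stem` helper are hypothetical.

```python
# Toy suffix stripper: NOT Porter's or Snowball's algorithm, only an illustration.
SUFFIXES = ("ing", "ed", "es", "s", "y")  # hypothetical, hand-picked list


def naive_stem(word: str) -> str:
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["playing", "played", "plays", "monastery", "this"]:
    print(f"{word} -> {naive_stem(word)}")
```

Blind suffix removal handles the regular forms of "play" but turns "monastery" and "this" into strings that are not words, which is the weakness described above. Real stemmers add rules and conditions to limit this, but they do not remove it entirely.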

There are different types of stemming algorithms. In this article, we use only Porter’s algorithm and the Snowball algorithm, which are more effective than most of the others.

NLTK implementation of Porter Stemmer.

import nltk
porter_stemmer = nltk.PorterStemmer()
text = ("He determined to drop his litigation with the monastery, and relinquish"
        " his claims to the wood cutting and fishery rights at once. "
        "He was more ready to do this.")
# Stem each whitespace-separated token
stemmed_words = [porter_stemmer.stem(word) for word in text.split()]
print(f"Original text: {text} \n")
print(f"Stemmed text : {' '.join(stemmed_words)}")

Porter’s algorithm converts words such as ‘ready’ and ‘this’ to ‘readi’ and ‘thi’, which makes the sentence meaningless. It also converts ‘his’ to ‘hi’, which changes the meaning of the sentence. I do not recommend this method for any critical project; use it for study purposes only.
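These conversions can be checked one word at a time. The sketch below assumes NLTK is installed. It also strips punctuation before stemming, a step that is not in the original code, because `str.split()` leaves punctuation attached to tokens such as `this.` and the stemmer then sees a different string.

```python
import string

from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

# Words singled out in the text, checked one at a time.
for word in ["ready", "this", "his"]:
    print(f"{word} -> {porter_stemmer.stem(word)}")

# str.split() keeps punctuation attached, so strip it before stemming.
raw_tokens = "He was more ready to do this.".split()
clean_tokens = [token.strip(string.punctuation) for token in raw_tokens]
stemmed_tokens = [porter_stemmer.stem(token) for token in clean_tokens]
print(stemmed_tokens)
```

A tokenizer that separates punctuation from words would serve the same purpose; stripping is only the shortest way to show the effect.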

The Snowball stemmer (also known as Porter2) is an improved version of the Porter stemmer. It tends to be more precise on large datasets.

NLTK implementation of Snowball Stemmer.

import nltk
snowball_stemmer = nltk.SnowballStemmer('english')
text = ("He determined to drop his litigation with the monastery, and relinquish"
        " his claims to the wood cutting and fishery rights at once. "
        "He was more ready to do this.")
# Stem each whitespace-separated token
stemmed_words = [snowball_stemmer.stem(word) for word in text.split()]
print(f"Original text: {text} \n")
print(f"Stemmed text : {' '.join(stemmed_words)}")

The Snowball stemmer does not convert ‘his’ to ‘hi’. Letters are properly chopped off the words ‘cutting’, ‘claims’, and ‘rights’, so there is an improvement. However, it still converts words such as ‘once’ and ‘monastery’ to ‘onc’ and ‘monasteri’, which makes the sentence meaningless.
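A side-by-side view makes the difference between the two stemmers easier to see. This is a small sketch over the words discussed in the two stemming sections; the `comparison` dictionary and the table layout are illustrative choices.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")

# Words discussed in the two stemming sections.
words = ["his", "this", "once", "monastery", "cutting", "claims", "rights"]

# Map each word to its (Porter stem, Snowball stem) pair.
comparison = {
    word: (porter_stemmer.stem(word), snowball_stemmer.stem(word))
    for word in words
}

print(f"{'word':<12}{'porter':<12}{'snowball':<12}")
for word, (porter_stem, snowball_stem) in comparison.items():
    print(f"{word:<12}{porter_stem:<12}{snowball_stem:<12}")
```

Rows where the two columns differ are the cases the Snowball rules were revised to handle; rows where both columns show a non-word are the cases neither stemmer fixes.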

Lemmatization

Lemmatization analyzes the structure of words, the relationships between words, and the parts of words to accurately identify the root word. A part-of-speech tagger and a vocabulary help return the dictionary form of a word. This requires more processing time and disk space than stemming, but the accuracy of the NLP model is comparatively higher. The root word is known as a lemma.

NLTK implementation of Lemmatization.

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet", quiet=True)  # WordNet data is required by the lemmatizer
lemmatizer = WordNetLemmatizer()
text = ("He determined to drop his litigation with the monastery, and relinquish"
        " his claims to the wood cutting and fishery rights at once. "
        "He was more ready to do this.")
# lemmatize() treats each token as a noun unless a POS tag is passed
lemmatized_words = [lemmatizer.lemmatize(word) for word in text.split()]
print(f"Original text: {text} \n")
print(f"Lemmatized text : {' '.join(lemmatized_words)}")

The lemmatization method converts the words ‘claims’ and ‘rights’ to ‘claim’ and ‘right’. Most other words are unaffected, and the meaning of the sentence stays intact.

Code to distinguish between Lemmatization and Stemming

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet", quiet=True)  # WordNet data is required by the lemmatizer
lemmatizer = WordNetLemmatizer()
ps = nltk.PorterStemmer()
stemmer = nltk.SnowballStemmer('english')
text = ("He determined to drop his litigation with the monastery, and relinquish"
        " his claims to the wood cutting and fishery rights at once. "
        "He was more ready to do this.")
# Run all three methods on the same whitespace-separated tokens
porter_stem_text = [ps.stem(word) for word in text.split()]
snowball_stem_text = [stemmer.stem(word) for word in text.split()]
lemmatize_stem_text = [lemmatizer.lemmatize(word) for word in text.split()]
print(f"Original text: {text} \n")
print(f"Porter Stemmed text : {' '.join(porter_stem_text)}\n")
print(f"Snowball Stemmed text : {' '.join(snowball_stem_text)}\n")
print(f"Lemmatized text : {' '.join(lemmatize_stem_text)}\n")

The Porter and Snowball stemmers convert some words to non-dictionary words. This restricts their use to applications such as search engines, n-gram contexts, and text classification problems.

Lemmatization can be used in paragraph or document summarization, word or sentence prediction, sentiment analysis, and similar tasks.

Conclusion

The choice between stemming and lemmatization depends solely on project requirements. Lemmatization should be used for critical projects and for projects where sentence structure matters, such as language applications. Both stemming and lemmatization affect precision and recall: stemming typically reduces precision and increases recall.
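The precision and recall trade-off can be illustrated with a toy retrieval example. Everything in the sketch below is hypothetical: the four one-word "documents", the hand-made relevance labels for the query, and the `precision_recall` helper. It only shows the direction of the effect, not its size on real data.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical mini-corpus: one keyword per document, with hand-made
# relevance labels for the query "universe".
documents = {
    "universe": True,
    "universes": True,
    "university": False,
    "universal": False,
}
query = "universe"


def precision_recall(retrieved):
    """Compute (precision, recall) of a retrieved set against the labels."""
    relevant = {doc for doc, is_relevant in documents.items() if is_relevant}
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)


# Exact matching retrieves only identical strings.
exact = {doc for doc in documents if doc == query}
# Stem matching retrieves every document that shares the query's stem.
stemmed = {doc for doc in documents if stemmer.stem(doc) == stemmer.stem(query)}

print("exact  :", precision_recall(exact))
print("stemmed:", precision_recall(stemmed))
```

Stem matching finds the plural that exact matching misses, which raises recall, but it also pulls in unrelated words that share the same stem, which lowers precision.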

Hopefully, this article helps you with NLP models and problems.

Other Articles by Author

  1. First step in EDA : Descriptive Statistic Analysis
  2. Automate Sentiment Analysis Process for Reddit Post: TextBlob and VADER
  3. Discover the Sentiment of Reddit Subgroup using RoBERTa Model
