How and Why to Implement Stemming and Lemmatization from NLTK

Last Updated on July 24, 2023 by Editorial Team

Author(s): Manmohan Singh

Originally published on Towards AI.

In this article, we try to solve one of NLP’s problems by implementing Stemming and Lemmatization

By some counts, the English language has more than a million words in its vocabulary, of which around 170,000 are in current use. These words are grouped into sentences according to grammatical rules. For grammatical reasons, sentences use different forms of the same word that are derived from one another, such as plays, played, and playing.

When working on Natural Language Processing (NLP) models and problems, these derived forms rarely add much information. A common goal in NLP is to achieve the same result from fewer distinct words, which saves processing time and disk space.

In this article, we try to solve this NLP problem by implementing Stemming and Lemmatization. Both methods convert derived words to their base words.
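As a quick illustration of why this matters, the sketch below counts distinct words before and after stemming. The sample word list and variable names are made up for this example; only `nltk.PorterStemmer` comes from NLTK.

```python
import nltk

# Hypothetical sample: three inflected forms of "play" and two of "jump".
words = ["plays", "played", "playing", "jumps", "jumping"]

stemmer = nltk.PorterStemmer()
stems = [stemmer.stem(word) for word in words]

vocabulary_before = set(words)
vocabulary_after = set(stems)

print(f"Distinct words before stemming: {len(vocabulary_before)}")
print(f"Distinct words after stemming : {len(vocabulary_after)}")
print(f"Stems: {sorted(vocabulary_after)}")
```

Five surface forms collapse into two stems, which is the kind of vocabulary reduction the rest of the article is about.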

However, the two methods use different algorithms and are not the same. In this article, we go over these differences and the Natural Language Toolkit (NLTK) implementation of each.

Stemming

Stemming reduces a word to its root by cutting letters off the end of the word. The result is known as a stem. A stem is not always a real dictionary word, however, so a stemmed sentence can become meaningless. Stemming can also reduce the accuracy of a model.

There are several stemming algorithms. We use only Porter’s algorithm and the Snowball algorithm in this article, as they are among the most widely used.
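To make the idea of cutting letters off the end of a word concrete, here is a deliberately naive suffix stripper. It is not Porter's algorithm or anything NLTK ships; the suffix list and the function name `naive_stem` are assumptions chosen purely for illustration.

```python
# Hypothetical suffix list; real stemmers use many more rules and conditions.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["plays", "played", "playing", "this", "ready"]:
    print(f"{word} -> {naive_stem(word)}")
```

Even this toy version shows both sides of stemming: plays, played, and playing all reduce to play, but this is chopped down to thi, which is not a word.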

NLTK implementation of Porter Stemmer.

import nltk

porter_stemmer = nltk.PorterStemmer()

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Stem each whitespace-separated token.
stemmed_words = [porter_stemmer.stem(word) for word in text.split()]

print(f"Original text: {text}\n")
print(f"Stemmed text : {' '.join(stemmed_words)}")

The Porter stemmer converts the words ‘ready’ and ‘this’ to ‘readi’ and ‘thi’, which makes the sentence meaningless. It also converts ‘his’ to ‘hi’, which changes the meaning of the sentence. I do not recommend this method for building any critical project. Use it for study purposes only.

The Snowball stemmer is an improved version of the Porter stemmer. It tends to be more precise over large datasets.

NLTK implementation of Snowball Stemmer.
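One caveat about the snippet above: `text.split()` leaves punctuation attached, so a token such as `this.` reaches the stemmer with its full stop still on it. Below is a minimal sketch of one workaround, assuming a simple regular expression is good enough for tokenizing English prose. The shortened sample sentence, the pattern, and the variable names are illustrative and not part of NLTK.

```python
import re

import nltk

porter_stemmer = nltk.PorterStemmer()

# Shortened, illustrative sample sentence.
text = "He was more ready to do this. He dropped his claims at once."

# Keep runs of letters only, so "this." becomes "this".
tokens = re.findall(r"[A-Za-z]+", text)
stems = [porter_stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(f"{token:>8} -> {stem}")
```

With punctuation stripped first, the stems discussed here (readi, thi, and hi) show up directly in the printed pairs.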

import nltk

snowball_stemmer = nltk.SnowballStemmer("english")

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Stem each whitespace-separated token.
stemmed_words = [snowball_stemmer.stem(word) for word in text.split()]

print(f"Original text: {text}\n")
print(f"Stemmed text : {' '.join(stemmed_words)}")

This method does not convert the word ‘his’ to ‘hi’. Letters are properly chopped off from the words ‘cutting’, ‘claims’, and ‘rights’. We can say that there is an improvement. But converting words such as ‘once’ and ‘monastery’ to ‘onc’ and ‘monasteri’ still makes the sentence meaningless.
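To see the two stemmers side by side, the following sketch runs both on a handful of words from the sample sentence. The word list and the table layout are illustrative choices; `SnowballStemmer.languages` is the class attribute NLTK uses to list the languages Snowball supports.

```python
import nltk

porter = nltk.PorterStemmer()
snowball = nltk.SnowballStemmer("english")

# Words taken from the sample sentence used in this article.
words = ["his", "this", "once", "cutting", "claims", "rights"]

for word in words:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    print(f"{word:>8} | porter: {porter_stem:<6} | snowball: {snowball_stem}")

# Unlike the Porter stemmer, Snowball also covers languages other than English.
print(nltk.SnowballStemmer.languages)
```

The row for his is where the two algorithms visibly part ways: Porter strips the final s, Snowball leaves the word alone.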

Lemmatization

Lemmatization analyzes the structure of words, the relationships between words, and the parts of words to identify the root word accurately. A part-of-speech tagger and a vocabulary help it return the dictionary form of a word. This requires more processing time and disk space than stemming, but the accuracy of the NLP model is comparatively high with this method. The root word is known as a lemma.

NLTK implementation of Lemmatization.

from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: run nltk.download("wordnet") once beforehand.
lemmatizer = WordNetLemmatizer()

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Lemmatize each whitespace-separated token (treated as a noun by default).
lemmatized_words = [lemmatizer.lemmatize(word) for word in text.split()]

print(f"Original text: {text}\n")
print(f"Lemmatized text : {' '.join(lemmatized_words)}")

The lemmatization method converts the words ‘claims’ and ‘rights’ to ‘claim’ and ‘right’. Most other words are left as they are, and the meaning of the sentence stays intact.

Code to distinguish between Lemmatization and Stemming

import nltk
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: run nltk.download("wordnet") once beforehand.
lemmatizer = WordNetLemmatizer()
porter_stemmer = nltk.PorterStemmer()
snowball_stemmer = nltk.SnowballStemmer("english")

text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)

# Apply each method to the same whitespace-separated tokens.
porter_stem_text = [porter_stemmer.stem(word) for word in text.split()]
snowball_stem_text = [snowball_stemmer.stem(word) for word in text.split()]
lemmatized_text = [lemmatizer.lemmatize(word) for word in text.split()]

print(f"Original text: {text}\n")
print(f"Porter stemmed text : {' '.join(porter_stem_text)}\n")
print(f"Snowball stemmed text : {' '.join(snowball_stem_text)}\n")
print(f"Lemmatized text : {' '.join(lemmatized_text)}\n")

The Porter and Snowball stemming methods convert some words to non-dictionary words. This restricts their use to tasks that do not need readable output, such as search engines, n-gram context, and text classification problems.

Lemmatization can be used in paragraph/document summarization, word/sentence prediction, sentiment analysis, and others.

Conclusion

The choice between stemming and lemmatization depends on project requirements. Lemmatization is the better fit for critical projects and for projects where sentence structure matters, such as language applications. Both stemming and lemmatization affect precision and recall. Stemming in particular tends to reduce precision and increase recall.
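The precision and recall trade-off can be sketched with a toy keyword search. Everything here except `nltk.PorterStemmer` is hypothetical: the four documents, the `tokenize` and `search` helpers, and the matching rule (a document is a hit if it shares at least one term with the query).

```python
import re

import nltk

stemmer = nltk.PorterStemmer()

# Hypothetical mini-corpus for illustration only.
documents = [
    "He plays football",
    "She played chess",
    "The universe is expanding",
    "The university opened a new campus",
]


def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def search(query: str, use_stemming: bool) -> list:
    """Return indices of documents sharing at least one term with the query."""
    normalize = stemmer.stem if use_stemming else (lambda token: token)
    query_terms = {normalize(token) for token in tokenize(query)}
    hits = []
    for index, document in enumerate(documents):
        document_terms = {normalize(token) for token in tokenize(document)}
        if query_terms & document_terms:
            hits.append(index)
    return hits


print("play, exact       :", search("play", use_stemming=False))
print("play, stemmed     :", search("play", use_stemming=True))
print("universe, exact   :", search("universe", use_stemming=False))
print("universe, stemmed :", search("universe", use_stemming=True))
```

With stemming, the query play finds both sports documents that exact matching misses, which is the recall gain. The query universe, however, also pulls in the university document, because the Porter stemmer reduces both words to the same stem. That extra hit is the precision loss.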

Hopefully, this article helps you with NLP models and problems.

Published via Towards AI
