Stemming: Porter Vs. Snowball Vs. Lancaster
Last Updated on January 6, 2023 by Editorial Team
Author(s): Kaustubh Bhavsar
Learn how different popular stemmers work and how stemming differs from lemmatization
Introduction
A common application of Natural Language Processing (NLP) is Information Retrieval (IR). An IR system deals with obtaining, from a large collection of resources, those that are relevant to a search query. Suppose we want to retrieve resources for the word "connect". For grammatical reasons, this word takes on different forms such as "connected", "connection", or "connecting". All these words have the same core meaning, but depending on the context in which they are used, they vary in spelling or ending. It makes sense, then, to use a single word to represent this group of related words when searching for resources relevant to the query.
Why should we use a single word to represent a group of related words?
- Using a single word to represent a group of related words will typically improve the performance of an IR system.
- Reducing the number of words also reduces the overall size, and thus the complexity, of the data in the system.
What is Stemming?
Stemming is the process of identifying the root form (also called the base form) of a word by replacing or removing its suffixes. Researcher J.B. Lovins, in her article "Development of a Stemming Algorithm," defines stemming as:
"A stemming algorithm is a computational procedure which reduces all words with the same root… to a common form, usually by stripping each word of its derivational and inflectional suffixes."
For example, "boat" is the root of the words "boats", "boater", and "boating".
In stemming, this root form of the word is called the stem. So, in our case, "boat" is the stem. To strengthen the concept, let's consider another example: "connect" is the stem of the words "connect", "connecting", "connection", "connections", and "connected".
Prefix stripping is not widely used for stemming; however, it may be useful in certain subjects, such as chemistry.
Does this mean that stemming will always result in a valid root word?
No. For example, if we process the word "transparent" through the Porter stemmer, the root word we get is "transpar". Similarly, if we process the same word through the Lancaster stemmer, the root word we get is "transp". C.D. Paice, in "Another Stemmer," notes that it is usually sufficient for IR systems to map related words to the same stem; the root word does not need to be valid.
"…the process is aimed at mapping for retrieval purposes, the stem need not be a linguistically correct lemma or root."
Errors in Stemming:
- Over-stemming: occurs when two or more unrelated words result in the same stem.
- Under-stemming: occurs when two or more related words result in different stems. Both cases are illustrated in the sketch below.
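To make these two errors concrete, here is a rough sketch using NLTK's PorterStemmer (introduced in the next section); the example words are chosen purely for illustration.

```python
# An illustrative sketch of over- and under-stemming using NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: semantically distinct words can collapse to the same stem.
print([porter.stem(w) for w in ["universal", "university", "universe"]])

# Under-stemming: closely related words can end up with different stems.
print([porter.stem(w) for w in ["alumnus", "alumni", "alumnae"]])
```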
Let's go through the three most popular stemmers: Porter, Snowball, and Lancaster.
Note: A detailed explanation of how these algorithms work is beyond the scope of this article.
Porter Stemmer
It is one of the most commonly used stemmers, developed by M.F. Porter in 1980. Porter's stemmer consists of five phases, which are applied sequentially. Within each phase, there are certain conventions for selecting rules. The entire Porter algorithm is small and thus fast and simple. The drawback of this stemmer is that it supports only the English language, and the stem it produces may or may not be linguistically correct.
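Here is a minimal sketch using NLTK's PorterStemmer; the word list is an assumed example.

```python
# A minimal sketch using NLTK's PorterStemmer (the word list is an assumed example).
from nltk.stem import PorterStemmer

words = ["was", "found", "mice", "run", "running", "ran"]
porter = PorterStemmer()

# Stem each word and print the results
print([porter.stem(word) for word in words])
```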
The code snippet shown above will produce: wa, found, mice, run, run, ran
Notice how the stem of "was" is "wa" according to the Porter algorithm, which is linguistically invalid.
Snowball Stemmer
M.F. Porter also developed the Snowball stemmer. Snowball is a string-processing language created mainly for writing stemming algorithms. Porter created it as an improvement over his earlier Porter algorithm. It supports multiple languages, including English, Russian, Danish, French, Finnish, German, Italian, Hungarian, Portuguese, Norwegian, Swedish, and Spanish. The Snowball stemmer for the English language is called Porter2.
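A minimal sketch using NLTK's SnowballStemmer for English, on the same assumed word list:

```python
# A minimal sketch using NLTK's SnowballStemmer for English (Porter2).
from nltk.stem import SnowballStemmer

words = ["was", "found", "mice", "run", "running", "ran"]
snowball = SnowballStemmer("english")

# Stem each word and print the results
print([snowball.stem(word) for word in words])
```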
The code snippet shown above will produce: was, found, mice, run, run, ran
According to M.F. Porter, stemming stopwords such as "being" to "be" is not useful because the two words do not share a meaning, even though there may be a grammatical connection between them. Snowball provides an additional parameter, ignore_stopwords, which is set to False by default. If it is set to True, Snowball will not stem stopwords.
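A minimal sketch of this behaviour, assuming the NLTK stopwords corpus has been downloaded:

```python
# A minimal sketch: with ignore_stopwords=True, stopwords such as "being" are left untouched.
# Requires the NLTK stopwords corpus: nltk.download("stopwords")
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english", ignore_stopwords=True)
print(snowball.stem("being"))
```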
The code snippet shown above will produce: being
If ignore_stopwords is set to False, then the same code snippet will output "be". Try it for yourself.
Lancaster Stemmer
The Lancaster stemmer is also known as the Paice/Husk stemmer. It was developed by C.D. Paice at Lancaster University in 1990. It uses an iterative approach, which makes it the most aggressive of the three stemmers described in this article. Because of this iterative approach, it may over-stem, resulting in linguistically incorrect roots. It is not as efficient as the Porter or Snowball stemmers, and it supports only the English language.
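A minimal sketch using NLTK's LancasterStemmer on the same assumed word list:

```python
# A minimal sketch using NLTK's LancasterStemmer.
from nltk.stem import LancasterStemmer

words = ["was", "found", "mice", "run", "running", "ran"]
lancaster = LancasterStemmer()

# Stem each word and print the results
print([lancaster.stem(word) for word in words])
```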
The code snippet shown above will produce: was, found, mic, run, run, ran
As the Lancaster stemmer uses a more aggressive approach, the word "mice" gets reduced to "mic".
What is Lemmatization? And how does it differ from Stemming?
As we have seen so far, a stemming algorithm may or may not return a valid root. This is not an issue for IR systems; however, can such invalid stems be used in language modeling, where linguistically correct forms are critical? Probably not. To overcome this drawback of stemming, we use lemmatization, which identifies the inflected forms of a word and returns its linguistically correct root. In the book "Introduction to Information Retrieval," the authors define lemmatization as:
"Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word…"
💡 An inflected form of a word has a changed spelling or ending.
In lemmatization, the root of a word is called the lemma. So, for words like "run", "runs", "running", or "ran", "run" is the lemma. For a word like "mice", the lemma is "mouse".
One of the most widely used lemmatizers is the WordNet lemmatizer provided by NLTK. To get the precise lemma, we must provide an appropriate part-of-speech (pos) tag. By default, every word is treated as a noun (n). We'll pass verb (v) as the value of the pos parameter for all the words in the list.
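A minimal sketch using NLTK's WordNetLemmatizer, assuming the WordNet corpus has been downloaded:

```python
# A minimal sketch using NLTK's WordNetLemmatizer with the verb (v) POS tag.
# Requires the WordNet corpus: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

words = ["was", "found", "mice", "run", "running", "ran"]
lemmatizer = WordNetLemmatizer()

# Lemmatize each word as a verb and print the results
print([lemmatizer.lemmatize(word, pos="v") for word in words])
```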
The code snippet shown above will produce: be, find, mice, run, run, run
Here, we observe that "was", "found", "run", "running", and "ran" are verbs, so the output we get is precise. However, the output for "mice" is "mice". Why? This happens because the word "mice" is a noun, so changing the pos tag to noun (n) will give us "mouse" as the output. Try it for yourself.
Conclusion
We saw how stemming and lemmatization differ, as well as how different stemmers work. Many more stemmers and lemmatizers are available beyond the ones mentioned in this article. To summarize, let's answer two questions.
First, should we choose stemming or lemmatization for the preprocessing step? It depends on the application being built. Stemming is fast compared to lemmatization. Also, stemming may or may not return a valid root, whereas lemmatization will return a linguistically correct root. So, in applications where speed matters, like search and retrieval systems, stemming could be preferred; in applications where a valid root matters, like language modeling, lemmatization could be preferred. Remember, the final decision on what to use for the preprocessing step will always depend on the application being created and its creator. Try experimenting with both and see how the results differ.
Second, is stemming used only in IR systems? No. Stemming can be used in any application that requires transforming morphological forms of words to their roots, which means it can also be applied in text summarization or text categorization.
Let me know in the comment section if you've used or tried experimenting with these stemmers or lemmatizers and how beneficial they were for you!
You can connect and reach out to me via LinkedIn.