
Stemming: Porter Vs. Snowball Vs. Lancaster

Last Updated on January 6, 2023 by Editorial Team

Author(s): Kaustubh Bhavsar


Learn how different popular Stemmers work and how Stemming differs from Lemmatization

Terms such as connections, connection, connected, and connecting have a common stem — connect.
Source: freely available Canva image, edited by the author.

Introduction

A common application of Natural Language Processing (NLP) is Information Retrieval (IR). An IR system deals with obtaining, from a large collection of resources, those that are relevant to a search query. Suppose we want to retrieve the resources for the word ‘connect’. For grammatical reasons, this same word will take on different forms like ‘connected’, ‘connection’, or ‘connecting’. All these words have the same meaning, but depending on the context in which they are used, they vary in spelling or ending. It therefore makes sense to use a single word to represent this group of related words when searching for a set of resources relevant to the query.

Why should we use a single word to represent a vector of relevant words?

  • The use of a single word to represent a vector of related words will typically improve the performance of an IR system.
  • Reducing the number of words will also reduce the overall size, and thus the complexity, of the data in the system.

What is Stemming?

Stemming is the process of identifying the root form, also called the base form, of a word by replacing or removing word suffixes. Researcher J.B. Lovins, in her article, “Development of a Stemming Algorithm,” defines stemming as:

“A stemming algorithm is a computational procedure which reduces all words with the same root… to a common form, usually by stripping each word of its derivational and inflectional suffixes.”

For example, ‘boat’ is the root of the words: ‘boats’, ‘boater’, or ‘boating’.

In stemming, the root form of a word is called the stem. So, in our case, ‘boat’ is called a stem. To reinforce the concept, let’s look at another example: ‘connect’ is the stem of the following words: ‘connect’, ‘connecting’, ‘connection’, ‘connections’, and ‘connected’.

Prefix stripping is not widely used for stemming; however, it may be useful in certain subjects, such as chemistry.

Does it mean that Stemming will always result in a valid root word?

No. For example, if we process the word ‘transparent’ through the Porter stemmer, the root word that we get is ‘transpar’. Similarly, if we process the same word through the Lancaster stemmer, the root word we get is ‘transp’. C.D. Paice, in “Another Stemmer,” mentions that it’s usually sufficient for IR systems to map related words to the same stem; the stem itself doesn’t need to be a valid word:

“…the process is aimed at mapping for retrieval purposes, the stem need not be a linguistically correct lemma or root.”
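As a quick check, both stems can be reproduced with a short NLTK sketch (the PorterStemmer and LancasterStemmer classes are discussed in more detail below):

from nltk.stem import PorterStemmer, LancasterStemmer

# Both stemmers reduce 'transparent' to a linguistically invalid root
print(PorterStemmer().stem('transparent'))     # transpar
print(LancasterStemmer().stem('transparent'))  # transp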

Errors in Stemming:

  • Over-Stemming: It occurs when two or more unrelated words result in the same stem.
  • Under-Stemming: It occurs when two or more related words result in different stems. A brief illustration of both errors follows this list.
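Here is a sketch of both errors using NLTK’s Porter stemmer; the example words are chosen purely for demonstration:

from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: semantically unrelated words collapse to the same stem
print(porter.stem('university'), porter.stem('universal'))  # univers univers

# Under-stemming: related words end up with different stems
print(porter.stem('alumnus'), porter.stem('alumni'))  # alumnu alumni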

Let’s go through the three most popular stemmers: Porter, Snowball, and Lancaster.

Note: A detailed explanation of how these algorithms work is beyond the scope of this article.

Porter Stemmer

It is one of the most commonly used stemmers, developed by M.F. Porter in 1980. Porter’s stemmer consists of five different phases. These phases are applied sequentially. Within each phase, there are certain conventions for selecting rules. The entire Porter algorithm is small and thus fast and simple. The drawback of this stemmer is that it supports only the English language, and the stem obtained may or may not be linguistically correct.
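A minimal sketch of such a snippet using NLTK’s PorterStemmer is given below; the word list is an assumption, chosen to match the outputs discussed in this article.

from nltk.stem import PorterStemmer

# Assumed word list, inferred from the outputs quoted below
words = ['was', 'found', 'mice', 'run', 'running', 'ran']

porter = PorterStemmer()
print(', '.join(porter.stem(word) for word in words))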

The code snippet shown above will produce: wa, found, mice, run, run, ran

Notice how the stem of ‘was’ is ‘wa’ according to the Porter algorithm, which is linguistically invalid.

Snowball Stemmer

M.F. Porter also developed the Snowball stemmer. Snowball is a string-processing language designed mainly for creating stemming algorithms. Porter created it as an improvement over his earlier Porter algorithm. It supports multiple languages, including English, Russian, Danish, French, Finnish, German, Italian, Hungarian, Portuguese, Norwegian, Swedish, and Spanish. The Snowball stemmer for the English language is called Porter2.
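Again, a minimal sketch using NLTK’s SnowballStemmer with the same assumed word list:

from nltk.stem import SnowballStemmer

words = ['was', 'found', 'mice', 'run', 'running', 'ran']

# 'english' selects the Porter2 algorithm
snowball = SnowballStemmer('english')
print(', '.join(snowball.stem(word) for word in words))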

The code snippet shown above will produce: was, found, mice, run, run, ran

According to M.F. Porter, the stemming of stopwords like ‘being’ to ‘be’ is useless because they don’t have any shared meaning, although there could be a grammatical connection between the two words. Snowball provides another parameter called ignore_stopwords, which is set to false by default. If it is set to true, then Snowball will not perform stemming of stopwords.
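A minimal sketch with ignore_stopwords set to true; note that this assumes the NLTK stopwords corpus has been downloaded (nltk.download('stopwords')).

from nltk.stem import SnowballStemmer

# With ignore_stopwords=True, stopwords such as 'being' are returned unchanged
snowball = SnowballStemmer('english', ignore_stopwords=True)
print(snowball.stem('being'))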

The code snippet shown above will produce: being

If ignore_stopwords is set to false, then the same code snippet will output: ‘be’. Try it for yourself.

Lancaster Stemmer

The Lancaster stemmer is also called the Paice/Husk stemmer. It was developed by C.D. Paice at Lancaster University in 1990. It uses an iterative approach, which makes it the most aggressive of the three stemmers described in this article. Due to this iterative approach, it may over-stem, which may result in linguistically incorrect roots. It is not as efficient as the Porter or Snowball stemmer. Also, it only supports the English language.
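A minimal sketch using NLTK’s LancasterStemmer with the same assumed word list:

from nltk.stem import LancasterStemmer

words = ['was', 'found', 'mice', 'run', 'running', 'ran']

lancaster = LancasterStemmer()
print(', '.join(lancaster.stem(word) for word in words))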

The code snippet shown above will produce: was, found, mic, run, run, ran

As the Lancaster stemmer uses a more aggressive approach, the word ‘mice’ gets reduced to ‘mic’.

What is Lemmatization? And how does it differ from Stemming?

As we have seen so far, a stemming algorithm may or may not return a valid stem. This is not an issue for IR systems; however, can such invalid stems be used in language modeling, where linguistically correct forms are critical? Probably not. To overcome this drawback of stemming, we use lemmatization, which returns a linguistically correct root. Lemmatization identifies the inflected forms of a word and returns its linguistically correct root. The book “Introduction to Information Retrieval” defines lemmatization as:

“Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word…”

💡 Inflected form of a word has a changed spelling or ending.

The root of a word in lemmatization is called the lemma. So, for words like ‘run’, ‘runs’, ‘running’, or ‘ran’, the lemma is ‘run’. For a word like ‘mice’, the lemma is ‘mouse’.

One of the most used lemmatizers is the WordNet lemmatizer provided by NLTK. To get the precise lemma, we must provide an appropriate part-of-speech (pos) tag. By default, every word is treated as a noun (n). We’ll pass verb (v) as the value of the pos parameter for all the words in the list.
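A minimal sketch using NLTK’s WordNetLemmatizer with the same assumed word list; this assumes the WordNet corpus has been downloaded (nltk.download('wordnet')).

from nltk.stem import WordNetLemmatizer

words = ['was', 'found', 'mice', 'run', 'running', 'ran']

# pos='v' tells the lemmatizer to treat every word as a verb
lemmatizer = WordNetLemmatizer()
print(', '.join(lemmatizer.lemmatize(word, pos='v') for word in words))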

The code snippet shown above will produce: be, find, mice, run, run, run

Here, we observe that ‘was’, ‘found’, ‘run’, ‘running’, and ‘ran’ are verbs, so the output we get is precise. However, the output for ‘mice’ is ‘mice’. Why? This happens because the word ‘mice’ is a noun, so changing the pos tag to noun (n) will give us ‘mouse’ as the output. Try it for yourself.

Conclusion

We saw how stemming and lemmatization differ, as well as how different stemmers work. There are many more stemmers and lemmatizers available than the ones mentioned in this article. To summarize the entire article, let’s answer two questions.

First, should we choose stemming or lemmatization for the preprocessing step? It depends on the application being created. Stemming is fast compared to lemmatization. Also, stemming may or may not return a valid stem or root, whereas lemmatization will return a linguistically correct root. So, in applications where speed matters, like search and retrieval systems, stemming could be preferred; and in applications where a valid root matters, like language modeling, lemmatization could be preferred. Remember, the final decision on what should be used for the preprocessing step will always depend on the application being created and the application creator. Try experimenting with both and see how the results differ.

Secondly, is stemming used only in IR systems? No. Stemming could be used in applications that require the transformation of morphological forms of words to their roots. This means that stemming could be performed in text summarization or even in text categorization.

Let me know in the comment section if you’ve used or tried experimenting with these stemmers or lemmatizers and how beneficial they were for you!

You can connect and reach out to me via LinkedIn.


Stemming: Porter Vs. Snowball Vs. Lancaster was originally published in Towards AI on Medium.


Published via Towards AI
