Natural Language Processing (NLP) with Python — Tutorial

Last Updated on October 21, 2021 by Editorial Team

Author(s): Pratik Shukla, Roberto Iriondo

Tutorial on the basics of natural language processing (NLP) with sample code implementation in Python

In this article, we explore the basics of natural language processing (NLP) with code examples. We dive into the Natural Language Toolkit (NLTK) library to show how it can be useful for NLP-related tasks. Afterward, we discuss the basics of other NLP libraries and other essential NLP methods, along with sample implementations in Python.

📚 Resources: Google Colab Implementation | GitHub Repository 📚

What is Natural Language Processing?
Applications of NLP
Understanding Natural Language Processing (NLP)
Rule-based NLP vs. Statistical NLP
Components of Natural Language Processing (NLP)
Current challenges in NLP
Easy to Use NLP Libraries
Exploring Features of NLTK
Word Cloud
Stemming
Lemmatization
Part-of-Speech (PoS) tagging
Chunking
Chinking
Named Entity Recognition (NER)
WordNet
Bag of Words
TF-IDF

What is Natural Language Processing?

Computers are great at working with tabular data and spreadsheets. However, human beings generally communicate in words and sentences, not in tables. Much of what humans speak or write is unstructured, so it is hard for computers to interpret. In natural language processing (NLP), the goal is to make computers understand unstructured text and retrieve meaningful pieces of information from it. NLP is a subfield of artificial intelligence concerned with the interactions between computers and human language.

Applications of NLP:

Machine Translation.
Speech Recognition.
Sentiment Analysis.
Question Answering.
Summarization of Text.
Chatbot.
Intelligent Systems.
Text Classifications.
Character Recognition.
Spell Checking.
Spam Detection.
Autocomplete.
Named Entity Recognition.
Predictive Typing.

Understanding Natural Language Processing (NLP):

Figure 1: Revealing, listening, and understanding.

We, as humans, perform natural language processing (NLP) considerably well, but even then, we are not perfect. We often misunderstand one thing for another, and we often interpret the same sentences or words differently.

For instance, consider the following sentence, which we can interpret in several different ways:

Example 1:

Figure 2: NLP example sentence with the text: “I saw a man on a hill with a telescope.”

These are some interpretations of the sentence shown above.

There is a man on the hill, and I watched him with my telescope.
There is a man on the hill, and he has a telescope.
I’m on a hill, and I saw a man using my telescope.
I’m on a hill, and I saw a man who has a telescope.
There is a man on a hill, and I saw him something with my telescope.

Example 2:

Figure 3: NLP example sentence with the text: “Can you help me with the can?”

In the sentence above, we can see that there are two “can” words, but both of them have different meanings. Here the first “can” word is used for question formation. The second “can” word at the end of the sentence is used to represent a container that holds food or liquid.

Hence, from the examples above, we can see that language processing is not “deterministic” (the same sentence does not always have the same interpretation), and something suitable to one person might not be suitable to another. Therefore, natural language processing takes a non-deterministic approach. In other words, NLP can be used to create intelligent systems that account for how humans understand and interpret language in different situations.

Rule-based NLP vs. Statistical NLP:

Natural language processing is divided into two different approaches:

Rule-based Natural Language Processing:

It uses common-sense reasoning for processing tasks. For instance, freezing temperatures can lead to death, or hot coffee can burn people’s skin. However, encoding such rules takes a lot of time and requires manual effort.

Statistical Natural Language Processing:

It uses large amounts of data and tries to derive conclusions from it. Statistical NLP uses machine learning algorithms to train NLP models. After successful training on large amounts of data, the trained model can draw reasonably accurate conclusions about new text.

Comparison:

Figure 4: Rule-Based NLP vs. Statistical NLP.

Components of Natural Language Processing (NLP):

a. Lexical Analysis:

With lexical analysis, we divide a whole chunk of text into paragraphs, sentences, and words. It involves identifying and analyzing words’ structure.

b. Syntactic Analysis:

Syntactic analysis involves the analysis of words in a sentence for grammar and arranging words in a manner that shows the relationship among the words. For instance, the sentence “The shop goes to the house” does not pass.

c. Semantic Analysis:

Semantic analysis draws the exact meaning from the words and checks whether the text is meaningful. Phrases such as “hot ice-cream” do not pass.

d. Discourse Integration:

Discourse integration takes into account the context of the text. The meaning of a sentence depends on the sentences that come before it. For example: “He works at Google.” In this sentence, “he” must refer to someone introduced in an earlier sentence.

e. Pragmatic Analysis:

Pragmatic analysis deals with overall communication and interpretation of language. It deals with deriving meaningful use of language in various situations.

Current challenges in NLP:

Breaking sentences into tokens.
Tagging parts of speech (POS).
Building an appropriate vocabulary.
Linking the components of a created vocabulary.
Understanding the context.
Extracting semantic meaning.
Named Entity Recognition (NER).
Transforming unstructured data into structured data.
Ambiguity in speech.

Easy to use NLP libraries:

a. NLTK (Natural Language Toolkit):

The NLTK Python framework is generally used as an education and research tool. It’s not usually used in production applications. However, it can be used to build exciting programs due to its ease of use.

Features:

Tokenization.
Part Of Speech tagging (POS).
Named Entity Recognition (NER).
Classification.
Sentiment analysis.
Packages of chatbots.

Use-cases:

Recommendation systems.
Sentiment analysis.
Building chatbots.

Figure 6: Pros and cons of using the NLTK framework.

b. spaCy:

spaCy is an open-source natural language processing Python library designed to be fast and production-ready. spaCy focuses on providing software for production usage.

Features:

Tokenization.
Part Of Speech tagging (POS).
Named Entity Recognition (NER).
Classification.
Sentiment analysis.
Dependency parsing.
Word vectors.

Use-cases:

Autocomplete and autocorrect.
Analyzing reviews.
Summarization.

Figure 7: Pros and cons of the spaCy framework.

c. Gensim:

Gensim is an NLP Python framework generally used in topic modeling and similarity detection. It is not a general-purpose NLP library, but it handles tasks assigned to it very well.

Features:

Latent semantic analysis.
Non-negative matrix factorization.
TF-IDF.

Use-cases:

Converting documents to vectors.
Finding text similarity.
Text summarization.

Figure 8: Pros and cons of the Gensim framework.

d. Pattern:

Pattern is an NLP Python framework with straightforward syntax. It’s a powerful tool for scientific and non-scientific tasks. It is highly valuable to students.

Features:

Tokenization.
Part of Speech tagging.
Named entity recognition.
Parsing.
Sentiment analysis.

Use-cases:

Spelling correction.
Search engine optimization.
Sentiment analysis.

Figure 9: Pros and cons of the Pattern framework.

e. TextBlob:

TextBlob is a Python library designed for processing textual data.

Features:

Part-of-Speech tagging.
Noun phrase extraction.
Sentiment analysis.
Classification.
Language translation.
Parsing.
Wordnet integration.

Use-cases:

Sentiment Analysis.
Spelling Correction.
Translation and Language Detection.

Figure 10: Pros and cons of the TextBlob library.

For this tutorial, we are going to focus more on the NLTK library. Let’s dig deeper into natural language processing by making some examples.

Exploring Features of NLTK:

a. Open the text file for processing:

First, we are going to open and read the file which we want to analyze.

Figure 11: Small code snippet to open and read the text file and analyze it.

Next, notice that the data type of the text file read is a String. The number of characters in our text file is 675.

b. Import required libraries:

For various data processing cases in NLP, we need to import some libraries. In this case, we are going to use NLTK for Natural Language Processing. We will use it to perform various operations on the text.

Figure 13: Importing the required libraries.

c. Sentence tokenizing:

By tokenizing the text with sent_tokenize(), we can get the text as sentences.

Figure 14: Using sent_tokenize() to tokenize the text as sentences.

In the example above, the entire text is represented as a list of sentences, and the total number of sentences is 9.

d. Word tokenizing:

By tokenizing the text with word_tokenize(), we can get the text as words.

Figure 16: Using word_tokenize() to tokenize the text as words.

Next, we can see that the entire text is represented as a list of words, and the total number of words here is 144.
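To make the two tokenizing steps concrete, here is a minimal standard-library sketch. It is a rough regular-expression stand-in for sent_tokenize() and word_tokenize(), not NLTK's actual algorithm, and the sample text is a made-up assumption rather than the tutorial's text file.

```python
import re

# Hypothetical sample text; the tutorial reads its text from a file instead.
text = "NLP is fun. It helps computers read text! Can you help me with the can?"

# Sentence split: break after ., ! or ? when whitespace follows.
# This is a crude rule and would wrongly split abbreviations such as "Dr. Smith".
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word split: runs of word characters are tokens, and each punctuation
# mark becomes a token of its own.
words = re.findall(r"\w+|[^\w\s]", text)

print(len(sentences), sentences)
print(len(words), words)
```

Notice that punctuation marks are counted as tokens, which is why the word count is higher than the number of actual words.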

e. Find the frequency distribution:

Let’s find out the frequency of words in our text.

Figure 18: Using FreqDist() to find the frequency of words in our sample text.

Figure 19: Printing the ten most common words from the sample text.

Notice that the most used words are punctuation marks and stopwords. We will have to remove such words to analyze the actual text.
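As a rough illustration of this step, the sketch below counts tokens with collections.Counter from the standard library. NLTK's FreqDist is built on Counter, so most_common() behaves the same way there. The token list is a hypothetical stand-in for the output of word_tokenize().

```python
from collections import Counter

# Hypothetical token list standing in for the output of word_tokenize().
tokens = ["the", "cat", "sat", "on", "the", "mat", ".",
          "the", "mat", "was", "warm", "."]

# Count how often each token occurs.
freq = Counter(tokens)

# Show the three most frequent tokens with their counts.
for token, count in freq.most_common(3):
    print(token, count)
```

Even in this tiny sample, a stopword ("the") and a punctuation mark (".") are among the most frequent tokens.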

f. Plot the frequency graph:

Let’s plot a graph to visualize the word distribution in our text.

Figure 20: Plotting a graph to visualize the text distribution.

In the graph above, notice that a period “.” is used nine times in our text. Analytically speaking, punctuation marks are not that important for natural language processing. Therefore, in the next step, we will be removing such punctuation marks.

g. Remove punctuation marks:

Next, we are going to remove the punctuation marks, as they are not very useful for us. We are going to use the isalpha() method to separate the punctuation marks from the actual text. Also, we are going to make a new list called words_no_punc, which will store the words in lower case but exclude the punctuation marks.

Figure 21: Using the isalpha() method to separate the punctuation marks, along with creating a list under words_no_punc to separate words with no punctuation marks.

As shown above, all the punctuation marks are excluded from our text. We can also cross-check this against the number of words.
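The filtering step can be sketched as follows. The token list is a made-up assumption, while words_no_punc is the list name used in this tutorial.

```python
# Hypothetical tokens standing in for the output of word_tokenize().
tokens = ["The", "cat", "sat", "on", "the", "mat", ".", "It", "was", "warm", "!"]

# Keep only purely alphabetic tokens and lowercase them.
words_no_punc = [token.lower() for token in tokens if token.isalpha()]

print(words_no_punc)
print(len(tokens), "tokens before,", len(words_no_punc), "after")
```

Keep in mind that isalpha() also drops tokens that mix letters with other characters, such as “2021” or “don't”, so this filter removes more than punctuation.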

h. Plotting graph without punctuation marks:

Figure 23: Printing the ten most common words from the sample text.

Figure 24: Plotting the graph without punctuation marks.

Notice that we still have many words that are not very useful in the analysis of our text file sample, such as “and,” “but,” “so,” and others. Next, we need to remove such stopwords.

i. List of stopwords:

Figure 25: Importing the list of stopwords.

j. Removing stopwords:

Figure 27: Cleaning the text sample data.
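A minimal sketch of stopword removal follows. The stop_words set here is a small hand-written assumption for illustration. NLTK ships a much longer English list in its stopwords corpus, which has to be downloaded separately.

```python
# Small illustrative stopword set (an assumption, not NLTK's full list).
stop_words = {"a", "and", "but", "is", "it", "on", "so", "the", "was"}

# Hypothetical lowercased tokens with punctuation already removed.
words_no_punc = ["the", "cat", "sat", "on", "the", "mat", "it", "was", "warm"]

# Keep only the words that are not stopwords.
clean_words = [word for word in words_no_punc if word not in stop_words]

print(clean_words)
```

Using a set for the stopwords keeps each membership check fast, which matters for longer texts.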

k. Final frequency distribution:

Figure 29: Displaying the final frequency distribution of the most common words found.

Figure 30: Visualization of the most common words found in the group.

As shown above, the final graph has many useful words that help us understand what our sample data is about, showing how essential it is to perform data cleaning on NLP.

Next, we will cover various topics in NLP with coding examples.

Word Cloud:

A word cloud is a data visualization technique in which words from a given text are displayed on a single chart. More frequent or essential words are displayed in a larger and bolder font, while less frequent or less essential words are displayed in smaller or thinner fonts. It is a useful technique in NLP that gives us a quick glance at what a text is about.

Properties:

font_path: It specifies the path for the fonts we want to use.
width: It specifies the width of the canvas.
height: It specifies the height of the canvas.
min_font_size: It specifies the smallest font size to use.
max_font_size: It specifies the largest font size to use.
font_step: It specifies the step size for the font.
max_words: It specifies the maximum number of words on the word cloud.
stopwords: Our program will eliminate these words.
background_color: It specifies the background color for canvas.
normalize_plurals: It removes the trailing “s” from words.

Read the full documentation on WordCloud.

Word Cloud Python Implementation:

Figure 31: Python code implementation of the word cloud.

As shown in the graph above, the most frequent words display in larger fonts. The word cloud can be displayed in any shape or image.

For instance: In this case, we are going to use the following circle image, but we can use any shape or any image.

Figure 33: Circle image shape for our word cloud.

Word Cloud Python Implementation:

Figure 34: Python code implementation of the word cloud.

Figure 35: Word cloud with the circle shape.

As shown above, the word cloud is in the shape of a circle. As we mentioned before, we can use any shape or image to form a word cloud.

Word Cloud Advantages:

They are fast.
They are engaging.
They are simple to understand.
They are casual and visually appealing.

Word Cloud Disadvantages:

They are not well suited to data that has not been cleaned.
They lack the context of words.

Stemming:

We use stemming to normalize words. In English and many other languages, a single word can take multiple forms depending on the context in which it is used. For instance, the verb “study” can take many forms like “studies,” “studying,” “studied,” and others, depending on its context. When we tokenize words, an interpreter considers these input words as different words even though their underlying meaning is the same. Since NLP is about analyzing the meaning of content, we use stemming to resolve this problem.

Stemming normalizes a word by truncating it to its stem. For example, the words “studies,” “studied,” and “studying” will be reduced to “studi,” making all these word forms refer to only one token. Notice that stemming may not give us a grammatical dictionary word for a particular set of words.

Let’s take an example:

a. Porter’s Stemmer Example 1:

In the code snippet below, we show that all the words truncate to their stem words. However, notice that the stemmed word is not a dictionary word.

Figure 36: Code snippet showing a stemming example.

b. Porter’s Stemmer Example 2:

In the code snippet below, many of the words after stemming did not end up being a recognizable dictionary word.

c. SnowballStemmer:

SnowballStemmer generates output similar to that of the Porter stemmer, but it supports many more languages.
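The stemmers discussed above can be tried with a short sketch like the one below. It assumes NLTK is installed. Neither stemmer needs any extra corpus download, and the word list is simply the “study” family used earlier in this section.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Compare both stemmers on different forms of the same word.
for word in ["study", "studies", "studying", "studied"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))

# Languages for which a Snowball stemmer can be constructed.
print(SnowballStemmer.languages)
```

With the Porter stemmer, all four forms end up as “studi,” which is a useful shared token even though it is not a dictionary word.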

d. Languages supported by snowball stemmer:

Various Stemming Algorithms:

a. Porter’s Stemmer:

Figure 40: Porter’s Stemmer NLP algorithm, pros, and cons.

b. Lovins Stemmer:

Figure 41: Lovins Stemmer NLP algorithm, pros, and cons.

c. Dawson’s Stemmer:

Figure 42: Dawson’s Stemmer NLP algorithm, pros, and cons.

d. Krovetz Stemmer:

Figure 43: Krovetz Stemmer NLP algorithm, pros, and cons.

e. Xerox Stemmer:

Figure 44: Xerox Stemmer NLP algorithm, pros, and cons.

f. Snowball Stemmer:

Figure 45: Snowball Stemmer NLP algorithm, pros, and cons.

Lemmatization:

Lemmatization tries to achieve a similar base “stem” for a word. However, what makes it different is that it finds the dictionary word instead of truncating the original word. Stemming does not consider the context of the word. That is why it generates results faster, but it is less accurate than lemmatization.

If accuracy is not the project’s final goal, then stemming is an appropriate approach. If higher accuracy is crucial and the project is not on a tight deadline, then the best option is lemmatization (lemmatization has a lower processing speed compared to stemming).

Lemmatization takes into account Part Of Speech (POS) values. Also, lemmatization may generate different outputs for different values of POS. We generally have four choices for POS:

Figure 46: Part of Speech (POS) values in lemmatization.

Difference between Stemmer and Lemmatizer:

a. Stemming:

Notice how on stemming, the word “studies” gets truncated to “studi.”

Figure 47: Using stemming with the NLTK Python framework.

b. Lemmatizing:

During lemmatization, the word “studies” displays its dictionary word “study.”

Figure 48: Using lemmatization with the NLTK Python framework.

Python Implementation:

a. A basic example demonstrating how a lemmatizer works

In the following example, we are taking the PoS tag as “verb,” and when we apply the lemmatization rules, it gives us dictionary words instead of truncating the original word:

Figure 49: Simple lemmatization example with the NLTK framework.

b. Lemmatizer with default PoS value

The default PoS value in lemmatization is noun (n). In the following example, we can see that it generates dictionary words:

Figure 50: Using lemmatization to generate default values.

c. Another example demonstrating the power of lemmatizer

Figure 51: Lemmatization of the words: “am”, “are”, “is”, “was”, “were”

d. Lemmatizer with different POS values

Figure 52: Lemmatization with different Part-of-Speech values.

Part of Speech Tagging (PoS tagging):

Why do we need Part of Speech (POS)?

Figure 53: Sentence example, “can you help me with the can?”

Part-of-Speech (PoS) tagging is crucial for syntactic and semantic analysis. In the sentence above, the word “can” has several meanings. The first “can” is used for question formation, and the second “can” at the end of the sentence represents a container. The first “can” is a verb, and the second “can” is a noun. Tagging each word with its part of speech allows the program to handle it correctly in both semantic and syntactic analysis.

Below, please find a list of Part of Speech (PoS) tags with their respective examples:

1. CC: Coordinating Conjunction

Figure 54: Coordinating conjunction example.

2. CD: Cardinal Number

3. DT: Determiner

4. EX: Existential There

Figure 57: Existential “there” example.

5. FW: Foreign Word

6. IN: Preposition / Subordinating Conjunction

Figure 59: Preposition/Subordinating conjunction.

7. JJ: Adjective

8. JJR: Adjective, Comparative

Figure 61: Adjective, comparative example.

9. JJS: Adjective, Superlative

10. LS: List Marker

11. MD: Modal

12. NN: Noun, Singular

13. NNS: Noun, Plural

14. NNP: Proper Noun, Singular

Figure 67: Proper noun, singular example.

15. NNPS: Proper Noun, Plural

16. PDT: Predeterminer

17. POS: Possessive Endings

18. PRP: Personal Pronoun

19. PRP$: Possessive Pronoun

20. RB: Adverb

21. RBR: Adverb, Comparative

22. RBS: Adverb, Superlative

23. RP: Particle

24. TO: To

25. UH: Interjection

26. VB: Verb, Base Form

27. VBD: Verb, Past Tense

28. VBG: Verb, Present Participle

Figure 81: Verb, present participle example.

29. VBN: Verb, Past Participle

30. VBP: Verb, Present Tense, Not Third Person Singular

Figure 83: Verb, present tense, not third-person singular.

31. VBZ: Verb, Present Tense, Third Person Singular

Figure 84: Verb, present tense, third-person singular.

32. WDT: Wh-Determiner

33. WP: Wh-Pronoun

34. WP$: Possessive Wh-Pronoun

35. WRB: Wh-Adverb

Python Implementation:

a. A simple example demonstrating PoS tagging.

b. A full example demonstrating the use of PoS tagging.

Figure 90: Full Python sample demonstrating PoS tagging.

Chunking:

Chunking means extracting meaningful phrases from unstructured text. When we tokenize a book into individual words, it is sometimes hard to infer meaningful information. Chunking works on top of Part-of-Speech (PoS) tagging: it takes PoS tags as input and provides chunks as output. A chunk is literally a group of words, so chunking breaks simple text into phrases that are more meaningful than individual words.

Before working with an example, we need to know what phrases are. Meaningful groups of words are called phrases. There are five significant categories of phrases:

Noun Phrases (NP).
Verb Phrases (VP).
Adjective Phrases (ADJP).
Adverb Phrases (ADVP).
Prepositional Phrases (PP).

Phrase structure rules:

S(Sentence) → NP VP.
NP → {Determiner, Noun, Pronoun, Proper name}.
VP → V (NP)(PP)(Adverb).
PP → Preposition (NP).
AP → Adjective (PP).

Example:

Python Implementation:

In the following example, we will extract a noun phrase from the text. Before extracting it, we need to define what kind of noun phrase we are looking for. In other words, we have to set the grammar for a noun phrase. In this case, we define a noun phrase as an optional determiner followed by adjectives and nouns. We can then define other rules to extract other kinds of phrases. Next, we use RegexpParser() to parse the grammar. Notice that we can also visualize the result with the .draw() function.

Figure 93: Code snippet to extract noun phrases from a text file.

In this example, we can see that we have successfully extracted the noun phrase from the text.

Figure 94: Successful extraction of the noun phrase from the input text.
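The noun-phrase extraction described above can be sketched as follows, assuming NLTK is installed. To keep the example free of model downloads, the sentence is tagged by hand (an assumption). In the tutorial, the tags come from PoS tagging. The grammar here allows an optional determiner, any number of adjectives, and a single noun.

```python
from nltk import RegexpParser

# Hand-tagged sentence (an assumption so that no tagger model is needed).
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Noun phrase: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"

parser = RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)

# Collect the text of every NP chunk found in the tree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Calling tree.draw() would open a window with the parse tree, which is convenient on a desktop but is left out here so the sketch also runs without a display.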

Chinking:

Chinking excludes a part from our chunk. There are certain situations where we need to exclude a part of the text from the whole text or chunk. In complex extractions, chunking can output data that is not useful. In such cases, we can use chinking to exclude some parts from the chunked text.

In the following example, we take the whole string as a chunk and then exclude adjectives from it by using chinking. We generally use chinking when a lot of unhelpful data remains even after chunking, and this method lets us set it apart easily. To write a chinking grammar, we use inverted curly braces, i.e.:

} write chinking grammar here {

Python Implementation:

Figure 95: Chinking implementation with Python.

From the example above, we can see that adjectives separate from the other text.

Figure 96: In this example, adjectives are excluded by using chinking.
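As a minimal sketch of chinking, assuming NLTK is installed, the grammar below first chunks the whole sentence and then cuts runs of adjectives back out. The sentence is tagged by hand and the chunk label CHUNK is an arbitrary name chosen for illustration. Both are assumptions.

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Hand-tagged sentence (an assumption so that no tagger model is needed).
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

grammar = r"""
CHUNK:
    {<.*>+}    # chunk every token in the sentence
    }<JJ>+{    # chink runs of adjectives back out
"""

tree = RegexpParser(grammar).parse(tagged)
print(tree)

# Chunks are subtrees; chinked tokens stay as plain (word, tag) pairs.
chunks = [" ".join(word for word, tag in child.leaves())
          for child in tree if isinstance(child, Tree)]
outside = [child for child in tree if not isinstance(child, Tree)]
print(chunks)
print(outside)
```

Because the adjectives sit in the middle of the original chunk, removing them splits it into two smaller chunks.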

Named Entity Recognition (NER):

Named entity recognition can automatically scan entire articles and pull out the fundamental entities discussed in them, such as people, organizations, places, dates, times, monetary values, and geopolitical entities (GPE).

Use-Cases:

Content classification for news channels.
Summarizing resumes.
Optimizing search engine algorithms.
Recommendation systems.
Customer support.

Commonly used types of named entity:

Figure 97: An example of commonly used types of named entity recognition (NER).

Python Implementation:

There are two options:

1. binary = True

When the binary value is True, it only shows whether a particular entity is a named entity or not. It does not show any further details about it.

Figure 98: Python implementation when a binary value is True.

Our graph does not show what type of named entity it is. It only shows whether a particular word is a named entity or not.

Figure 99: Graph example of when a binary value is True.

2. binary = False

When the binary value equals False, it shows in detail the type of named entities.

Figure 100: Python implementation when a binary value is False.

Our graph now shows what type of named entity it is.

Figure 101: Graph showing the type of named entities when a binary value equals false.

WordNet:

WordNet is a lexical database for the English language, and it is included among the NLTK corpora. We can use WordNet to find the meanings of words, synonyms, antonyms, and more.

a. We can check how many different definitions of a word are available in Wordnet.

Figure 102: Checking word definitions with Wordnet using the NLTK framework.

b. We can also check the meaning of those different definitions.

Figure 103: Gathering the meaning of the different definitions by using Wordnet.

c. All details for a word.

Figure 104: Finding all the details for a specific word.

d. All details for all meanings of a word.

Figure 105: Finding all details for all the meanings of a specific word.

e. Hypernyms: A hypernym gives us a more abstract term for a word.

Figure 106: Using Wordnet to find a hypernym.

f. Hyponyms: A hyponym gives us a more specific term for a word.

Figure 107: Using Wordnet to find a hyponym.

g. Get a name only.

Figure 108: Finding only a name with Wordnet.

h. Synonyms.

Figure 109: Finding synonyms with Wordnet.

i. Antonyms.

Figure 110: Finding antonyms with Wordnet.

j. Synonyms and antonyms.

Figure 111: Finding synonyms and antonyms code snippet with Wordnet.

k. Finding the similarity between words.

Figure 112: Finding the similarity between words using Wordnet.

Figure 113: Finding the similarity ratio between words using Wordnet.

Bag of Words:

What is the Bag-of-Words method?

It is a method of extracting essential features from raw text so that we can use them in machine learning models. We call it a “bag” of words because we discard the order in which the words occur. A bag of words model converts the raw text into words and counts the frequency of each word in the text. In summary, a bag of words is a collection of words that represents a sentence along with the word counts, where the order of occurrence is not relevant.

Figure 115: Structure of a bag of words.

Raw Text: This is the original text on which we want to perform analysis.
Clean Text: Our raw text contains some unnecessary data like punctuation marks and stopwords, so we need to clean it up. Clean text is the text after removing such words.
Tokenize: Tokenization represents the sentence as a group of tokens or words.
Building Vocab: It contains all the words used in the text after removing unnecessary data.
Generate Vocab: It contains the words along with their frequencies in the sentences.

For instance:

Sentences:

Jim and Pam traveled by bus.
The train was late.
The flight was full. Traveling by flight is expensive.

a. Creating a basic structure:

Figure 116: Example of a basic structure for a bag of words.

b. Words with frequencies:

Figure 117: Example of a basic structure for words with frequencies.

c. Combining all the words:

Figure 118: Combination of all the input words.

d. Final model:

Figure 119: The final model of our bag of words.

Python Implementation:
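Here is a minimal standard-library sketch of the steps above, applied to the three example sentences. It is one possible implementation under simple assumptions, not necessarily the one pictured in the figures. The tokenize helper is hypothetical, and stopwords are kept so that every word stays visible in the vocabulary.

```python
import re
from collections import Counter

documents = [
    "Jim and Pam traveled by bus.",
    "The train was late.",
    "The flight was full. Traveling by flight is expensive.",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

# Clean and tokenize every document.
tokenized = [tokenize(doc) for doc in documents]

# Build the vocabulary: every distinct word, in alphabetical order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Build one count vector per document, aligned with the vocabulary.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one slot per vocabulary word, so word order within a sentence is lost, which is exactly the “bag” property described earlier.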

Applications:

Natural language processing.
Information retrieval from documents.
Classifications of documents.

Limitations:

Semantic meaning: It does not consider the semantic meaning of a word. It ignores the context in which the word is used.
Vector size: For large documents, the vector size increases, which may result in higher computational time.
Preprocessing: In preprocessing, we need to perform data cleansing before using it.

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency, a scoring measure generally used in information retrieval (IR) and summarization. The TF-IDF score shows how important or relevant a term is in a given document.

The intuition behind TF and IDF:

If a particular word appears multiple times in a document, then it might have higher importance than the other words that appear fewer times (TF). At the same time, if a particular word appears many times in a document but is also present many times in other documents, then the word is simply common, so we cannot assign much importance to it (IDF).

For instance, suppose we have a database of thousands of dog descriptions, and the user wants to search for “a cute dog.” The job of our search engine is to display the closest response to the user query. How would a search engine do that? It could use TF-IDF to calculate a score for each description and display the result with the highest score. This is the case when there is no exact match for the user’s query. If there is an exact match, that result is displayed first. Let’s suppose there are four descriptions available in our database:

The furry dog.
A cute doggo.
A big dog.
The lovely doggo.

Matching word for word, the first description contains one word from our user query (“dog”), the second contains two (“a” and “cute”), the third also contains two (“a” and “dog”), and the fourth contains none. We can sense that the closest answer to our query is description number two, as it contains the essential word “cute” from the user’s query. TF-IDF captures this intuition as follows.

Notice that the term frequency values are the same for all of the sentences since no word repeats within a sentence. So, in this case, TF is not informative, and we use IDF values to find the closest answer to the query. The words “dog” and “doggo” appear in many documents, so their IDF values are low, and their TF-IDF values are low as well. The word “cute,” however, appears in relatively few descriptions, which increases its TF-IDF value. So the word “cute” has more discriminative power than “dog” or “doggo.” Our search engine will therefore find the descriptions that contain the word “cute,” which is what the user was looking for.
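The search-engine intuition can be sketched with the four descriptions above. The formulas are assumptions based on a common textbook definition: TF is the term count divided by the document length, and IDF is the natural log of the number of documents divided by the number of documents containing the term. Words are matched exactly, so “dog” and “doggo” count as different terms in this sketch.

```python
import math
import re

descriptions = ["The furry dog.", "A cute doggo.", "A big dog.", "The lovely doggo."]
query = "a cute dog"

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenize(description) for description in descriptions]
n_docs = len(docs)

def idf(term):
    """log(N / df); terms that occur in no document get a score of 0."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df) if df else 0.0

def score(doc):
    """Sum of TF * IDF over the query terms for one document."""
    return sum((doc.count(term) / len(doc)) * idf(term) for term in tokenize(query))

scores = [score(doc) for doc in docs]
best = max(range(n_docs), key=scores.__getitem__)

print("IDF cute:", idf("cute"), "| IDF dog:", idf("dog"))
print("Best match:", descriptions[best])
```

The rarer word “cute” receives a higher IDF than “dog,” so the description containing it is ranked first.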

Simply put, the higher the TF-IDF score, the rarer and more valuable the term, and vice versa.

Now we are going to take a straightforward example and understand TF-IDF in more detail.

Example:

Sentence 1: This is the first document.

Sentence 2: This document is the second document.

TF: Term Frequency

Figure 123: Calculation for the term frequency on TF-IDF.

a. Represent the words of the sentences in the table.

Figure 124: Table representation of the sentences.

b. Displaying the frequency of words.

Figure 125: Table showing the frequency of words.

c. Calculating TF using a formula.

Figure 127: Resulting TF.
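Steps a through c can be reproduced with a short sketch. It assumes the usual definition of term frequency, the number of times a word occurs in a sentence divided by the total number of words in that sentence; `term_frequency` is a hypothetical helper name.

```python
from collections import Counter

sentences = [
    "This is the first document.",
    "This document is the second document.",
]

def term_frequency(sentence):
    # Lowercase, drop the trailing period, and split on whitespace.
    words = sentence.lower().rstrip(".").split()
    counts = Counter(words)
    # Assumed formula: occurrences of the word / total words in the sentence.
    return {word: count / len(words) for word, count in counts.items()}

tf = [term_frequency(s) for s in sentences]
for i, row in enumerate(tf, start=1):
    print(f"Sentence {i}:", {w: round(v, 3) for w, v in row.items()})
```

Every word in the first sentence gets a TF of 1/5, while “document” in the second sentence gets 2/6 because it occurs twice among six words.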

IDF: Inverse Document Frequency

d. Calculating IDF values from the formula.

Figure 129: Calculating IDF values from the formula.

e. Calculating TF-IDF.

TF-IDF is the product of TF and IDF.

Figure 130: The resulting multiplication of TF-IDF.
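As a rough sketch of steps d and e, the snippet below multiplies TF by a plain-ratio IDF. The IDF formula used here (number of documents divided by the number of documents containing the word, with no log) is an assumption, since the formula from the figure is not reproduced in the text.

```python
from collections import Counter

sentences = [
    "This is the first document.",
    "This document is the second document.",
]
docs = [s.lower().rstrip(".").split() for s in sentences]
vocab = sorted({w for doc in docs for w in doc})

# Assumed IDF: number of documents / number of documents containing the word.
idf = {w: len(docs) / sum(w in doc for doc in docs) for w in vocab}

# TF-IDF = TF * IDF, with TF as count / sentence length.
tfidf = [
    {w: Counter(doc)[w] / len(doc) * idf[w] for w in vocab}
    for doc in docs
]

for i, row in enumerate(tfidf, start=1):
    print(f"Sentence {i}:", {w: round(v, 3) for w, v in row.items()})
```

Under this assumed variant, “first” scores highest in the first sentence. In the second sentence, “second” ties with “document,” because “document” occurs twice there even though it appears in both sentences.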

In this case, notice that the important words that discriminate between the two sentences are “first” in sentence 1 and “second” in sentence 2. As we can see, those words have a relatively higher value than the other words.

However, there are many variations for smoothing out the values for large documents. The most common variation is to use a log value for IDF. Let’s calculate the TF-IDF value again using the new IDF value.

Figure 131: Using a log value for TF-IDF by using the new IDF value.

f. Calculating IDF value using log.

Figure 132: Calculating the IDF value using a log.

g. Calculating TF-IDF.

Figure 133: Calculating TF-IDF using a log.
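The log variant in steps f and g can be sketched the same way. The choice of the natural log is an assumption (a different base only rescales the values), and `log_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

sentences = [
    "This is the first document.",
    "This document is the second document.",
]
docs = [s.lower().rstrip(".").split() for s in sentences]
vocab = sorted({w for doc in docs for w in doc})

def log_idf(term):
    # Assumed variant: ln(number of documents / documents containing the term).
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

tfidf = [
    {w: Counter(doc)[w] / len(doc) * log_idf(w) for w in vocab}
    for doc in docs
]

# Words shared by both sentences have IDF = ln(1) = 0, so only the
# distinguishing words keep a non-zero score.
nonzero = [{w for w, v in row.items() if v > 0} for row in tfidf]
print(nonzero)  # [{'first'}, {'second'}]
```

With the log applied, every word that occurs in both sentences drops to zero, which makes the discriminating words stand out more sharply than with the plain ratio.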

As seen above, “first” and “second” are the important words that help us distinguish between the two sentences.

Now that we have seen the basics of TF-IDF, we are going to use the sklearn library to implement it in Python. The actual output of our program is calculated with a slightly different formula. First, we will see an overview of those calculations and formulas, and then we will implement them in Python.

Actual Calculations:

a. Term Frequency (TF):

b. Inverse Document Frequency (IDF):

Figure 136: Applying a log to the IDF values.

c. Calculating final TF-IDF values:

Figure 137: Calculating the final TF-IDF values.
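For readers who want to check these numbers by hand, here is a sketch of the calculation in plain Python. It follows the defaults documented for scikit-learn's `TfidfVectorizer`: TF is the raw count, IDF is ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit Euclidean length. The helper names `smooth_idf` and `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

sentences = [
    "This is the first document.",
    "This document is the second document.",
]
docs = [s.lower().rstrip(".").split() for s in sentences]
vocab = sorted({w for doc in docs for w in doc})
n = len(docs)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so no term gets a zero weight.
    df = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    # TF is the raw count of the term in the sentence.
    raw = [counts[w] * smooth_idf(w) for w in vocab]
    # L2 normalization: divide by the row's Euclidean length.
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for i, row in enumerate(rows, start=1):
    print(f"Sentence {i}:", dict(zip(vocab, (round(v, 4) for v in row))))
```

Because of the added 1, words that appear in every sentence keep a small non-zero weight instead of dropping to zero as they do with a plain log.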

Python Implementation:
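The original code listing is not reproduced here, so the following is a minimal sketch of how the two example sentences could be run through scikit-learn's `TfidfVectorizer` with its default settings. It assumes scikit-learn is installed; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "This is the first document.",
    "This document is the second document.",
]

# Default settings: lowercasing, smoothed IDF, and L2-normalized rows.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sentences)

# Recover the vocabulary in column order from the fitted term-to-index mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = matrix.toarray()

for i, row in enumerate(dense, start=1):
    print(f"Sentence {i}:", dict(zip(terms, row.round(4))))
```

In the printed rows, “first” should carry the largest weight in the first sentence, while words shared by both sentences receive smaller, non-zero weights.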

Conclusion:

These are some of the basics of the exciting field of natural language processing (NLP). We hope you enjoyed reading this article and learned something new. Your suggestions and feedback are crucial for us to continue to improve. Please let us know in the comments if you have any.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Published via Towards AI

Citation

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Natural Language Processing (NLP) with Python — Tutorial”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020, 
 title={Natural Language Processing (NLP) with Python — Tutorial}, 
 url={https://towardsai.net/nlp-tutorial-with-python}, 
 journal={Towards AI}, 
 publisher={Towards AI Co.}, 
 author={Pratik, Shukla and Iriondo, Roberto},  
 year={2020}, 
 month={Jul}
}

References:

[1] The example text was gathered from American Literature, https://americanliterature.com/

[2] Natural Language Toolkit, https://www.nltk.org/

[3] TF-IDF, KDnuggets, https://www.kdnuggets.com/2018/08/wtf-tf-idf.html

Resources:

Google Colab Implementation.

GitHub Tutorial Full Code Repository.

