
Natural Language Processing Beginner’s Guide

Last Updated on July 24, 2023 by Editorial Team

Author(s): Davuluri Hemanth Chowdary

Originally published on Towards AI.

Natural Language Processing (image copyright: Wave Accounting)

I previously published an article on Natural Language Processing. Before going through this one, I recommend reading that article first.

NLP Made Easy With These Libraries: What is Natural Language Processing? (medium.com)

Common Steps in Almost Every NLP Problem

  1. Tokenization
  2. Stemming
  3. Lemmatization
  4. POS tags
  5. Named Entity Recognition
  6. Chunking

Let me take you through each of these steps:

To make this article more informative, I have included a few code snippets along the way so that you can follow along.

Installation Guide Here
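If you don't have NLTK installed yet, in most environments it comes down to a single pip command:

pip install nltk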

1. Tokenization

Tokenization is the process of breaking sentences down into individual words (tokens). During this step, punctuation marks are split off from the words into tokens of their own.

Example:

# Import NLTK and download the required data (the 'punkt' tokenizer models)
import nltk
nltk.download('punkt')

# Sentence
text = "Don't forget to follow my profile"

# Import the tokenization function
from nltk.tokenize import word_tokenize

# Tokenization
print(word_tokenize(text))

Output:

['Do', "n't", 'forget', 'to', 'follow', 'my', 'profile']

If you look at the output, the word “Don’t” is split into “Do” and “n’t”. The tokenizer handles contractions like this properly.
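To make this concrete, here is a quick sketch showing that word_tokenize keeps punctuation as separate tokens rather than silently dropping it:

# Tokenize a sentence containing punctuation
from nltk.tokenize import word_tokenize
print(word_tokenize("Hello, world! NLP is fun."))
# ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']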

NOTE:

Before stemming, we need to perform an additional step: stop word removal. In this step, filler words that carry no useful information (such as “to” and “my”) are removed from the sentence.

# Import the required library and download the stop word list
from nltk.corpus import stopwords
nltk.download('stopwords')

# Store all the English stop words in a set
stop_words = set(stopwords.words('english'))

# Tokenize the sentence
tokens = word_tokenize(text)

# Create an empty list to store the filtered words
filtered_sentence = []

# Removal of stop words
for w in tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

# Print the output
print(tokens)
print(filtered_sentence)

Output:

['Do', "n't", 'forget', 'to', 'follow', 'my', 'profile']
['Do', "n't", 'forget', 'follow', 'profile']

The words “to” and “my” are removed from the tokens. If you look closely at the output, there is still one token (“n’t”) that is not useful later on. In cases like this, you can create your own stop word list. Here’s the process:

# Own stop word list
stop_words_own = ["n't", "if", "by"]

# Create an empty list to store the filtered words
filtered_sent = []

# Removal of custom stop words
for w in filtered_sentence:
    if w not in stop_words_own:
        filtered_sent.append(w)

# Print the output
print(filtered_sentence)
print(filtered_sent)

Output:

['Do', "n't", 'forget', 'follow', 'profile']
['Do', 'forget', 'follow', 'profile']
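The same filtering can also be written as a one-line list comprehension. Here is a minimal sketch that combines NLTK's built-in stop words with the custom list defined above:

# Combine the built-in and custom stop word lists
all_stop_words = stop_words | set(stop_words_own)

# Filter the original tokens in a single pass
filtered = [w for w in tokens if w not in all_stop_words]
print(filtered)  # ['Do', 'forget', 'follow', 'profile']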

2. Stemming

Stemming is a word normalization technique: each word is reduced to its root form, usually by stripping suffixes.

Here’s the documentation from the official site on stemming. I will be using the Porter stemmer in the example below.

# Import the required library
from nltk.stem import PorterStemmer

# Initialize the stemmer
ps = PorterStemmer()

# Stem each filtered word
for w in filtered_sent:
    print(ps.stem(w))

Output:

Do
forget
follow
profil

If you look at the output, the word “profile” is stemmed to “profil”, which is not a real word. Each stemmer works differently; all you need to do is pick the stemmer that suits your problem.
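To see that stemmers really do behave differently, here is a small sketch comparing NLTK's Porter and Lancaster stemmers on the same words:

# Compare two stemmers that ship with NLTK
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# The Lancaster stemmer is more aggressive and often cuts words shorter
for word in ["running", "maximum", "profile"]:
    print(word, "->", porter.stem(word), "|", lancaster.stem(word))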

3. Lemmatization

This process is very similar to stemming, but instead of just chopping off suffixes, the lemmatizer maps each word to its dictionary form (its lemma), optionally guided by a part-of-speech hint.

# Import the required library and download the WordNet data
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# A few examples (pos="a" marks the word as an adjective, 'v' as a verb)
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", 'v'))

Output:

cat
cactus
goose
rock
python
good
best
run
run
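To make the difference from stemming visible, here is a quick side-by-side sketch: the stemmer can produce fragments that are not real words, while the lemmatizer returns dictionary forms:

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Compare the stem and the lemma of the same words
for word in ["geese", "cacti", "rocks"]:
    print(word, "->", ps.stem(word), "|", lemmatizer.lemmatize(word))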

4. POS Tags

The process of classifying words into their parts of speech (noun, verb, adjective, and so on) is called POS tagging.

# Download the tagger model and tag the filtered tokens
nltk.download('averaged_perceptron_tagger')
tags = nltk.pos_tag(filtered_sent)
print(tags)

Output:

[('Do', 'NNP'), ('forget', 'VB'), ('follow', 'VB'), ('profile', 'NN')]
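POS tags also pair naturally with the lemmatizer from the previous step. Here is a hedged sketch with a small helper of my own (treebank_to_wordnet is not an NLTK function) that maps Penn Treebank tags onto the pos codes WordNetLemmatizer expects:

from nltk.stem import WordNetLemmatizer

def treebank_to_wordnet(tag):
    # Own helper (not part of NLTK): map Treebank tag prefixes to WordNet pos codes
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun by default

lemmatizer = WordNetLemmatizer()

# Lemmatize each tagged token with a part-of-speech hint
for word, tag in tags:
    print(word, "->", lemmatizer.lemmatize(word, pos=treebank_to_wordnet(tag)))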

5. Named Entity Recognition

Named entity recognition is the process of pulling out entities such as people, organizations, and locations from text.

# Import the required functions and download the NE chunker data
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Sentence
sentence = "Mark and John are working at Google."

# Named entity recognition
print(ne_chunk(pos_tag(word_tokenize(sentence))))

Output:

(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)
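The checklist at the top also mentions chunking (step 6). ne_chunk above is itself a chunker specialized for named entities; for a more general flavor, here is a minimal sketch of noun-phrase chunking with NLTK's RegexpParser and a simple hand-written grammar:

import nltk

# A toy grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

# Tag a sentence and group its tokens into NP chunks
tagged = nltk.pos_tag(nltk.word_tokenize("The little black cat saw a fat rat."))
print(chunker.parse(tagged))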

Conclusion:

I hope you are now familiar with the basic steps common to almost every NLP application. Don’t forget to follow me, because there are many more interesting articles coming up.

Contact:

Feel free to connect with me on LinkedIn Davuluri Hemanth Chowdary


Published via Towards AI
