
Natural Language Processing Beginner’s Guide

Last Updated on July 24, 2023 by Editorial Team

Author(s): Davuluri Hemanth Chowdary

Originally published on Towards AI.

Natural Language Processing (image copyright: Wave Accounting)

I previously published an article on Natural Language Processing. Before going through this one, I recommend reading that article first.

NLP Made Easy With These Libraries: What is Natural Language Processing? (medium.com)

Common Steps in Almost Every NLP Problem

  1. Tokenization
  2. Stemming
  3. Lemmatization
  4. POS tags
  5. Named Entity Recognition
  6. Chunking

Let me take you through each of these steps:

To make this article more informative, I have included a few code snippets along the way so that you can follow along.

Installation Guide Here
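If you don't have NLTK installed yet, in most environments it comes down to a single pip command:

pip install nltk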

1. Tokenization

Tokenization is the process of breaking sentences down into individual words (tokens). During this step, punctuation marks are split off from the words into tokens of their own.

Example:

# Import NLTK and download the required data (the 'punkt' tokenizer models)
import nltk
nltk.download('punkt')

# Sentence
text = "Don't forget to follow my profile"

# Import the tokenization function
from nltk.tokenize import word_tokenize

# Tokenization
print(word_tokenize(text))

Output:

['Do', "n't", 'forget', 'to', 'follow', 'my', 'profile']

If you look at the output, the word “Don’t” is split into “Do” and “n’t”. The tokenizer handles contractions like this properly.
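To make this concrete, here is a quick sketch showing that word_tokenize keeps punctuation as separate tokens rather than silently dropping it:

# Tokenize a sentence containing punctuation
from nltk.tokenize import word_tokenize
print(word_tokenize("Hello, world! NLP is fun."))
# ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']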

NOTE:

Before stemming, we need to perform an additional step: stop word removal. In this step, filler words that carry no useful information (such as “to” and “my”) are removed from the sentence.

# Import the required library and download the stop word list
from nltk.corpus import stopwords
nltk.download('stopwords')

# Store all the English stop words in a set
stop_words = set(stopwords.words('english'))

# Tokenize the sentence
tokens = word_tokenize(text)

# Create an empty list to store the filtered words
filtered_sentence = []

# Removal of stop words
for w in tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

# Print the output
print(tokens)
print(filtered_sentence)

Output:

['Do', "n't", 'forget', 'to', 'follow', 'my', 'profile']
['Do', "n't", 'forget', 'follow', 'profile']

The words “to” and “my” are removed from the tokens. If you look closely at the output, there is still one token (“n’t”) that is not useful later on. In cases like this, you can create your own stop word list. Here’s the process:

# Own stop word list
stop_words_own = ["n't", "if", "by"]

# Create an empty list to store the filtered words
filtered_sent = []

# Removal of custom stop words
for w in filtered_sentence:
    if w not in stop_words_own:
        filtered_sent.append(w)

# Print the output
print(filtered_sentence)
print(filtered_sent)

Output:

['Do', "n't", 'forget', 'follow', 'profile']
['Do', 'forget', 'follow', 'profile']
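The same filtering can also be written as a one-line list comprehension. Here is a minimal sketch that combines NLTK's built-in stop words with the custom list defined above:

# Combine the built-in and custom stop word lists
all_stop_words = stop_words | set(stop_words_own)

# Filter the original tokens in a single pass
filtered = [w for w in tokens if w not in all_stop_words]
print(filtered)  # ['Do', 'forget', 'follow', 'profile']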

2. Stemming

Stemming is a word normalization technique: each word is reduced to its root form, usually by stripping suffixes.

Here’s the documentation from the official site on stemming. I will be using the Porter stemmer in the example below.

# Import the required library
from nltk.stem import PorterStemmer

# Initialize the stemmer
ps = PorterStemmer()

# Stem each filtered word
for w in filtered_sent:
    print(ps.stem(w))

Output:

Do
forget
follow
profil

If you look at the output, the word “profile” is stemmed to “profil”, which is not a real word. Each stemmer works differently; all you need to do is pick the stemmer that suits your problem.
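To see that stemmers really do behave differently, here is a small sketch comparing NLTK's Porter and Lancaster stemmers on the same words:

# Compare two stemmers that ship with NLTK
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# The Lancaster stemmer is more aggressive and often cuts words shorter
for word in ["running", "maximum", "profile"]:
    print(word, "->", porter.stem(word), "|", lancaster.stem(word))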

3. Lemmatization

This process is very similar to stemming, but instead of just chopping off suffixes, the lemmatizer maps each word to its dictionary form (its lemma), optionally guided by a part-of-speech hint.

# Import the required library and download the WordNet data
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# A few examples (pos="a" marks the word as an adjective, 'v' as a verb)
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", 'v'))

Output:

cat
cactus
goose
rock
python
good
best
run
run
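To make the difference from stemming visible, here is a quick side-by-side sketch: the stemmer can produce fragments that are not real words, while the lemmatizer returns dictionary forms:

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Compare the stem and the lemma of the same words
for word in ["geese", "cacti", "rocks"]:
    print(word, "->", ps.stem(word), "|", lemmatizer.lemmatize(word))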

4. POS Tags

The process of classifying words into their parts of speech (noun, verb, adjective, and so on) is called POS tagging.

# Download the tagger model and tag the filtered tokens
nltk.download('averaged_perceptron_tagger')
tags = nltk.pos_tag(filtered_sent)
print(tags)

Output:

[('Do', 'NNP'), ('forget', 'VB'), ('follow', 'VB'), ('profile', 'NN')]
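POS tags also pair naturally with the lemmatizer from the previous step. Here is a hedged sketch with a small helper of my own (treebank_to_wordnet is not an NLTK function) that maps Penn Treebank tags onto the pos codes WordNetLemmatizer expects:

from nltk.stem import WordNetLemmatizer

def treebank_to_wordnet(tag):
    # Own helper (not part of NLTK): map Treebank tag prefixes to WordNet pos codes
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun by default

lemmatizer = WordNetLemmatizer()

# Lemmatize each tagged token with a part-of-speech hint
for word, tag in tags:
    print(word, "->", lemmatizer.lemmatize(word, pos=treebank_to_wordnet(tag)))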

5. Named Entity Recognition

Named entity recognition is the process of pulling out entities such as people, organizations, and locations from text.

# Import the required functions and download the NE chunker data
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Sentence
sentence = "Mark and John are working at Google."

# Named entity recognition
print(ne_chunk(pos_tag(word_tokenize(sentence))))

Output:

(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)
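The checklist at the top also mentions chunking (step 6). ne_chunk above is itself a chunker specialized for named entities; for a more general flavor, here is a minimal sketch of noun-phrase chunking with NLTK's RegexpParser and a simple hand-written grammar:

import nltk

# A toy grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

# Tag a sentence and group its tokens into NP chunks
tagged = nltk.pos_tag(nltk.word_tokenize("The little black cat saw a fat rat."))
print(chunker.parse(tagged))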

Conclusion:

I hope you are now familiar with the basic steps common to almost every NLP application. Don’t forget to follow me, because there are many more interesting articles coming up.

Contact:

Feel free to connect with me on LinkedIn Davuluri Hemanth Chowdary


Published via Towards AI
