Demystifying NLP: A Beginner’s Guide to Natural Language Processing Basics and Techniques

Last Updated on November 5, 2023 by Editorial Team

Author(s): Abdulraqib Omotosho

Originally published on Towards AI.


Natural Language Processing (NLP) is an exciting field of Machine Learning that empowers machines to comprehend, interpret, and generate human language. It is, in essence, the technology that allows computers to read, understand, and respond to human language — whether text or speech. It’s like teaching machines to speak our language and even produce creative, coherent, and contextually relevant responses. In this article, we will delve into the basic terms and fundamentals of the technology that underpins ChatGPT and the many generative AI systems prevalent today.

Key Terms

Document

A piece of text, which can range from a single sentence to an entire book.
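In code, a document is typically represented as a plain string (the variable below is purely illustrative):

document = "NLP empowers machines to understand human language."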

Corpus

A collection of documents used for analysis or training in NLP.

corpus = ["This is the first document.", "Another document here.", "And a third one."]

Vocabulary

The set of all unique words in a corpus.

import re

corpus = ["This is the first document.", "Another document here.", "And a third one."]
# Extract word tokens (dropping punctuation) before building the vocabulary,
# so that "document." and "document" count as the same word
words = re.findall(r"\w+", " ".join(corpus).lower())
vocabulary = set(words)
print("Vocabulary:", sorted(vocabulary))
Vocabulary: ['a', 'and', 'another', 'document', 'first', 'here', 'is', 'one', 'the', 'third', 'this']

Segmentation

The process of breaking a text into meaningful segments, like sentences or paragraphs.
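A minimal sketch of sentence segmentation using NLTK's sent_tokenize (it requires the punkt model, available via nltk.download('punkt')):

from nltk.tokenize import sent_tokenize

text = "NLP is fascinating. It powers chatbots and search engines! Ready to dive in?"
sentences = sent_tokenize(text)
print("Sentences:", sentences)
Sentences: ['NLP is fascinating.', 'It powers chatbots and search engines!', 'Ready to dive in?']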

Tokenization

Breaking text into smaller units, such as words or subwords (tokens).

from nltk.tokenize import word_tokenize

text = "Tokenization is an important step in NLP."
tokens = word_tokenize(text)
print("Tokens:", tokens)
Tokens: ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', '.']
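The definition also mentions subwords: the tokenizers used by modern LLMs split rare words into smaller pieces. A sketch using the Hugging Face transformers library (an extra dependency; the exact splits depend on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization is important."))
# Typically: ['token', '##ization', 'is', 'important', '.']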

Stopwords

Commonly used words (e.g., ‘and’, ‘the’, ‘is’) that are usually removed in NLP analysis.

from nltk.corpus import stopwords

# Requires the stopwords corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
print("Stopwords:", stop_words)
Stopwords: {'he', 'once', 'm', 'this', "that'll", 'their', 'the', "she's", "you've", "hadn't", 'mightn', 'down', 'off', 'now', "shouldn't", 't', "isn't", 'that', "didn't", 'wasn', 'do', 'shan', 'yourselves', 'a', 've', 'themselves', 'out', "don't", 'hasn', 'than', 'couldn', "mightn't", "you'd", 'further', 'has', 'having', "wouldn't", 'here', 'him', 'from', 'where', 'your', 'these', 'my', 'up', 'so', 'have', 'hadn', "weren't", 'to', 'hers', 'doesn', 'below', "needn't", 're', "you're", 'when', 'whom', 'all', 'is', 'should', 'not', 'were', 'you', 'until', 'doing', "mustn't", "it's", 'because', 'y', 'her', 'both', 'o', 'weren', 'other', 'on', 'his', "shan't", 'why', 'through', 'between', 'am', 'be', 'she', 'more', 'herself', "couldn't", "doesn't", "aren't", 'which', 'won', 'of', 'don', 'some', 'was', 'under', 'few', 'needn', 'ours', 'theirs', 'it', 'aren', 'and', 'own', 'isn', 'about', 'such', 'again', 'its', 'any', 'by', 'mustn', 'had', 's', 'can', 'haven', 'before', 'over', 'those', 'during', 'while', "wasn't", 'we', 'each', 'being', 'then', 'against', 'me', "should've", 'd', 'after', 'didn', 'as', 'll', "haven't", 'wouldn', 'there', 'an', 'been', 'ourselves', "you'll", 'what', 'if', 'in', 'shouldn', 'for', 'with', 'just', 'how', 'who', 'them', 'are', 'but', 'no', 'ain', 'very', 'ma', 'same', 'above', 'into', 'himself', 'did', 'myself', 'most', 'only', 'will', 'our', 'they', 'nor', 'yours', 'at', 'too', "hasn't", 'itself', 'or', "won't", 'does', 'i', 'yourself'}

Stemming

Reducing words to their base or root form (stem).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
print("Stemmed word:", stemmed_word)

Stemmed word: run

Lemmatization

Reducing words to their dictionary base form (lemma) while taking context, such as the word’s part of speech, into account.

from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer works on individual words, not whole sentences,
# and requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print("Lemmatized word:", lemmatizer.lemmatize("mice"))             # noun by default
print("Lemmatized word:", lemmatizer.lemmatize("running", pos="v")) # treated as a verb
Lemmatized word: mouse
Lemmatized word: run

POS Tagging (Part-of-Speech Tagging)

Assigning a part-of-speech (e.g., noun, verb, adjective) to each word in a sentence.

from nltk import pos_tag
from nltk.tokenize import word_tokenize

sentence = "The cat is on the mat"
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print("POS Tags:", pos_tags)
POS Tags: [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
  • ('The', 'DT'): 'The' is a determiner (DT).
  • ('cat', 'NN'): 'cat' is a noun (NN).
  • ('is', 'VBZ'): 'is' is a verb, third person singular present (VBZ).
  • ('on', 'IN'): 'on' is a preposition or subordinating conjunction (IN).
  • ('the', 'DT'): 'the' is a determiner (DT).
  • ('mat', 'NN'): 'mat' is a noun (NN).

Bag of Words (BoW)

A representation of text that counts the frequency of words in a document, disregarding grammar and word order.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "Another document here.", "And a third one."]
vectorizer = CountVectorizer()
bow_representation = vectorizer.fit_transform(corpus)
print("Bag of Words representation:\n", bow_representation.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
Bag of Words representation:
[[0 0 1 1 0 1 0 1 0 1]
[0 1 1 0 1 0 0 0 0 0]
[1 0 0 0 0 0 1 0 1 0]]
Vocabulary: ['and', 'another', 'document', 'first', 'here', 'is', 'one', 'the', 'third', 'this']

TF-IDF (Term Frequency-Inverse Document Frequency)

A technique to weigh the importance of words in a document relative to a corpus: terms that are frequent in a document but rare across the corpus receive the highest weights.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is the first document.", "Another document here.", "And a third one."]
vectorizer = TfidfVectorizer()
tfidf_representation = vectorizer.fit_transform(corpus)
print("TF-IDF representation:\n", tfidf_representation.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
TF-IDF representation:
[[0. 0. 0.35543247 0.46735098 0. 0.46735098
0. 0.46735098 0. 0.46735098]
[0. 0.62276601 0.4736296 0. 0.62276601 0.
0. 0. 0. 0. ]
[0.57735027 0. 0. 0. 0. 0.
0.57735027 0. 0.57735027 0. ]]
Vocabulary: ['and', 'another', 'document', 'first', 'here', 'is', 'one', 'the', 'third', 'this']
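To see where a value like 0.3554 comes from: with its default settings, scikit-learn computes tf-idf(t, d) = tf(t, d) · idf(t), where the smoothed idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. Here is a quick sanity check for 'document' in the first document — a sketch of the default formula, not a replacement for the library:

import math

n = 3  # number of documents in the corpus

def idf(df):
    # scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

# Terms of "This is the first document." with their document frequencies
row = {"document": idf(2), "first": idf(1), "is": idf(1), "the": idf(1), "this": idf(1)}
norm = math.sqrt(sum(v ** 2 for v in row.values()))  # L2 norm of the row
print(round(row["document"] / norm, 8))  # 0.35543247, matching the matrix above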

Text Preprocessing Steps

Now, let’s apply these techniques step by step to preprocess a sample text string.

Convert text to lowercase.

sample_text = "NLP is an exciting field! It enables machines to comprehend, interpret, and generate human language. Learn more at https://raqibcodes.com. #NLP #MachineLearning @NLPCommunity 2023303"

# Convert text to lowercase
def convert_to_lowercase(text):
    return text.lower()

# Apply the lowercase conversion
lowercased_text = convert_to_lowercase(sample_text)
print("Lowercased Text:", lowercased_text)
Lowercased Text: nlp is an exciting field! it enables machines to comprehend, interpret, and generate human language. learn more at https://raqibcodes.com. #nlp #machinelearning @nlpcommunity 2023303

Remove special characters.

import re

def remove_special_characters(text):
    # Remove URLs, hashtags, and mentions, then any remaining special characters
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    text = re.sub(r"[^\w\s.]", "", text)
    return text

# Apply removing special characters
text_no_special_chars = remove_special_characters(lowercased_text)
print("Text without special characters:", text_no_special_chars)
Text without special characters: nlp is an exciting field it enables machines to comprehend interpret and generate human language. learn more at 2023303

Remove numbers and digits.

def remove_numbers(text):
    # Remove numbers/digits
    text = re.sub(r'\d+(\.\d+)?', '', text)
    return text

# Apply removing numbers/digits
text_no_numbers = remove_numbers(text_no_special_chars)
print("Text without numbers/digits:", text_no_numbers)
Text without numbers/digits: nlp is an exciting field it enables machines to comprehend interpret and generate human language. learn more at 

Tokenize text.

from nltk.tokenize import word_tokenize

def tokenize_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    return tokens

# Apply tokenization
tokens = tokenize_text(text_no_numbers)
print("Tokens:", tokens)
Tokens: ['nlp', 'is', 'an', 'exciting', 'field', 'it', 'enables', 'machines', 'to', 'comprehend', 'interpret', 'and', 'generate', 'human', 'language', '.', 'learn', 'more', 'at']

Remove stopwords.

from nltk.corpus import stopwords

def remove_stopwords(tokens):
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

# Apply removing stop words
tokens_no_stopwords = remove_stopwords(tokens)
print("Tokens without stop words:", tokens_no_stopwords)
Tokens without stop words: ['nlp', 'exciting', 'field', 'enables', 'machines', 'comprehend', 'interpret', 'generate', 'human', 'language', '.', 'learn']

Lemmatize words.

from nltk.stem import WordNetLemmatizer

def lemmatize_words(tokens):
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply lemmatization
lemmatized_tokens = lemmatize_words(tokens_no_stopwords)
print("Lemmatized Tokens:", lemmatized_tokens)
Lemmatized Tokens: ['nlp', 'exciting', 'field', 'enables', 'machine', 'comprehend', 'interpret', 'generate', 'human', 'language', '.', 'learn']

Apply stemming.

from nltk.stem import PorterStemmer

def apply_stemming(tokens):
    # Apply stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Apply stemming
stemmed_tokens = apply_stemming(lemmatized_tokens)
print("Stemmed Tokens:", stemmed_tokens)
Stemmed Tokens: ['nlp', 'excit', 'field', 'enabl', 'machin', 'comprehend', 'interpret', 'gener', 'human', 'languag', '.', 'learn']

Join tokens back into a single string.

def join_tokens(tokens):
    # Join tokens back into a single string
    return ' '.join(tokens)

# Join the lemmatized tokens (rather than the harsher stemmed versions) into a string
preprocessed_text = join_tokens(lemmatized_tokens)
print("Preprocessed Text:", preprocessed_text)
Preprocessed Text: nlp exciting field enables machine comprehend interpret generate human language . learn
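For convenience, the individual steps can be chained into a single helper. This is a minimal sketch that assumes all of the functions defined above are in scope:

def preprocess(text):
    # Chain the preprocessing steps defined in this section
    text = convert_to_lowercase(text)
    text = remove_special_characters(text)
    text = remove_numbers(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize_words(tokens)
    return join_tokens(tokens)

print("Preprocessed Text:", preprocess(sample_text))
Preprocessed Text: nlp exciting field enables machine comprehend interpret generate human language . learn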

Apply POS tagging.

from nltk import pos_tag

def pos_tagging(tokens):
    # Perform POS tagging
    pos_tags = pos_tag(tokens)
    return pos_tags

# Apply POS tagging
pos_tags = pos_tagging(lemmatized_tokens)
print("POS Tags:", pos_tags)
POS Tags: [('nlp', 'RB'), ('exciting', 'JJ'), ('field', 'NN'), ('enables', 'NNS'), ('machine', 'NN'), ('comprehend', 'VBP'), ('interpret', 'JJ'), ('generate', 'NN'), ('human', 'JJ'), ('language', 'NN'), ('.', '.'), ('learn', 'VB')]

Meaning of tags

  • ('nlp', 'RB'): 'nlp' is tagged as an adverb (RB).
  • ('exciting', 'JJ'): 'exciting' is an adjective (JJ).
  • ('field', 'NN'): 'field' is a noun (NN).
  • ('enables', 'NNS'): 'enables' is a plural noun (NNS).
  • ('machine', 'NN'): 'machine' is a noun (NN).
  • ('comprehend', 'VBP'): 'comprehend' is a verb, non-3rd-person singular present (VBP).
  • ('interpret', 'JJ'): 'interpret' is an adjective (JJ).
  • ('generate', 'NN'): 'generate' is a noun (NN).
  • ('human', 'JJ'): 'human' is an adjective (JJ).
  • ('language', 'NN'): 'language' is a noun (NN).
  • ('.', '.'): '.' denotes a punctuation mark (period).
  • ('learn', 'VB'): 'learn' is a verb in the base form (VB).

Note that several of these tags are wrong ('nlp' as an adverb, 'enables' as a plural noun, 'interpret' as an adjective): lowercasing, stopword removal, and lemmatization strip away the sentence structure the tagger relies on. In practice, POS tagging works best on raw, unprocessed sentences.

Apply Bag of Words Representation.

from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words_representation(text):
    # Initialize CountVectorizer
    vectorizer = CountVectorizer()
    # Transform the text into a Bag of Words representation
    bow_representation = vectorizer.fit_transform([text])
    return bow_representation, vectorizer

# Apply Bag of Words representation
bow_representation, vectorizer = bag_of_words_representation(preprocessed_text)
print("Bag of Words representation:")
print(bow_representation.toarray())
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
Bag of Words representation:
[[1 1 1 1 1 1 1 1 1 1 1]]
Vocabulary: ['comprehend', 'enables', 'exciting', 'field', 'generate', 'human', 'interpret', 'language', 'learn', 'machine', 'nlp']

Apply TF-IDF representation.

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_representation(text):
    # Initialize TfidfVectorizer
    vectorizer = TfidfVectorizer()
    # Transform the text into a TF-IDF representation
    tfidf_representation = vectorizer.fit_transform([text])
    return tfidf_representation, vectorizer

# Apply TF-IDF representation
tfidf_representation, tfidf_vectorizer = tfidf_representation(preprocessed_text)
print("\nTF-IDF representation:")
print(tfidf_representation.toarray())
print("Vocabulary:", list(tfidf_vectorizer.get_feature_names_out()))
TF-IDF representation:
[[0.30151134 0.30151134 0.30151134 0.30151134 0.30151134 0.30151134
0.30151134 0.30151134 0.30151134 0.30151134 0.30151134]]
Vocabulary: ['comprehend', 'enables', 'exciting', 'field', 'generate', 'human', 'interpret', 'language', 'learn', 'machine', 'nlp']

With a single document, every term shares the same idf, so all 11 terms receive the same weight (1/√11 ≈ 0.3015). TF-IDF only becomes informative when the corpus contains multiple documents.

Natural Language Processing (NLP) is the bridge between human language and machines. In this article, we’ve uncovered fundamental terms like ‘corpus,’ ‘vocabulary,’ and key techniques including ‘tokenization,’ ‘lemmatization,’ and ‘POS tagging.’ These are the building blocks for advanced NLP applications, pushing AI towards more human-like interactions.

Thanks for reading 🤓. You can check out my GitHub repo, where you can review an NLP project in which I applied all of the above techniques and more advanced ones. Cheers!


Published via Towards AI
