Demystifying NLP: A Beginner’s Guide to Natural Language Processing Basics and Techniques
Last Updated on November 5, 2023 by Editorial Team
Author(s): Abdulraqib Omotosho
Originally published on Towards AI.
Natural Language Processing (NLP) is an exciting field in Machine Learning that empowers machines to comprehend, interpret, and generate human language. It is the technology that allows computers to read, understand, and respond to human language, whether in the form of text or speech. It's like teaching machines to speak our language and even produce creative, coherent, and contextually relevant responses. In this article, we will delve into the basic terms and fundamentals of the technology that underpins ChatGPT and the plethora of generative AI systems prevalent today.
Key Terms Used Include:
Document
A piece of text, which can range from a single sentence to an entire book.
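For instance, a document can be as small as a single string (a hypothetical example):
document = "NLP lets computers read, understand, and respond to human language."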
Corpus
A collection of documents used for analysis or training in NLP.
corpus = ["This is the first document.", "Another document here.", "And a third one."]
Vocabulary
The set of all unique words in a corpus.
from collections import Counter
corpus = ["This is the first document.", "Another document here.", "And a third one."]
words = ' '.join(corpus).lower().split()
vocabulary = set(words)
print("Vocabulary:", vocabulary)
Vocabulary: {'document.', 'document', 'first', 'a', 'another', 'and', 'third', 'this', 'here.', 'is', 'the', 'one.'}
Segmentation
The process of breaking a text into meaningful segments, like sentences or paragraphs.
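As an illustration, here is a minimal sketch of sentence segmentation using NLTK's sent_tokenize (assuming the punkt sentence tokenizer models have been downloaded; the sample text is made up for this example):
from nltk.tokenize import sent_tokenize
text = "NLP is fascinating. It powers chatbots, search engines, and translators."
sentences = sent_tokenize(text)
print("Sentences:", sentences)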
Tokenization
Breaking text into smaller units, such as words or subwords (tokens).
from nltk.tokenize import word_tokenize
text = "Tokenization is an important step in NLP."
tokens = word_tokenize(text)
print("Tokens:", tokens)
Tokens: ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', '.']
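Note that NLTK's tokenizer models and corpora are downloaded separately from the library itself. If you are running these examples for the first time, a one-time setup along these lines is usually needed (resource names may vary slightly across NLTK versions):
import nltk
nltk.download('punkt')  # tokenizer models for word_tokenize / sent_tokenize
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')  # data for WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # model for pos_tag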
Stopwords
Commonly used words (e.g., ‘and’, ‘the’, ‘is’) that are usually removed in NLP analysis.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print("Stopwords:", stop_words)
Stopwords: {'he', 'once', 'm', 'this', "that'll", 'their', 'the', "she's", "you've", "hadn't", 'mightn', 'down', 'off', 'now', "shouldn't", 't', "isn't", 'that', "didn't", 'wasn', 'do', 'shan', 'yourselves', 'a', 've', 'themselves', 'out', "don't", 'hasn', 'than', 'couldn', "mightn't", "you'd", 'further', 'has', 'having', "wouldn't", 'here', 'him', 'from', 'where', 'your', 'these', 'my', 'up', 'so', 'have', 'hadn', "weren't", 'to', 'hers', 'doesn', 'below', "needn't", 're', "you're", 'when', 'whom', 'all', 'is', 'should', 'not', 'were', 'you', 'until', 'doing', "mustn't", "it's", 'because', 'y', 'her', 'both', 'o', 'weren', 'other', 'on', 'his', "shan't", 'why', 'through', 'between', 'am', 'be', 'she', 'more', 'herself', "couldn't", "doesn't", "aren't", 'which', 'won', 'of', 'don', 'some', 'was', 'under', 'few', 'needn', 'ours', 'theirs', 'it', 'aren', 'and', 'own', 'isn', 'about', 'such', 'again', 'its', 'any', 'by', 'mustn', 'had', 's', 'can', 'haven', 'before', 'over', 'those', 'during', 'while', "wasn't", 'we', 'each', 'being', 'then', 'against', 'me', "should've", 'd', 'after', 'didn', 'as', 'll', "haven't", 'wouldn', 'there', 'an', 'been', 'ourselves', "you'll", 'what', 'if', 'in', 'shouldn', 'for', 'with', 'just', 'how', 'who', 'them', 'are', 'but', 'no', 'ain', 'very', 'ma', 'same', 'above', 'into', 'himself', 'did', 'myself', 'most', 'only', 'will', 'our', 'they', 'nor', 'yours', 'at', 'too', "hasn't", 'itself', 'or', "won't", 'does', 'i', 'yourself'}
Stemming
Reducing words to their base or root form (stem).
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
print("Stemmed word:", stemmed_word)
Stemmed word: run
Lemmatization
Reducing words to their base form (lemma) while considering the context, such as the word's part of speech.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
# lemmatize() works on single words, so tokenize the sentence first;
# the pos argument ('v' for verb) gives the lemmatizer the context it needs
sentence = "Raqib loves coding and dancing."
lemmatized_words = [lemmatizer.lemmatize(token, pos='v') for token in word_tokenize(sentence)]
print("Lemmatized words:", lemmatized_words)
POS Tagging (Part-of-Speech Tagging)
Assigning a part-of-speech (e.g., noun, verb, adjective) to each word in a sentence.
from nltk import pos_tag
from nltk.tokenize import word_tokenize
sentence = "The cat is on the mat"
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print("POS Tags:", pos_tags)
POS Tags: [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
('The', 'DT'): 'The' is a determiner (DT).
('cat', 'NN'): 'cat' is a noun (NN).
('is', 'VBZ'): 'is' is a verb, third person singular present (VBZ).
('on', 'IN'): 'on' is a preposition or subordinating conjunction (IN).
('the', 'DT'): 'the' is a determiner (DT).
('mat', 'NN'): 'mat' is a noun (NN).
Bag of Words (BoW)
A representation of text that counts the frequency of words in a document, disregarding grammar and word order.
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["This is the first document.", "Another document here.", "And a third one."]
vectorizer = CountVectorizer()
bow_representation = vectorizer.fit_transform(corpus)
print("Bag of Words representation:\n", bow_representation.toarray())
print("Vocabulary:", vectorizer.get_feature_names())
TF-IDF (Term Frequency-Inverse Document Frequency)
A technique that weighs the importance of a word in a document relative to a corpus, based on how often it appears in that document and how rare it is across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is the first document.", "Another document here.", "And a third one."]
vectorizer = TfidfVectorizer()
tfidf_representation = vectorizer.fit_transform(corpus)
print("TF-IDF representation:\n", tfidf_representation.toarray())
print("Vocabulary:", vectorizer.get_feature_names())
TF-IDF representation:
[[0. 0. 0.35543247 0.46735098 0. 0.46735098
0. 0.46735098 0. 0.46735098]
[0. 0.62276601 0.4736296 0. 0.62276601 0.
0. 0. 0. 0. ]
[0.57735027 0. 0. 0. 0. 0.
0.57735027 0. 0.57735027 0. ]]
Vocabulary: ['and', 'another', 'document', 'first', 'here', 'is', 'one', 'the', 'third', 'this']
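To see where these weights come from, here is a minimal sketch that computes the same matrix by hand, assuming scikit-learn's documented defaults: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["This is the first document.", "Another document here.", "And a third one."]
counts = CountVectorizer().fit_transform(corpus).toarray()  # raw term frequencies
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(np.round(tfidf, 8))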
Text Preprocessing Steps
Now, we will apply these basic NLP techniques to preprocess a sample text string.
Convert text to lowercase.
sample_text = "NLP is an exciting field! It enables machines to comprehend, interpret, and generate human language. Learn more at https://raqibcodes.com. #NLP #MachineLearning @NLPCommunity 2023303"
# Convert text to lowercase
def convert_to_lowercase(text):
    return text.lower()

# Apply the lowercase conversion
lowercased_text = convert_to_lowercase(sample_text)
print("Lowercased Text:", lowercased_text)
Lowercased Text: nlp is an exciting field! it enables machines to comprehend, interpret, and generate human language. learn more at https://raqibcodes.com. #nlp #machinelearning @nlpcommunity 2023303
Remove special characters.
import re

def remove_special_characters(text):
    # Remove URLs, hashtags, mentions, and special characters
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    text = re.sub(r"[^\w\s.]", "", text)
    return text

# Apply removing special characters
text_no_special_chars = remove_special_characters(lowercased_text)
print("Text without special characters:", text_no_special_chars)
Text without special characters: nlp is an exciting field it enables machines to comprehend interpret and generate human language. learn more at 2023303
Remove numbers and digits.
def remove_numbers(text):
    # Remove numbers/digits
    text = re.sub(r'\d+(\.\d+)?', '', text)
    return text

# Apply removing numbers/digits
text_no_numbers = remove_numbers(text_no_special_chars)
print("Text without numbers/digits:", text_no_numbers)
Text without numbers/digits: nlp is an exciting field it enables machines to comprehend interpret and generate human language. learn more at
Tokenize text.
from nltk.tokenize import word_tokenize

def tokenize_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    return tokens

# Apply tokenization
tokens = tokenize_text(text_no_numbers)
print("Tokens:", tokens)
Tokens: ['nlp', 'is', 'an', 'exciting', 'field', 'it', 'enables', 'machines', 'to', 'comprehend', 'interpret', 'and', 'generate', 'human', 'language', '.', 'learn', 'more', 'at']
Remove stopwords.
from nltk.corpus import stopwords

def remove_stopwords(tokens):
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

# Apply removing stop words
tokens_no_stopwords = remove_stopwords(tokens)
print("Tokens without stop words:", tokens_no_stopwords)
Tokens without stop words: ['nlp', 'exciting', 'field', 'enables', 'machines', 'comprehend', 'interpret', 'generate', 'human', 'language', '.', 'learn']
Lemmatize words.
from nltk.stem import WordNetLemmatizer

def lemmatize_words(tokens):
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply lemmatization
lemmatized_tokens = lemmatize_words(tokens_no_stopwords)
print("Lemmatized Tokens:", lemmatized_tokens)
Lemmatized Tokens: ['nlp', 'exciting', 'field', 'enables', 'machine', 'comprehend', 'interpret', 'generate', 'human', 'language', '.', 'learn']
Apply Stemming.
from nltk.stem import PorterStemmer

def apply_stemming(tokens):
    # Apply stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Apply stemming
stemmed_tokens = apply_stemming(lemmatized_tokens)
print("Stemmed Tokens:", stemmed_tokens)
Stemmed Tokens: ['nlp', 'excit', 'field', 'enabl', 'machin', 'comprehend', 'interpret', 'gener', 'human', 'languag', '.', 'learn']
Join tokens back into a single string.
def join_tokens(tokens):
    # Join tokens back into a single string
    return ' '.join(tokens)

# Apply joining tokens into a single string
preprocessed_text = join_tokens(lemmatized_tokens)
print("Preprocessed Text:", preprocessed_text)
Preprocessed Text: nlp exciting field enables machine comprehend interpret generate human language . learn
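Putting it all together, here is a minimal sketch that chains the helper functions defined above into one preprocessing pipeline, in the same order as this walkthrough (stemming is omitted because the walkthrough joins the lemmatized tokens):
def preprocess(text):
    # Chain the preprocessing helpers defined above
    text = convert_to_lowercase(text)
    text = remove_special_characters(text)
    text = remove_numbers(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize_words(tokens)
    return join_tokens(tokens)

print("Pipeline output:", preprocess(sample_text))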
Apply POS tagging.
from nltk import pos_tag

def pos_tagging(tokens):
    # Perform POS tagging
    pos_tags = pos_tag(tokens)
    return pos_tags

# Apply POS tagging
pos_tags = pos_tagging(lemmatized_tokens)
print("POS Tags:", pos_tags)
POS Tags: [('nlp', 'RB'), ('exciting', 'JJ'), ('field', 'NN'), ('enables', 'NNS'), ('machine', 'NN'), ('comprehend', 'VBP'), ('interpret', 'JJ'), ('generate', 'NN'), ('human', 'JJ'), ('language', 'NN'), ('.', '.'), ('learn', 'VB')]
Meaning of tags
('nlp', 'RB'): 'nlp' is tagged as an adverb (RB).
('exciting', 'JJ'): 'exciting' is an adjective (JJ).
('field', 'NN'): 'field' is a noun (NN).
('enables', 'NNS'): 'enables' is tagged as a plural noun (NNS).
('machine', 'NN'): 'machine' is a noun (NN).
('comprehend', 'VBP'): 'comprehend' is a verb, non-third person singular present (VBP).
('interpret', 'JJ'): 'interpret' is tagged as an adjective (JJ).
('generate', 'NN'): 'generate' is tagged as a noun (NN).
('human', 'JJ'): 'human' is an adjective (JJ).
('language', 'NN'): 'language' is a noun (NN).
('.', '.'): '.' denotes a punctuation mark (period).
('learn', 'VB'): 'learn' is a verb in the base form (VB).
Apply Bag of Words Representation.
from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words_representation(text):
    # Initialize CountVectorizer
    vectorizer = CountVectorizer()
    # Transform the text into a Bag of Words representation
    bow_representation = vectorizer.fit_transform([text])
    return bow_representation, vectorizer

# Apply Bag of Words representation
bow_representation, vectorizer = bag_of_words_representation(preprocessed_text)
print("Bag of Words representation:")
print(bow_representation.toarray())
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
Bag of Words representation:
[[1 1 1 1 1 1 1 1 1 1 1]]
Vocabulary: ['comprehend', 'enables', 'exciting', 'field', 'generate', 'human', 'interpret', 'language', 'learn', 'machine', 'nlp']
Apply TF-IDF representation.
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_representation(text):
    # Initialize TfidfVectorizer
    vectorizer = TfidfVectorizer()
    # Transform the text into a TF-IDF representation
    tfidf_representation = vectorizer.fit_transform([text])
    return tfidf_representation, vectorizer

# Apply TF-IDF representation
tfidf_representation, tfidf_vectorizer = tfidf_representation(preprocessed_text)
print("\nTF-IDF representation:")
print(tfidf_representation.toarray())
print("Vocabulary:", list(tfidf_vectorizer.get_feature_names_out()))
TF-IDF representation:
[[0.30151134 0.30151134 0.30151134 0.30151134 0.30151134 0.30151134
0.30151134 0.30151134 0.30151134 0.30151134 0.30151134]]
Vocabulary: ['comprehend', 'enables', 'exciting', 'field', 'generate', 'human', 'interpret', 'language', 'learn', 'machine', 'nlp']
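Note that every weight here is identical (0.30151134, which is 1/sqrt(11)): with only a single document passed to the vectorizer, all eleven terms occur once and share the same document frequency, so after L2 normalization each receives the same score. TF-IDF weights only become informative when the vectorizer is fitted on a corpus of several documents, as in the earlier example.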
Natural Language Processing (NLP) is the bridge between human language and machines. In this article, we’ve uncovered fundamental terms like ‘corpus,’ ‘vocabulary,’ and key techniques including ‘tokenization,’ ‘lemmatization,’ and ‘POS tagging.’ These are the building blocks for advanced NLP applications, pushing AI towards more human-like interactions.
Thanks for reading 🤓. You can check out my GitHub repo to review an NLP project in which I applied all of the above techniques and more advanced ones. Cheers!