Demystifying NLP: A Beginner’s Guide to Natural Language Processing Basics and Techniques

Last Updated on November 5, 2023 by Editorial Team

Author(s): Abdulraqib Omotosho

Originally published on Towards AI.


Natural Language Processing (NLP) is an exciting field of Machine Learning that empowers machines to comprehend, interpret, and generate human language. It is, in essence, the technology that allows computers to read, understand, and respond to human language — whether text or speech. It’s like teaching machines to speak our language and even produce creative, coherent, and contextually relevant responses. In this article, we will delve into the basic terms and fundamentals of the technology that underpins ChatGPT and the many generative AI systems prevalent today.

Key Terms

Document

A piece of text, which can range from a single sentence to an entire book.
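In code, a document is typically represented as a plain string (the variable below is purely illustrative):

document = "NLP empowers machines to understand human language."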

Corpus

A collection of documents used for analysis or training in NLP.

corpus = ["This is the first document.", "Another document here.", "And a third one."]

Vocabulary

The set of all unique words in a corpus.

import re

corpus = ["This is the first document.", "Another document here.", "And a third one."]
# Extract word tokens (dropping punctuation) before building the vocabulary,
# so that "document." and "document" count as the same word
words = re.findall(r"\w+", " ".join(corpus).lower())
vocabulary = set(words)
print("Vocabulary:", sorted(vocabulary))
Vocabulary: ['a', 'and', 'another', 'document', 'first', 'here', 'is', 'one', 'the', 'third', 'this']

Segmentation

The process of breaking a text into meaningful segments, like sentences or paragraphs.
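A minimal sketch of sentence segmentation using NLTK's sent_tokenize (it requires the punkt model, available via nltk.download('punkt')):

from nltk.tokenize import sent_tokenize

text = "NLP is fascinating. It powers chatbots and search engines! Ready to dive in?"
sentences = sent_tokenize(text)
print("Sentences:", sentences)
Sentences: ['NLP is fascinating.', 'It powers chatbots and search engines!', 'Ready to dive in?']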

Tokenization

Breaking text into smaller units, such as words or subwords (tokens).

from nltk.tokenize import word_tokenize

text = "Tokenization is an important step in NLP."
tokens = word_tokenize(text)
print("Tokens:", tokens)
Tokens: ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', '.']
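The definition also mentions subwords: the tokenizers used by modern LLMs split rare words into smaller pieces. A sketch using the Hugging Face transformers library (an extra dependency; the exact splits depend on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization is important."))
# Typically: ['token', '##ization', 'is', 'important', '.']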

Stopwords

Commonly used words (e.g., ‘and’, ‘the’, ‘is’) that are usually removed in NLP analysis.

from nltk.corpus import stopwords

# Requires the stopwords corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
print("Stopwords:", stop_words)
Stopwords: {'he', 'once', 'm', 'this', "that'll", 'their', 'the', "she's", "you've", "hadn't", 'mightn', 'down', 'off', 'now', "shouldn't", 't', "isn't", 'that', "didn't", 'wasn', 'do', 'shan', 'yourselves', 'a', 've', 'themselves', 'out', "don't", 'hasn', 'than', 'couldn', "mightn't", "you'd", 'further', 'has', 'having', "wouldn't", 'here', 'him', 'from', 'where', 'your', 'these', 'my', 'up', 'so', 'have', 'hadn', "weren't", 'to', 'hers', 'doesn', 'below', "needn't", 're', "you're", 'when', 'whom', 'all', 'is', 'should', 'not', 'were', 'you', 'until', 'doing', "mustn't", "it's", 'because', 'y', 'her', 'both', 'o', 'weren', 'other', 'on', 'his', "shan't", 'why', 'through', 'between', 'am', 'be', 'she', 'more', 'herself', "couldn't", "doesn't", "aren't", 'which', 'won', 'of', 'don', 'some', 'was', 'under', 'few', 'needn', 'ours', 'theirs', 'it', 'aren', 'and', 'own', 'isn', 'about', 'such', 'again', 'its', 'any', 'by', 'mustn', 'had', 's', 'can', 'haven', 'before', 'over', 'those', 'during', 'while', "wasn't", 'we', 'each', 'being', 'then', 'against', 'me', "should've", 'd', 'after', 'didn', 'as', 'll', "haven't", 'wouldn', 'there', 'an', 'been', 'ourselves', "you'll", 'what', 'if', 'in', 'shouldn', 'for', 'with', 'just', 'how', 'who', 'them', 'are', 'but', 'no', 'ain', 'very', 'ma', 'same', 'above', 'into', 'himself', 'did', 'myself', 'most', 'only', 'will', 'our', 'they', 'nor', 'yours', 'at', 'too', "hasn't", 'itself', 'or', "won't", 'does', 'i', 'yourself'}

Stemming

Reducing words to their base or root form (stem).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
print("Stemmed word:", stemmed_word)

Stemmed word: run

Lemmatization

Reducing words to their dictionary base form (lemma) while taking context, such as the word’s part of speech, into account.

from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer works on individual words, not whole sentences,
# and requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print("Lemmatized word:", lemmatizer.lemmatize("mice"))             # noun by default
print("Lemmatized word:", lemmatizer.lemmatize("running", pos="v")) # treated as a verb
Lemmatized word: mouse
Lemmatized word: run

POS Tagging (Part-of-Speech Tagging)

Assigning a part-of-speech (e.g., noun, verb, adjective) to each word in a sentence.

from nltk import pos_tag
from nltk.tokenize import word_tokenize

sentence = "The cat is on the mat"
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print("POS Tags:", pos_tags)
POS Tags: [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
  • ('The', 'DT'): 'The' is a determiner (DT).
  • ('cat', 'NN'): 'cat' is a noun (NN).
  • ('is', 'VBZ'): 'is' is a verb, third person singular present (VBZ).
  • ('on', 'IN'): 'on' is a preposition or subordinating conjunction (IN).
  • ('the', 'DT'): 'the' is a determiner (DT).
  • ('mat', 'NN'): 'mat' is a noun (NN).

Bag of Words (BoW)

A representation of text that counts the frequency of words in a document, disregarding grammar and word order.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "Another document here.", "And a third one."]
vectorizer = CountVectorizer()
bow_representation = vectorizer.fit_transform(corpus)
print("Bag of Words representation:\n", bow_representation.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
Bag of Words representation:
[[0 0 1 1 0 1 0 1 0 1]
[0 1 1 0 1 0 0 0 0 0]
[1 0 0 0 0 0 1 0 1 0]]
Vocabulary: ['and', 'another', 'document', 'first', 'here', 'is', 'one', 'the', 'third', 'this']

TF-IDF (Term Frequency-Inverse Document Frequency)

A technique to weigh the importance of words in a document relative to a corpus: terms that are frequent in a document but rare across the corpus receive the highest weights.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is the first document.", "Another document here.", "And a third one."]
vectorizer = TfidfVectorizer()
tfidf_representation = vectorizer.fit_transform(corpus)
print("TF-IDF representation:\n", tfidf_representation.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
TF-IDF representation:
[[0. 0. 0.35543247 0.46735098 0. 0.46735098
0. 0.46735098 0. 0.46735098]
[0. 0.62276601 0.4736296 0. 0.62276601 0.
0. 0. 0. 0. ]
[0.57735027 0. 0. 0. 0. 0.
0.57735027 0. 0.57735027 0. ]]
Vocabulary: ['and', 'another', 'document', 'first', 'here', 'is', 'one', 'the', 'third', 'this']
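To see where a value like 0.3554 comes from: with its default settings, scikit-learn computes tf-idf(t, d) = tf(t, d) · idf(t), where the smoothed idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. Here is a quick sanity check for 'document' in the first document — a sketch of the default formula, not a replacement for the library:

import math

n = 3  # number of documents in the corpus

def idf(df):
    # scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

# Terms of "This is the first document." with their document frequencies
row = {"document": idf(2), "first": idf(1), "is": idf(1), "the": idf(1), "this": idf(1)}
norm = math.sqrt(sum(v ** 2 for v in row.values()))  # L2 norm of the row
print(round(row["document"] / norm, 8))  # 0.35543247, matching the matrix above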

Text Preprocessing Steps

Now, let’s apply these techniques step by step to preprocess a sample text string.

Convert text to lowercase.

sample_text = "NLP is an exciting field! It enables machines to comprehend, interpret, and generate human language. Learn more at https://raqibcodes.com. #NLP #MachineLearning @NLPCommunity 2023303"

# Convert text to lowercase
def convert_to_lowercase(text):
    return text.lower()

# Apply the lowercase conversion
lowercased_text = convert_to_lowercase(sample_text)
print("Lowercased Text:", lowercased_text)
Lowercased Text: nlp is an exciting field! it enables machines to comprehend, interpret, and generate human language. learn more at https://raqibcodes.com. #nlp #machinelearning @nlpcommunity 2023303

Remove special characters.

import re

def remove_special_characters(text):
    # Remove URLs, hashtags, and mentions, then any remaining special characters
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    text = re.sub(r"[^\w\s.]", "", text)
    return text

# Apply removing special characters
text_no_special_chars = remove_special_characters(lowercased_text)
print("Text without special characters:", text_no_special_chars)
Text without special characters: nlp is an exciting field it enables machines to comprehend interpret and generate human language. learn more at 2023303

Remove numbers and digits.

def remove_numbers(text):
    # Remove numbers/digits
    text = re.sub(r'\d+(\.\d+)?', '', text)
    return text

# Apply removing numbers/digits
text_no_numbers = remove_numbers(text_no_special_chars)
print("Text without numbers/digits:", text_no_numbers)
Text without numbers/digits: nlp is an exciting field it enables machines to comprehend interpret and generate human language. learn more at 

Tokenize text.

from nltk.tokenize import word_tokenize

def tokenize_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    return tokens

# Apply tokenization
tokens = tokenize_text(text_no_numbers)
print("Tokens:", tokens)
Tokens: ['nlp', 'is', 'an', 'exciting', 'field', 'it', 'enables', 'machines', 'to', 'comprehend', 'interpret', 'and', 'generate', 'human', 'language', '.', 'learn', 'more', 'at']

Remove stopwords.

from nltk.corpus import stopwords

def remove_stopwords(tokens):
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

# Apply removing stop words
tokens_no_stopwords = remove_stopwords(tokens)
print("Tokens without stop words:", tokens_no_stopwords)
Tokens without stop words: ['nlp', 'exciting', 'field', 'enables', 'machines', 'comprehend', 'interpret', 'generate', 'human', 'language', '.', 'learn']

Lemmatize words.

from nltk.stem import WordNetLemmatizer

def lemmatize_words(tokens):
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply lemmatization
lemmatized_tokens = lemmatize_words(tokens_no_stopwords)
print("Lemmatized Tokens:", lemmatized_tokens)
Lemmatized Tokens: ['nlp', 'exciting', 'field', 'enables', 'machine', 'comprehend', 'interpret', 'generate', 'human', 'language', '.', 'learn']

Apply stemming.

from nltk.stem import PorterStemmer

def apply_stemming(tokens):
    # Apply stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Apply stemming
stemmed_tokens = apply_stemming(lemmatized_tokens)
print("Stemmed Tokens:", stemmed_tokens)
Stemmed Tokens: ['nlp', 'excit', 'field', 'enabl', 'machin', 'comprehend', 'interpret', 'gener', 'human', 'languag', '.', 'learn']

Join tokens back into a single string.

def join_tokens(tokens):
    # Join tokens back into a single string
    return ' '.join(tokens)

# Join the lemmatized tokens (rather than the harsher stemmed versions) into a string
preprocessed_text = join_tokens(lemmatized_tokens)
print("Preprocessed Text:", preprocessed_text)
Preprocessed Text: nlp exciting field enables machine comprehend interpret generate human language . learn
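For convenience, the individual steps can be chained into a single helper. This is a minimal sketch that assumes all of the functions defined above are in scope:

def preprocess(text):
    # Chain the preprocessing steps defined in this section
    text = convert_to_lowercase(text)
    text = remove_special_characters(text)
    text = remove_numbers(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize_words(tokens)
    return join_tokens(tokens)

print("Preprocessed Text:", preprocess(sample_text))
Preprocessed Text: nlp exciting field enables machine comprehend interpret generate human language . learn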

Apply POS tagging.

from nltk import pos_tag

def pos_tagging(tokens):
    # Perform POS tagging
    pos_tags = pos_tag(tokens)
    return pos_tags

# Apply POS tagging
pos_tags = pos_tagging(lemmatized_tokens)
print("POS Tags:", pos_tags)
POS Tags: [('nlp', 'RB'), ('exciting', 'JJ'), ('field', 'NN'), ('enables', 'NNS'), ('machine', 'NN'), ('comprehend', 'VBP'), ('interpret', 'JJ'), ('generate', 'NN'), ('human', 'JJ'), ('language', 'NN'), ('.', '.'), ('learn', 'VB')]

Meaning of tags

  • ('nlp', 'RB'): 'nlp' is tagged as an adverb (RB).
  • ('exciting', 'JJ'): 'exciting' is an adjective (JJ).
  • ('field', 'NN'): 'field' is a noun (NN).
  • ('enables', 'NNS'): 'enables' is a plural noun (NNS).
  • ('machine', 'NN'): 'machine' is a noun (NN).
  • ('comprehend', 'VBP'): 'comprehend' is a verb, non-3rd-person singular present (VBP).
  • ('interpret', 'JJ'): 'interpret' is an adjective (JJ).
  • ('generate', 'NN'): 'generate' is a noun (NN).
  • ('human', 'JJ'): 'human' is an adjective (JJ).
  • ('language', 'NN'): 'language' is a noun (NN).
  • ('.', '.'): '.' denotes a punctuation mark (period).
  • ('learn', 'VB'): 'learn' is a verb in the base form (VB).

Note that several of these tags are wrong ('nlp' as an adverb, 'enables' as a plural noun, 'interpret' as an adjective): lowercasing, stopword removal, and lemmatization strip away the sentence structure the tagger relies on. In practice, POS tagging works best on raw, unprocessed sentences.

Apply Bag of Words Representation.

from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words_representation(text):
    # Initialize CountVectorizer
    vectorizer = CountVectorizer()
    # Transform the text into a Bag of Words representation
    bow_representation = vectorizer.fit_transform([text])
    return bow_representation, vectorizer

# Apply Bag of Words representation
bow_representation, vectorizer = bag_of_words_representation(preprocessed_text)
print("Bag of Words representation:")
print(bow_representation.toarray())
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
Bag of Words representation:
[[1 1 1 1 1 1 1 1 1 1 1]]
Vocabulary: ['comprehend', 'enables', 'exciting', 'field', 'generate', 'human', 'interpret', 'language', 'learn', 'machine', 'nlp']

Apply TF-IDF representation.

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_representation(text):
    # Initialize TfidfVectorizer
    vectorizer = TfidfVectorizer()
    # Transform the text into a TF-IDF representation
    tfidf_representation = vectorizer.fit_transform([text])
    return tfidf_representation, vectorizer

# Apply TF-IDF representation
tfidf_representation, tfidf_vectorizer = tfidf_representation(preprocessed_text)
print("\nTF-IDF representation:")
print(tfidf_representation.toarray())
print("Vocabulary:", list(tfidf_vectorizer.get_feature_names_out()))
TF-IDF representation:
[[0.30151134 0.30151134 0.30151134 0.30151134 0.30151134 0.30151134
0.30151134 0.30151134 0.30151134 0.30151134 0.30151134]]
Vocabulary: ['comprehend', 'enables', 'exciting', 'field', 'generate', 'human', 'interpret', 'language', 'learn', 'machine', 'nlp']

With a single document, every term shares the same idf, so all 11 terms receive the same weight (1/√11 ≈ 0.3015). TF-IDF only becomes informative when the corpus contains multiple documents.

Natural Language Processing (NLP) is the bridge between human language and machines. In this article, we’ve uncovered fundamental terms like ‘corpus,’ ‘vocabulary,’ and key techniques including ‘tokenization,’ ‘lemmatization,’ and ‘POS tagging.’ These are the building blocks for advanced NLP applications, pushing AI towards more human-like interactions.

Thanks for reading 🤓. You can check out my GitHub repo, where you can review an NLP project in which I applied all of the above techniques and more advanced ones. Cheers!


Published via Towards AI
