
Natural Language Processing: Concepts and Workflow

Last Updated on November 3, 2020 by Editorial Team

Author(s): Bala Priya C

Photo by Amador Loureiro on Unsplash

With the huge influx of unstructured text data from a plethora of social media platforms, forums, and a whole wealth of documents, distilling the information these sources contain is challenging because of the inherent complexity involved in processing them. Natural Language Processing (NLP) helps greatly in processing, analyzing, and understanding these sources to gain information and meaningful insights. With the recent advances in computing and easier access to computing resources, certain Deep Learning models have achieved state-of-the-art (SOTA) results on some of the most challenging NLP tasks. The NLP series by the Women Who Code Data Science track gives learners a comprehensive learning path, starting from the basics of NLP and gradually introducing advanced concepts like Deep Learning approaches to solving NLP tasks. This blog post covers module 1 of the series.

What is NLP?

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language; in particular, how to program computers to process and analyze large amounts of natural language data. With interesting applications such as text classification, sentiment analysis, machine translation, speech to text, text to speech, and so on, NLP has evolved over the past few decades from rule-based approaches and statistical techniques to the AI-powered applications of the recent past.

Image Source: https://pegus.digital/business-applications-of-nlp/

Interesting use cases of NLP

Let’s take a look at some of the common use cases of NLP.

  • Machine Translation: Machine Translation is the task of automatically converting text in one natural language into another while preserving the meaning of the input and producing fluent text in the output language. However, machine translation comes with inherent challenges of its own, some of which are discussed in the challenges section below.
  • Text Classification: Text Classification is the process of assigning tags or categories to text according to its content. It's a fundamental problem in NLP and can be done either manually (tedious, time-consuming, and susceptible to human error) or by leveraging ML techniques (a minimal sketch follows this list).
  • Sentiment Analysis: Sentiment Analysis is the contextual mining of text that identifies and extracts subjective information in the source text, such as recognizing polarity (positive, negative, neutral), identifying emotions, etc. A typical example is the e-commerce industry, where mining and analyzing reviews to gain insights on customer satisfaction and experience and to identify potential areas for improvement is important.
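
As referenced in the text classification item above, here is a minimal, illustrative sketch of ML-based text classification with scikit-learn; the toy texts, labels, and pipeline below are assumptions for demonstration only, not part of the original article.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled corpus (hypothetical) for a tiny sentiment classifier
texts = ["great product, loved it", "terrible, waste of money",
         "absolutely fantastic", "awful experience"]
labels = ["pos", "neg", "pos", "neg"]

# TF-IDF features + Naive Bayes classifier in a single pipeline
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["this product is fantastic"]))  # expected: ['pos'] on this toy data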

Virtual assistants such as Siri, Alexa, and Cortana, as well as Google Translate and speech-to-text and text-to-speech converters, are all cool NLP applications that we use in our everyday lives!

Challenges in understanding natural language

Natural language has such great diversity, and every language has its own rich grammar and uniqueness. The following are some of the inherent challenges that arise in NLP tasks.

Ambiguity

Ambiguity is an intrinsic characteristic of human conversations and is particularly challenging in Natural Language Understanding, where the same expression can take different forms and carry different meanings that both natural language and the AI system we've programmed must handle. In AI theory, the process of handling ambiguity is called disambiguation.

Synonymity

Synonymity stems from the fact that we can express the same idea with different terms (which are also dependent on the specific context). For example, 'big' and 'large' have a similar meaning when referring to sizes, whereas 'large' doesn't make sense when used as a qualifier to the word 'sister.'
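
As a quick illustration (an added sketch, not from the original article), NLTK's WordNet interface, which is also used later for lemmatization, can surface synonyms of a word; the specific synset inspected here is an assumption.

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

# Lemmas of the first adjective sense of 'big' include 'large'
print([lemma.name() for lemma in wordnet.synsets('big', pos=wordnet.ADJ)[0].lemmas()])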

Co-reference

Co-reference resolution is the process of finding all expressions that refer to the same entity in a text. It is an important step for many higher-level NLP tasks that involve natural language understanding and is often instrumental in improving the performance of neural architectures like RNNs and LSTMs.

Syntactic Rules

Knowledge about the structure and syntax of the language is often helpful, and some of the typical parsing techniques for understanding text syntax are described below.

  • POS Tagging: Parts of speech (POS) are specific lexical categories to which words are assigned, based on their role and context in a given sentence.
Illustration of POS tagging (Image Source)

For example, in the above sentence, "The brown fox is quick, and he is jumping over the lazy dog," the abbreviations denote the following parts of speech: DET: Determiner, ADJ: Adjective, N: Noun, V: Verb, CONJ: Conjunction (coordinating), PRON: Pronoun, ADV: Adverb. (A minimal POS tagging sketch using NLTK is shown after this list.)

  • Shallow Parsing/Chunking: Shallow parsing, also known as chunking, is a method of analyzing the structure of a sentence and breaking it down into its smallest constituents, which usually are tokens such as words, and then grouping them together into phrases.
Example of Shallow Parsing/Chunking (Image Source)
  • Constituency Parsing: Constituency parsing aims to extract a constituency-based parse tree from a sentence. The parse tree represents its syntactic structure according to a phrase structure grammar.
Example of Constituency Parsing (Image Source)
  • Dependency Parsing: Dependency parsing is the task of analyzing the grammatical structure of a sentence by establishing relationships between "head" words and the words which modify those heads.
Example of Dependency Parsing (Image Source)
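
As referenced above, here is a minimal sketch of POS tagging the example sentence with NLTK (assuming NLTK and its tokenizer/tagger resources are installed); note that the output uses Penn Treebank tags rather than the simplified labels above.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The brown fox is quick, and he is jumping over the lazy dog"
tokens = nltk.word_tokenize(sentence)
# Each token is paired with its Penn Treebank tag, e.g. ('quick', 'JJ')
print(nltk.pos_tag(tokens))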

Generic NLP Workflow

Standard NLP Workflow (Image Source)

The standard workflow for an NLP problem comprises the above-shown steps. The first step is usually text wrangling and pre-processing on the corpus of documents, followed by parsing and basic exploratory data analysis. As the next step, we look at representing text with word embeddings and subsequent feature engineering, followed by choosing the model depending on whether we're looking at a supervised/unsupervised learning problem. As with any ML workflow, the final stage involves model evaluation and deployment. This module covers the initial steps of text pre-processing and EDA.

Text pre-processing and Exploratory Data Analysis (EDA)

Significance of EDA

Exploratory Data Analysis (EDA) is the process of exploring data, generating insights, verifying assumptions, and revealing underlying hidden patterns in the data. Through these steps, we can get a basic description of the data, visualize it, and identify patterns and potential challenges in using it.

Text preprocessing

  • Contraction Mapping/Expanding Contractions: Contractions are shortened versions of words or groups of words and are quite common in both spoken and written English, such as I will to I'll, I have to I've, and do not to don't. Mapping these contractions to their expanded form helps in text standardization.
  • Tokenization: Tokenization is the process of separating a piece of text into smaller units called tokens. Given a document, tokens can be sentences, words, subwords, or even characters depending on the application.
  • Noise cleaning: Special characters and symbols contribute extra noise in unstructured text. Removing them with regular expressions, or using tokenizers that handle the pre-processing step of removing punctuation marks and other special characters, is recommended.
  • Spell-checking: Documents in a corpus are prone to spelling errors. To make the text clean for subsequent processing, it is good practice to run a spell checker and fix spelling errors before moving on to the next steps.
  • Stopwords Removal: Stop words are words that are very common and often less significant, so removing them is a standard pre-processing step as well. This can be done explicitly by retaining only those words in the document which are not in the stop word list, or by specifying the stop word list as an argument to the CountVectorizer or TfidfVectorizer methods when computing Bag-of-Words (BoW)/TF-IDF scores for the corpus of text documents.
  • Stemming/Lemmatization: Both stemming and lemmatization are methods to reduce words to their base form. While stemming follows certain rules to truncate words to their base form, often resulting in words that are not lexicographically correct, lemmatization always produces base forms that are lexicographically correct. However, stemming is a lot faster than lemmatization, so the choice between the two depends on whether the application needs quick pre-processing or more accurate base forms (a short sketch contrasting the two follows this list).
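
As referenced above, here is a minimal sketch contrasting stemming and lemmatization with NLTK (assuming the WordNet corpus has been downloaded); the example word is chosen purely for illustration.

import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming truncates by rule and may produce non-words ('studi'),
# while lemmatization returns dictionary forms ('study').
print(stemmer.stem("studies"))                   # studi
print(lemmatizer.lemmatize("studies", pos="v"))  # study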

Implementation

Let’s walk through the steps of EDA and text pre-processing in Google Colab.

Dataset

The dataset used is the SMS Spam Collection Data Set, a public collection of labeled SMS messages collected for mobile phone spam research, available in the UCI ML repository.

Basic EDA

Let's first load in the data using the pandas library. Setting header=None ensures that the first row of our data will not be interpreted as the column names of the data frame. Let's also call the head() method on the dataframe to check the first few records of our data.

import pandas as pd
sms = pd.read_table('/content/SMSSpamCollection', header=None)
sms.head()

We can use the describe() function to obtain various summary statistics that exclude NaN values.

sms.describe()

From the tables above, we see that we have a collection of 5,572 SMS messages in English, serving as training examples. The first column is the target variable containing the class labels, which tells us if the message is spam or ham (not spam). The second column is the SMS message itself, stored as a string. Since the target variable contains discrete values, this is a classification task. Let's start by placing the target variable in its own table and checking how the two classes are distributed.

y = sms[0]
y.value_counts()
# Output:
# ham     4825
# spam     747
# Name: 0, dtype: int64

We see that there are far fewer training examples for spam than for ham. This is a typical class imbalance problem and should be accounted for in the subsequent analysis.

We need to encode the class labels in the target variable as numbers to ensure compatibility with some models in scikit-learn. Let's use LabelEncoder and set 'spam' = 1 and 'ham' = 0. LabelEncoder is a part of sklearn's pre-processing utilities.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y_enc = le.fit_transform(y)
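
As a quick sanity check (an added illustration, not part of the original walkthrough), we can inspect the classes the encoder learned; LabelEncoder assigns integer codes in alphabetical order, so 'ham' maps to 0 and 'spam' to 1 as intended.

# Inspect the learned label order: index 0 -> 'ham', index 1 -> 'spam'
print(le.classes_)   # ['ham' 'spam']
print(y_enc[:5])     # first few encoded labels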

We now store all the SMS text data in raw_text. The isnull() method of pandas can be used to gain insights on missing values. The given dataset, however, is complete, and there are no missing values.
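
A minimal sketch of those two steps might look like the following; the raw_text name comes from the text above, while the exact checks shown are an assumption.

raw_text = sms[1]            # the column holding the SMS message text
print(sms.isnull().sum())    # count of missing values per column (all zeros here)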

Basic Visualization

There are a couple of basic visualizations we can do. The first is to display the length of all the records. To do this, we must first label the columns with their appropriate titles and add a column to the dataframe that contains the length.

import matplotlib.pyplot as plt
import seaborn as sns
sms.columns=['label', 'msg']
sms["length"] = sms["msg"].apply(len)
sms.head()
sns.distplot(sms["length"], kde=False)
Plot showing the distribution of message lengths

Text Pre-processing

Let's now apply the pre-processing steps discussed above to our dataset.

Step 1: Contraction Mapping

Let's install and import the contractions library and apply it to the message strings to expand contractions, if any.

!pip install contractions
import contractions

# Add a new column to our dataframe called "no_contract" by applying a lambda
# function to the "msg" field which expands contractions
sms['no_contract'] = sms['msg'].apply(lambda x: [contractions.fix(word) for word in x.split()])

Since the expanded text is stored as a list of words and will need to be tokenized separately, let's convert it back to a string and examine the dataframe again.

sms["msg_str"] = [' '.join(map(str, l)) for l in sms['no_contract']]
sms.head()

Step 2: Tokenization

As discussed above, to tokenize the document into words, let's install and import the NLTK library and apply the word_tokenize() method to each of the message strings.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sms['tokenized'] = sms['msg_str'].apply(word_tokenize)

Step 3: Noise Cleaning: Removing special characters and punctuation, and lowercasing the text.

Let's convert all text to lower case and subsequently remove punctuation.

import string

sms['lower'] = sms['tokenized'].apply(lambda x: [word.lower() for word in x])
punc = string.punctuation
sms['no_punc'] = sms['lower'].apply(lambda x: [word for word in x if word not in punc])
sms.head()

Step 4: Spell Checking

Let's go over a simple example of using pyspellchecker, a library for determining if a word is misspelled and what the likely correct spelling would be based on word frequency.

!pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))
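
The snippet above checks a standalone word list. If you wanted to fold spell correction into the pipeline, a sketch along the following lines could apply it to the token column created earlier; the spell_checked column name and the helper function are assumptions, and correcting every token across all 5,572 messages can be slow, so this is illustrative only.

def correct_tokens(tokens):
    # Find tokens the checker doesn't recognize, then replace each with its
    # most likely correction, falling back to the original token if none is found
    misspelled = spell.unknown(tokens)
    return [(spell.correction(w) or w) if w in misspelled else w for w in tokens]

sms['spell_checked'] = sms['no_punc'].apply(correct_tokens)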

Step 5: Identifying and removing stopwords

To identify and remove stopwords, we need to import the NLTK stopwords corpus, load its English stopword list, and then retain only those words in our text that are not stopwords.

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

sms['stopwords_removed'] = sms['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
sms.head()

Step 6: POS tagging and Lemmatization

To apply lemmatization to our data, we first have to apply part-of-speech tags; in other words, determine the part of speech for each word.

nltk.download('averaged_perceptron_tagger')
sms['pos_tags'] = sms['stopwords_removed'].apply(nltk.tag.pos_tag)

NLTK’s lemmatizer requires POS tags to be converted to wordnet’s format. We’ll write a function that makes the conversion.

nltk.download('wordnet') 
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

sms['wordnet_pos'] = sms['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
sms.head()

We may now call WordNetLemmatizer on the POS-tagged data. The lemmatizer function requires two parameters: the word and its tag, in WordNet form.

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
sms['lemmatized'] = sms['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
sms.head()

Now that we've pre-processed our text data so it can be used in the further steps of the pipeline, let's save the cleaned data to a CSV file.

sms.to_csv('sms_spam_collection.csv')

The Google Colab notebook for the above implementation can be found in this repo.

The recording of the webinar can be found on YouTube.

Useful references

  1. Spam Ham Detection Notebook
  2. Spam Classification Dataset
  3. DataCamp Tutorial on EDA in Python
