Natural Language Processing: Concepts and Workflow
Last Updated on November 3, 2020 by Editorial Team
Author(s): Bala Priya C
With the huge influx of unstructured text data from a plethora of social media platforms, forums, and a wealth of documents, it is evident that distilling the information these sources contain is challenging because of the inherent complexity involved in processing them. Natural Language Processing (NLP) helps greatly in processing, analyzing, and understanding these sources to gain information and meaningful insights. With recent advances in computing and easier access to computing resources, certain deep learning models have achieved state-of-the-art (SOTA) results on some of the most challenging NLP tasks. The NLP series by the Women Who Code Data Science track gives learners a comprehensive learning path, starting from the basics of NLP and gradually introducing advanced concepts such as deep learning approaches to NLP tasks. This blog post covers module 1 of the series.
What is NLP?
Natural language processing (NLP) can be considered a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language; in particular, how to program computers to process and analyze large amounts of natural language data. With interesting applications such as text classification, sentiment analysis, machine translation, speech to text, and text to speech, NLP has evolved over the past few decades from rule-based approaches and statistical techniques to the AI-powered applications of the recent past.
Interesting use cases of NLP
Let's take a look at some of the common use cases of NLP.
- Machine Translation: Machine translation is the task of automatically converting text in one natural language into another while preserving the meaning of the input and producing fluent text in the output language. However, machine translation comes with inherent challenges, such as resolving ambiguity in the source text and handling grammatical differences between languages.
- Text Classification: Text classification is the process of assigning tags or categories to text according to its content. It is a fundamental problem in NLP and can be done either manually (tedious, time-consuming, and susceptible to human error) or by leveraging ML techniques.
- Sentiment Analysis: Sentiment analysis is the contextual mining of text that identifies and extracts subjective information from the source text, such as recognizing polarity (positive, negative, neutral) and identifying emotions. A typical example is the e-commerce industry, where mining and analyzing reviews to gain insights on customer satisfaction and experience, and to identify potential areas for improvement, is important.
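As a quick illustration of polarity scoring (a minimal sketch, not part of the original module; it assumes NLTK's VADER analyzer, which is designed for short, informal text):

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER returns negative, neutral, positive, and compound scores for a piece of text
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The delivery was quick and the product works great!"))
# {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}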
Virtual assistants such as Siri, Alexa, and Cortana; Google Translate; and speech-to-text and text-to-speech converters are all NLP applications that we use in our everyday lives!
Challenges in understanding natural language
Natural language has such great diversity, and every language has its own rich grammar and uniqueness. The following are some of the inherent challenges that arise in NLP tasks.
Ambiguity
Ambiguity is an intrinsic characteristic of human language and is particularly challenging in natural language understanding, where a word, phrase, or sentence can have more than one valid interpretation; for example, in "I saw a man with a telescope," the telescope may belong to either the speaker or the man. In AI theory, the process of handling ambiguity is called disambiguation.
Synonymity
Synonymity stems from the fact that we can express the same idea with different terms, which are also dependent on the specific context; for example, "big" and "large" have a similar meaning when referring to sizes, whereas "large" doesn't make sense when used as a qualifier for the word "sister" (we say "big sister," not "large sister").
Co-reference
Co-reference resolution is the process of finding all expressions that refer to the same entity in a text; for example, in "Mary said she would join the webinar," the pronoun "she" refers to "Mary." Co-reference resolution is an important step for many higher-level NLP tasks that involve natural language understanding and is often instrumental in improving the performance of neural architectures like RNNs and LSTMs.
Syntactic Rules
Knowledge about the structure and syntax of the language is often helpful, and some of the typical parsing techniques for understanding text syntax are described below.
- POS Tagging: Parts of speech (POS) are specific lexical categories to which words are assigned, based on their role and context in a given sentence.
For example, in the sentence "The brown fox is quick, and he is jumping over the lazy dog," the abbreviations denote the following parts of speech; DET: Determiner, ADJ: Adjective, N: Noun, V: Verb, CONJ: Conjunction (coordinating), PRON: Pronoun, ADV: Adverb. A minimal POS-tagging sketch is shown after this list.
- Shallow Parsing/Chunking: Shallow parsing, also known as chunking, is a method of analyzing the structure of a sentence by breaking it down into its smallest constituents, which usually are tokens such as words, and then grouping them together into phrases.
- Constituency Parsing: Constituency parsing aims to extract a constituency-based parse tree from a sentence. The parse tree represents the sentence's syntactic structure according to a phrase structure grammar.
- Dependency Parsing: Dependency parsing is the task of analyzing the grammatical structure of a sentence by establishing relationships between "head" words and the words that modify those heads.
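To make the POS-tagging item above concrete, here is a minimal sketch using NLTK's off-the-shelf tagger on the example sentence (using the universal tagset is an assumption made here so that the labels resemble the abbreviations listed above; NLTK's default output is Penn Treebank tags):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
from nltk import word_tokenize, pos_tag

sentence = "The brown fox is quick, and he is jumping over the lazy dog"

# Tag each token with a coarse part-of-speech label (DET, ADJ, NOUN, VERB, ...)
print(pos_tag(word_tokenize(sentence), tagset='universal'))
# [('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'), ...]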
Generic NLP Workflow
The standard workflow for an NLP problem comprises the following steps. The first step is usually text wrangling and pre-processing on the corpus of documents, followed by parsing and basic exploratory data analysis. Next, we represent the text with word embeddings, perform feature engineering, and then choose a model depending on whether we're looking at a supervised or unsupervised learning problem. As with any ML workflow, the final stage involves model evaluation and deployment. This module covers the initial steps of text pre-processing and EDA.
Text pre-processing and Exploratory Data Analysis (EDA)
Significance of EDA
Exploratory Data Analysis (EDA) is the process of exploring data, generating insights, verifying assumptions, and revealing underlying hidden patterns in the data. Through EDA, we can get a basic description of the data, visualize it, and identify patterns as well as potential challenges in using the data.
Text preprocessing
- Contraction Mapping/Expanding Contractions: Contractions are shortened versions of words or groups of words, quite common in both spoken and written language. In English, examples include I will to I'll, I have to I've, and do not to don't. Mapping these contractions to their expanded forms helps in text standardization.
- Tokenization: Tokenization is the process of separating a piece of text into smaller units called tokens. Given a document, tokens can be sentences, words, subwords, or even characters, depending on the application.
- Noise Cleaning: Special characters and symbols contribute extra noise in unstructured text. It is recommended to remove them using regular expressions, or to use tokenizers that strip punctuation marks and other special characters as part of the pre-processing step.
- Spell-checking: Documents in a corpus are prone to spelling errors. To make the text clean for subsequent processing, it is good practice to run a spell checker and fix the spelling errors before moving on to the next steps.
- Stopwords Removal: Stop words are words that are very common and often carry little significance, so removing them is a standard pre-processing step as well. This can be done explicitly by retaining only those words in the document that are not in the stop word list, or by passing the stop word list as an argument to the CountVectorizer or TfidfVectorizer methods when computing Bag-of-Words (BoW)/TF-IDF scores for the corpus of text documents.
- Stemming/Lemmatization: Both stemming and lemmatization are methods to reduce words to their base form. While stemming follows certain rules to truncate words to their base form, often resulting in words that are not lexicographically correct, lemmatization always produces base forms that are lexicographically correct. However, stemming is a lot faster than lemmatization, so whether to stem or lemmatize depends on whether the application needs quick pre-processing or more accurate base forms; a small comparison is sketched below.
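As a quick comparison of the two (a minimal sketch; the choice of NLTK's PorterStemmer and WordNetLemmatizer here is an assumption, and the implementation later in this post uses only lemmatization):

import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming truncates by rule and can produce non-words ("studi"),
# while lemmatization returns a lexicographically correct base form ("study")
print(stemmer.stem("studies"))                   # studi
print(lemmatizer.lemmatize("studies", pos="v"))  # study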
Implementation
Let's walk through the steps of EDA and text pre-processing in Google Colab.
Dataset
The dataset used is the SMS Spam Collection Data Set, a public collection of labeled SMS messages collected for mobile phone spam research, available in the UCI ML repository.
Basic EDA
Let's first load the data using the pandas library. Setting header=None ensures that the first row of our data will not be interpreted as the column names of the dataframe. Let's also call the head() method on the dataframe to check the first few records of our data.
import pandas as pd
sms = pd.read_table('/content/SMSSpamCollection', header=None)
sms.head()
We can use the describe() function to obtain various summary statistics that exclude NaN values.
sms.describe()
From the describe() output, we see that we have a collection of text data with 5,572 SMS messages in English, serving as training examples. The first column is the target variable containing the class labels, which tells us if the message is spam or ham (not spam). The second column is the SMS message itself, stored as a string. Since the target variable contains discrete values, this is a classification task. Let's start by placing the target variable in its own table and checking how the two classes are distributed.
y = sms[0]
y.value_counts()

# Output
# ham     4825
# spam     747
# Name: 0, dtype: int64
We see that there are far fewer training examples for spam than for ham. This is a typical class imbalance problem and should be accounted for in the subsequent analysis (a small sketch follows the label-encoding step below).
We need to encode the class labels in the target variable as numbers to ensure compatibility with some models in scikit-learn. Let's use LabelEncoder and set 'spam' = 1 and 'ham' = 0. LabelEncoder is a part of scikit-learn's pre-processing utilities.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y_enc = le.fit_transform(y)
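Because of the class imbalance noted above, one common precaution (an illustrative addition here, not part of the original walkthrough) is to use a stratified split later on, so that any held-out set preserves the ham/spam ratio:

from sklearn.model_selection import train_test_split

# stratify=y_enc keeps the spam/ham proportions identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    sms[1], y_enc, test_size=0.2, stratify=y_enc, random_state=42
)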
We now store all the SMS text data in raw_text. The isnull() method of pandas can be used to gain insight into missing values; the given dataset, however, is complete, and there are no missing values. A short sketch of both steps is shown below.
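A minimal sketch of both steps (assuming the messages are still in column 1 of the dataframe, as loaded above):

# Store the raw SMS text in its own series
raw_text = sms[1]

# Count missing values per column; all zeros here, since the dataset is complete
sms.isnull().sum()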
Basic Visualization
There are a couple of basic visualizations we can do. The first is to plot the distribution of message lengths (a second, comparing lengths by class, is sketched after the code below). To do this, we must first label the columns with appropriate titles and add a column to the dataframe that contains the length of each message.
import matplotlib.pyplot as plt
import seaborn as sns
sms.columns=['label', 'msg']
sms["length"] = sms["msg"].apply(len)
sms.head()
sns.distplot(sms["length"], kde=False)
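A possible second visualization (a sketch added here; the specific plot is an assumption, not taken from the original notebook) is to overlay the message-length distributions of ham and spam to see whether the two classes differ:

# Overlay length distributions for the two classes (the columns were renamed above)
sns.distplot(sms[sms["label"] == "ham"]["length"], kde=False, label="ham")
sns.distplot(sms[sms["label"] == "spam"]["length"], kde=False, label="spam")
plt.legend()
plt.show()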
Text Pre-processing
Let's now apply the pre-processing steps discussed above to our dataset.
Step 1: Contraction Mapping
Let's install and import the contractions library and apply it to the message strings to expand contractions, if any.
!pip install contractions
import contractions

# Add a new column to our dataframe called 'no_contract' by applying a
# lambda function to the "msg" field that expands contractions in each message
sms['no_contract'] = sms['msg'].apply(lambda x: [contractions.fix(word) for word in x.split()])
Expanding contractions leaves each message as a list of words; since the text with contractions expanded should be tokenized as a whole, let's convert the lists back to strings and examine the dataframe again.
sms["msg_str"] = [' '.join(map(str, l)) for l in sms['no_contract']]
sms.head()
Step 2: Tokenization
As discussed above, to tokenize the document into words, let's import the NLTK library, download the 'punkt' tokenizer models, and apply the word_tokenize() method to each of the message strings.
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sms['tokenized'] = sms['msg_str'].apply(word_tokenize)
Step 3: Noise Cleaning (removing special characters and lowercasing text)
Let's convert all text to lower case and subsequently remove punctuation.
sms['lower'] = sms['tokenized'].apply(lambda x: [word.lower() for word in x])

import string
punc = string.punctuation
sms['no_punc'] = sms['lower'].apply(lambda x: [word for word in x if word not in punc])

sms.head()
Step 4: Spell Checking
Let's go over a simple example of using pyspellchecker, a library for determining if a word is misspelled and what the likely correct spelling would be based on word frequency.
!pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

# Find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))
Step 5: Identifying and removing stopwords
To identify and remove stopwords, we import NLTK's stopwords corpus, load the English stopword list, and then retain only those words in our text that are not in that list.
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
sms['stopwords_removed'] = sms['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])

sms.head()
Step 6: POS tagging and Lemmatization
To apply lemmatization to our data, we first have to apply part-of-speech tags; in other words, determine the part of speech for each word.
nltk.download('averaged_perceptron_tagger')
sms['pos_tags'] = sms['stopwords_removed'].apply(nltk.tag.pos_tag)
NLTK's lemmatizer requires POS tags to be converted to WordNet's format. We'll write a function that makes the conversion.
nltk.download('wordnet')
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

sms['wordnet_pos'] = sms['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])

sms.head()
We may now call WordNetLemmatizer on the POS-tagged data. The lemmatizer function requires two parameters: the word and its tag, in WordNet form.
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
sms['lemmatized'] = sms['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
sms.head()
Now that we've pre-processed our text data so that it can be used in further steps of the pipeline, let's save the cleaned data to a CSV file.
sms.to_csv('sms_spam_collection.csv')
The Google Colab notebook for the above implementation can be found in this repo.
The recording of the webinar can be found on YouTube.
Useful references
- Spam Ham Detection Notebook
- Spam Classification Dataset
- DataCamp Tutorial on EDA in Python