Natural Language Processing: Concepts and Workflow
Author(s): Bala Priya C
With the huge influx of unstructured text data from social media platforms, forums, and a wealth of documents, processing these sources to distill the information they contain is challenging because of their inherent complexity. Natural Language Processing (NLP) helps greatly in processing, analyzing, and understanding these sources to gain information and meaningful insights. With recent advances in computing and easier access to computing resources, certain Deep Learning models have achieved state-of-the-art (SOTA) results on some of the most challenging NLP tasks. The NLP series by the Women Who Code Data Science track gives learners a comprehensive learning path, starting from the basics of NLP and gradually introducing advanced concepts such as Deep Learning approaches to NLP tasks. This blog post covers module 1 of the series.
What is NLP?
Natural language processing (NLP) can be considered a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language; in particular, how to program computers to process and analyze large amounts of natural language data. With interesting applications such as text classification, sentiment analysis, machine translation, speech to text, and text to speech, NLP has evolved over the past few decades from rule-based approaches, through statistical techniques, to the AI-powered applications of the recent past.
Interesting use cases of NLP
Let’s take a look at some of the common use cases of NLP.
- Machine Translation: Machine Translation is the task of automatically converting text in one natural language into another while preserving the meaning of the input text and producing fluent text in the output language. However, this task comes with inherent challenges of its own.
- Text Classification: Text Classification is the process of assigning tags or categories to text according to its content. It’s a fundamental problem in NLP and can be done either manually (tedious, time-consuming, and susceptible to human error) or by leveraging ML techniques.
- Sentiment Analysis: Sentiment Analysis is the contextual mining of text to identify and extract subjective information from the source text, such as recognizing polarity (positive, negative, neutral) and identifying emotions. A typical example is the e-commerce industry, where mining and analyzing reviews helps gain insights into customer satisfaction and experience and identify potential areas for improvement.
Virtual assistants such as Siri, Alexa, and Cortana, Google Translate, and speech-to-text and text-to-speech converters are all cool NLP applications that we use in our everyday lives!
Challenges in understanding natural language
Natural languages are highly diverse, and every language has its own rich grammar and uniqueness. The following are some of the inherent challenges that arise in NLP tasks.
Ambiguity is an intrinsic characteristic of human conversations and is particularly challenging in Natural Language Understanding scenarios, where the same expression can have several valid interpretations and the AI system that we’ve programmed has to pick the intended one. In AI theory, the process of handling ambiguity is called disambiguation.
Synonymy stems from the fact that we can express the same idea with different terms (which are also dependent on the specific context). For example, ‘big’ and ‘large’ have a similar meaning when referring to sizes, whereas ‘large’ doesn’t work as a substitute for ‘big’ when qualifying the word ‘sister’: a ‘big sister’ is an elder sister, not a large one.
Co-reference occurs when several expressions in a text refer to the same entity, and co-reference resolution is the process of finding all such expressions. Co-reference resolution is an important step for a lot of higher-level NLP tasks that involve natural language understanding and is often instrumental in improving the performance of neural architectures like RNN and LSTM.
Knowledge about the structure and syntax of the language is often helpful, and some of the typical parsing techniques for understanding text syntax are described below.
- POS Tagging: Parts of speech (POS) are specific lexical categories to which words are assigned, based on their role and context in a given sentence.
For example, for the sentence “The brown fox is quick, and he is jumping over the lazy dog,” the tag abbreviations denote the following parts of speech: DET: Determiner, ADJ: Adjective, N: Noun, V: Verb, CONJ: Conjunction (coordinating), PRON: Pronoun, ADV: Adverb.
- Shallow Parsing/Chunking: Shallow parsing, also known as chunking, is a method of analyzing the structure of a sentence and breaking it down into its smallest constituents, which usually are tokens such as words, and then grouping them together into phrases.
- Constituency Parsing: Constituency parsing aims to extract a constituency-based parse tree from a sentence. The parse tree represents its syntactic structure according to a phrase structure grammar.
- Dependency Parsing: Dependency parsing is the task of analyzing the grammatical structure of a sentence by establishing relationships between “head” words and the words which modify those heads.
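To make the idea of shallow parsing a little more concrete, here is a minimal sketch of a noun-phrase chunker in plain Python. The POS tags are hand-assigned for illustration (ADP, for the preposition "over", is an extra tag assumed here), and the rule "optional determiner, any number of adjectives, one or more nouns" is a toy grammar rather than the output of a real tagger or parser.

```python
# Hand-tagged tokens (an assumption for illustration; a real pipeline would use a POS tagger).
tagged = [
    ("The", "DET"), ("brown", "ADJ"), ("fox", "N"), ("is", "V"), ("quick", "ADJ"),
    ("and", "CONJ"), ("he", "PRON"), ("is", "V"), ("jumping", "V"),
    ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"), ("dog", "N"),
]


def chunk_noun_phrases(tagged_tokens):
    """Group tokens matching the toy pattern DET? ADJ* N+ into noun-phrase chunks."""
    chunks = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        # Optional determiner
        if j < n and tagged_tokens[j][1] == "DET":
            j += 1
        # Any number of adjectives
        while j < n and tagged_tokens[j][1] == "ADJ":
            j += 1
        # One or more nouns
        k = j
        while k < n and tagged_tokens[k][1] == "N":
            k += 1
        if k > j:  # found at least one noun, so tokens i..k-1 form a chunk
            chunks.append(" ".join(word for word, _ in tagged_tokens[i:k]))
            i = k
        else:
            i += 1
    return chunks


noun_phrases = chunk_noun_phrases(tagged)
print(noun_phrases)
```

The tokens are first labeled individually and then grouped into phrases, which is the two-stage view of chunking described above. Libraries such as NLTK offer grammar-based chunkers that do this far more robustly.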
Generic NLP Workflow
The standard workflow for an NLP problem comprises the following steps. The first step is usually text wrangling and pre-processing on the corpus of documents, followed by parsing and basic exploratory data analysis. As the next step, we look at representing text with word embeddings and subsequent feature engineering, followed by choosing the model depending on whether we’re looking at a supervised or an unsupervised learning problem. As with any ML workflow, the final stage involves model evaluation and deployment. This module covers the initial steps of text pre-processing and EDA.
Text pre-processing and Exploratory Data Analysis (EDA)
Significance of EDA
Exploratory Data Analysis (EDA) is the process of exploring data, generating insights, verifying assumptions, and revealing underlying hidden patterns in the data. Through these, we can get a basic description of the data, visualize it, and identify patterns and potential challenges of using the data.
Text pre-processing typically involves the following steps.
- Contraction Mapping/Expanding Contractions: Contractions are shortened versions of words or groups of words, and they are quite common in both spoken and written language. English examples include I’ll for I will, I’ve for I have, and don’t for do not. Mapping these contractions to their expanded form helps in text standardization.
- Tokenization: Tokenization is the process of separating a piece of text into smaller units called tokens. Given a document, tokens can be sentences, words, subwords, or even characters depending on the application.
- Noise cleaning: Special characters and symbols contribute to extra noise in unstructured text. Using regular expressions to remove them or using tokenizers, which do the pre-processing step of removing punctuation marks and other special characters, is recommended.
- Spell-checking: Documents in a corpus are prone to spelling errors. To make the text clean for subsequent processing, it is good practice to run a spell checker and fix the spelling errors before moving on to the next steps.
- Stopwords Removal: Stop words are words which are very common and often less significant. Hence, removing these is a pre-processing step as well. This can be done explicitly by retaining only those words in the document which are not in the list of stop words, or by passing the stop word list through the stop_words argument of CountVectorizer or TfidfVectorizer when getting Bag-of-Words (BoW)/TF-IDF scores for the corpus of text documents.
- Stemming/Lemmatization: Both stemming and lemmatization are methods to reduce words to their base form. Stemming follows certain rules to truncate words to their base form, often resulting in stems that are not valid dictionary words, whereas lemmatization results in base forms (lemmas) that are valid dictionary words. However, stemming is a lot faster than lemmatization. Hence, whether to stem or lemmatize depends on whether the application needs quick pre-processing or requires more accurate base forms.
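The difference between stemming and lemmatization is easier to see on a few words. The sketch below uses a deliberately tiny, hand-written suffix list and lemma dictionary (both are assumptions made for illustration, far simpler than a real stemmer or a resource like WordNet) to contrast rule-based truncation with dictionary lookup.

```python
# Toy suffix-stripping rules (an illustrative assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Tiny hand-written lemma dictionary standing in for a lexical resource such as WordNet.
LEMMAS = {"studies": "study", "studying": "study", "ran": "run", "better": "good"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


words = ["studies", "studying", "ran", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)
print(lemmas)
```

The rule-based version is fast but maps "studies" and "studying" to different stems, one of which is not the base form we want, and it leaves irregular forms such as "ran" untouched. The lookup version gets these right but only for words its dictionary knows about.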
Let’s walk through the steps of EDA and text pre-processing in Google Colab.
Let’s first load in the data using the pandas library. Setting header=None ensures that the first row of our data will not be interpreted as the column names of the dataframe. Let’s also call the head() method on the dataframe to check the first few records of our data.
import pandas as pd

sms = pd.read_table('/content/SMSSpamCollection', header=None)
sms.head()
We can use the describe() function to obtain various summary statistics that exclude NaN values.
Our dataset is a collection of 5,572 SMS messages in English, serving as training examples. As the table returned by sms.head() shows, the first column is the target variable containing the class labels, which tell us if the message is spam or ham (not spam). The second column is the SMS message itself, stored as a string. Since the target variable contains discrete values, this is a classification task. Let’s start by placing the target variable in its own variable and checking how the two classes are distributed.
# Column 0 holds the class labels
y = sms[0]
y.value_counts()

# Output
# ham     4825
# spam     747
# Name: 0, dtype: int64
We see that there are far fewer training examples for spam than ham. This is a typical class imbalance problem and should be accounted for in the subsequent analysis.
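To get a feel for how strong the imbalance is, we can turn the counts into class shares. The short sketch below uses only the two counts reported above; the "always predict the majority class" baseline is added here as an illustration of why plain accuracy can be misleading on imbalanced data.

```python
# Class counts as reported by value_counts() above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"always-'{majority_label}' baseline accuracy: {baseline_accuracy:.1%}")
```

A classifier that never flags spam would already be right on roughly 87% of the messages, so metrics such as precision and recall on the spam class are worth tracking alongside accuracy.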
We need to encode the class labels in the target variable as numbers to ensure compatibility with some models in scikit-learn. Let’s use LabelEncoder, which is part of the sklearn.preprocessing module. It assigns integers to the labels in sorted order, so ‘ham’ becomes 0 and ‘spam’ becomes 1.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
y_enc = le.fit_transform(y)
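The mapping of ‘ham’ to 0 and ‘spam’ to 1 follows from how the labels are ordered: the encoder numbers the sorted unique labels. The sketch below mimics that behavior in plain Python on a handful of toy labels; it illustrates the mapping and is not scikit-learn's implementation.

```python
labels = ["ham", "spam", "ham", "ham", "spam"]  # toy labels, not the real dataset

# Sorted unique classes play the role of LabelEncoder's classes_ attribute.
classes = sorted(set(labels))
class_to_index = {label: index for index, label in enumerate(classes)}

# Encode labels as integers, then decode them again to check the round trip.
encoded = [class_to_index[label] for label in labels]
decoded = [classes[index] for index in encoded]

print(class_to_index)
print(encoded)
```

On the fitted encoder from the walkthrough, le.classes_ shows the same ordering, and le.inverse_transform() converts the integers back to the original labels.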
There are a couple of basic visualizations we can do. The first is to display the length of all the records. To do this, we must first label the columns with their appropriate titles and add a column to the dataframe that contains the length.
import matplotlib.pyplot as plt
import seaborn as sns

sms.columns = ['label', 'msg']
sms["length"] = sms["msg"].apply(len)
sms.head()
Let’s now apply the above discussed pre-processing steps to our dataset.
Step 1: Contraction Mapping
Let’s install and import the contractions library and apply it to the message strings to expand contractions, if any.
!pip install contractions
import contractions

# Add a new column to our dataframe called "no_contract":
# apply a lambda function to the "msg" field which expands contractions
sms['no_contract'] = sms['msg'].apply(lambda x: [contractions.fix(word) for word in x.split()])
The previous step leaves us with a list of words for each message. Since the expanded text will be tokenized in the next step, let’s join the words back into a string and examine the dataframe again.
sms["msg_str"] = [' '.join(map(str, words)) for words in sms['no_contract']]
sms.head()
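To see what contraction mapping does under the hood, here is a minimal dictionary-based sketch using only the standard library. The four-entry mapping is a hypothetical stand-in (the contractions library ships a much larger one and handles more cases), and the sketch only deals with straight apostrophes.

```python
import re

# Tiny hypothetical mapping; the contractions library covers far more cases.
CONTRACTION_MAP = {
    "i'll": "I will",
    "i've": "I have",
    "don't": "do not",
    "can't": "cannot",
}

# One alternation pattern that matches any known contraction as a whole word.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return pattern.sub(lambda match: CONTRACTION_MAP[match.group(0).lower()], text)


expanded = expand_contractions("I'll call you, but don't wait up.")
print(expanded)
```

Working on the whole string, as this sketch does, avoids the split-and-rejoin round trip, at the cost of maintaining the mapping ourselves.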
Step 2: Tokenization
As discussed above, to tokenize the document into words, let’s import the NLTK library, download the tokenizer data it needs, and apply the word_tokenize() method on each of the message strings.
import nltk
nltk.download('punkt')  # recent NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize

sms['tokenized'] = sms['msg_str'].apply(word_tokenize)
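Tokens don't have to be words. As a rough illustration of the different granularities mentioned earlier, the sketch below splits a made-up message into sentences, words, and characters using naive regular expressions. These patterns are simplifying assumptions and are much cruder than NLTK's tokenizers.

```python
import re

text = "Free entry in 2 a weekly comp. Text WIN to 80086 now!"  # made-up example message

# Naive sentence split: break after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word tokens: runs of word characters, or single non-space symbols.
words = re.findall(r"\w+|[^\w\s]", text)

# Character tokens of the first word.
characters = list(words[0])

print(sentences)
print(words)
print(characters)
```

A pattern like this will break on abbreviations such as "e.g." or on decimal numbers, which is one reason to prefer a trained tokenizer for real data.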
Step 3: Noise Cleaning - Removing special characters and spaces, and lowercasing text
Let’s convert all text to lower case and subsequently remove punctuation.
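import string

# Lowercase every token
sms['lower'] = sms['tokenized'].apply(lambda x: [word.lower() for word in x])

# Drop tokens that appear in the punctuation string
# (this is a substring test, so it mainly removes single-character punctuation tokens)
punc = string.punctuation
sms['no_punc'] = sms['lower'].apply(lambda x: [word for word in x if word not in punc])
sms.head()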
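Regular expressions are the other route to noise cleaning mentioned earlier. The sketch below works on a raw string rather than on tokens; the character class it keeps (lowercase letters, digits, whitespace) and the sample message are assumptions made for illustration.

```python
import re


def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("WINNER!! Claim your   prize @ http://example.com now :)")
print(cleaned)
```

Note that this blunt rule also breaks URLs and emoticons into pieces. For spam detection those may be useful signals, so it can be worth deciding which symbols really count as noise before removing them.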
Step 4: Spell Checking
Let’s go over a simple example of using pyspellchecker, a library for determining if a word is misspelled and, based on word frequency, what the likely correct spelling would be.
!pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))
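To build some intuition for how a frequency-based spell checker can pick a correction, here is a small standard-library sketch: generate every string within one or two edits of the input, keep the ones found in a word-frequency table, and return the most frequent. The frequency table is a made-up toy, and the code is a simplified illustration of the idea rather than pyspellchecker's actual implementation.

```python
import string
from collections import Counter

# Hypothetical toy word-frequency table; a real checker would load counts from a large corpus.
WORD_FREQ = Counter({"something": 120, "is": 900, "happening": 40, "here": 300, "happens": 25})


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that appear in the frequency table."""
    return {w for w in words if w in WORD_FREQ}


def correction(word):
    """Return the most frequent known word within two edits, or the word itself."""
    if word in WORD_FREQ:
        return word
    candidates = (
        known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}
    )
    return max(candidates, key=lambda w: WORD_FREQ[w])


for token in ["something", "is", "hapenning", "here"]:
    print(token, "->", correction(token))
```

Candidates one edit away are preferred over candidates two edits away, and ties within a distance are broken by frequency. That ordering is a design choice of this sketch.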
Step 5: Identifying and removing stopwords
To identify and remove stopwords, we need to import NLTK’s stopwords corpus, load the English stop word list, and then retain only those words in our text that are not in that list.
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

sms['stopwords_removed'] = sms['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
sms.head()
Step 6: POS tagging and Lemmatization
To apply lemmatization to our data, we have to apply parts of speech tags; in other words, determine the part of speech for each word.
nltk.download('averaged_perceptron_tagger')  # recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead

sms['pos_tags'] = sms['stopwords_removed'].apply(nltk.tag.pos_tag)
NLTK’s lemmatizer requires POS tags to be converted to wordnet’s format. We’ll write a function that makes the conversion.
nltk.download('wordnet')
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

sms['wordnet_pos'] = sms['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
sms.head()
We may now apply WordNetLemmatizer to the POS-tagged data. Here, we pass its lemmatize() method two arguments: the word and its tag in WordNet form.
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
sms['lemmatized'] = sms['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
sms.head()
Now that we’ve pre-processed our text data for use in further steps of the pipeline, let’s save the cleaned data in a CSV file.
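A minimal sketch of the save step might look like the following. It uses a two-row toy dataframe in place of the real one, and the clean_text column name and output path are hypothetical. Joining each token list into a single string first is a choice made here so that the CSV cells read back as plain text rather than as the string representation of a Python list.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the processed dataframe; column names follow the walkthrough above.
sms = pd.DataFrame({
    "label": ["ham", "spam"],
    "lemmatized": [["see", "later"], ["win", "free", "prize"]],
})

# Join each token list into one string so the CSV cell round-trips as plain text.
sms["clean_text"] = sms["lemmatized"].apply(" ".join)

# Hypothetical output location; in Colab this could be a path under /content.
output_path = os.path.join(tempfile.gettempdir(), "sms_cleaned.csv")
sms[["label", "clean_text"]].to_csv(output_path, index=False)

# Read the file back to confirm what was written.
reloaded = pd.read_csv(output_path)
print(reloaded)
```

Passing index=False keeps the dataframe's row index out of the file, so reading it back yields just the label and text columns.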