
Feature Engineering for NLP
Natural Language Processing


Last Updated on November 18, 2020 by Editorial Team

Author(s): Bala Priya C


Part 2 of the 6 part technical series on NLP

Photo by Alfons Morales on Unsplash

Part 1 of the series covered the introductory aspects of NLP: techniques for text pre-processing and basic EDA to understand certain details about the text corpus. This part covers linguistic aspects such as syntax, semantics, POS tagging, Named Entity Recognition (NER), and N-grams for language modeling.

Outline
- Understanding Syntax & Semantics
- Techniques to understand text
-- POS tagging
-- Understanding Entity Parsing
-- Named Entity Recognition (NER)
-- Understanding N-grams

One of the key challenges in NLP is the inherent complexity of processing natural language: understanding grammar and context (syntax and semantics), resolving ambiguity (disambiguation), co-reference resolution, and so on.

Understanding Syntax and Semantics

Syntactic and semantic analysis are fundamental techniques for understanding any natural language. Syntax refers to the set of rules governing the language's grammatical structure, while semantics refers to the meaning conveyed. Semantic analysis is thus the process of understanding the meaning and interpretation of words and sentence structure.

A sentence that is syntactically correct need not be semantically correct!

The following picture shows one such example, where a person responds to “Call me a cab!” with “OK, you’re a cab!”, clearly misinterpreting the intended meaning. This example shows how a syntactically correct sentence (“OK, you’re a cab!” is grammatically perfect!🙂) can fail to make sense.

Syntax vs. Semantics (Image Source)

Techniques to understand a text

POS Tagging

POS tagging, also called grammatical tagging or word category disambiguation, is the process of marking a word in the corpus as corresponding to a particular part of speech, based on both its definition and its context. The POS tags from the Penn Treebank project, which are widely used in practice, can be found at the link below.

Penn Treebank P.O.S. Tags

Here’s an example of a simple POS-tagged sentence, following the convention from the Penn Treebank project.

POS Tagging (Image Source)

POS tags: PRP – personal pronoun; VBZ – verb, 3rd person singular present; NNS – noun, plural; IN – preposition; DT – determiner; NN – noun, singular

Why is POS Tagging important?

  • POS tagging is important for word sense disambiguation
  • For example, consider the sentence “Time flies like an arrow”. As illustrated below, syntactic ambiguity gives rise to several possible interpretations, of which only one is semantically meaningful.
  • POS tagging is also important in applications such as machine translation and information retrieval.
Illustrating Syntactic Ambiguity (Image Source)

Challenges in POS tagging: due to the inherent ambiguity in language, a word's POS tag is not fixed.

The same word may take different tags in different sentences, depending on the context.

Examples:

  1. She saw a bear (bear-Noun); Your efforts will bear fruits (bear-Verb)
  2. Where is the trash can? (can-Noun) ; I can do better! (can-Modal verb)

There are several approaches to POS tagging, such as rule-based approaches and probabilistic (stochastic) tagging using Hidden Markov Models.
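To make the rule-based idea concrete, here is a toy tagger sketch: a tiny hand-written lexicon plus one context rule for the ambiguous words discussed above. The lexicon and rule are illustrative assumptions, not a real tagger; practical taggers are trained on annotated corpora such as the Penn Treebank.

```python
# Toy rule-based POS tagger: a tiny illustrative lexicon plus one context rule.
# Both are invented for this example; real taggers learn from annotated corpora.
LEXICON = {
    "she": "PRP", "i": "PRP", "a": "DT", "the": "DT", "an": "DT",
    "saw": "VBD", "is": "VBZ", "do": "VB", "will": "MD",
    "bear": "NN", "can": "NN", "trash": "NN", "fruits": "NNS", "efforts": "NNS",
    "where": "WRB", "better": "RBR", "your": "PRP$",
}

def tag(tokens):
    tags = []
    for tok in tokens:
        t = LEXICON.get(tok.lower(), "NN")  # default unknown words to noun
        # Context rule: after a pronoun, plural noun, or modal,
        # "can" acts as a modal and "bear" as a verb.
        if tags and tags[-1] in ("PRP", "NNS", "MD"):
            if tok.lower() == "can":
                t = "MD"   # "I can do better!"
            elif tok.lower() == "bear":
                t = "VB"   # "Your efforts will bear fruits"
        tags.append(t)
    return list(zip(tokens, tags))

print(tag("I can do better".split()))
# → [('I', 'PRP'), ('can', 'MD'), ('do', 'VB'), ('better', 'RBR')]
print(tag("Where is the trash can".split()))
# → "can" stays a noun here, since it follows the noun "trash"
```

The context rule is a crude stand-in for what an HMM tagger learns statistically: the probability of a tag given the preceding tag.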

Shallow Parsing/Chunking

Shallow parsing, or chunking, is the process of dividing a text into syntactically related, non-overlapping, contiguous groups of tokens: it segments and labels multi-token sequences. It is important in information extraction, where it helps create meaningful subcomponents from text.

Understanding steps in Entity Parsing (Image Source)

We start with raw text, clean it (text pre-processing), identify the part of speech of the words (POS tagging), identify entities within the text (Shallow parsing/chunking), and finally identify relationships between the entities. Hence, shallow parsing is important to effectively parse dependencies between entities. Here’s a simple illustration of how a chunked tree helps us understand dependencies between entities.

POS tagging and Chunked Tree (Image Source)
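To illustrate, here is a minimal noun-phrase chunker sketch over already-tagged tokens. The grammar (an optional determiner, then adjectives, then nouns) and the example sentence are simplifying assumptions; a real chunker, such as NLTK's grammar-based RegexpParser, supports much richer patterns.

```python
# Minimal NP chunker sketch: group determiner/adjective/noun runs into chunks.
# The NP grammar here is an illustrative assumption, not a complete one.
def np_chunk(tagged):
    chunks, current = [], []
    for word, pos in tagged:
        if pos in ("DT", "JJ") and not any(t.startswith("NN") for _, t in current):
            current.append((word, pos))      # determiner/adjective opens an NP
        elif pos.startswith("NN"):
            current.append((word, pos))      # nouns extend the NP
        else:
            if current and any(t.startswith("NN") for _, t in current):
                chunks.append(" ".join(w for w, _ in current))
            current = []                     # anything else closes the NP
    if current and any(t.startswith("NN") for _, t in current):
        chunks.append(" ".join(w for w, _ in current))
    return chunks

tagged = [("He", "PRP"), ("eats", "VBZ"), ("a", "DT"), ("tasty", "JJ"),
          ("fish", "NN"), ("in", "IN"), ("the", "DT"), ("morning", "NN")]
print(np_chunk(tagged))  # → ['a tasty fish', 'the morning']
```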

Named Entity Recognition(NER)

Named Entity Recognition (NER) is the process of automatically finding names of people, places, and organizations in text, across many languages. NER is used in information extraction to identify and extract named entities belonging to predefined classes. Named entities are often among the most informative and contextually relevant parts of a text.

The key steps involved are:

  1. Identifying the named entity
  2. Extracting the named entity
  3. Categorizing the named entity into tags such as PERSON, ORGANIZATION, LOCATION, etc.
Entity Extractor (Image Source)
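As a bare-bones illustration of these three steps, here is a toy gazetteer (dictionary-lookup) sketch. The entity lists are invented for the example; real NER systems use statistical or neural sequence models rather than fixed lookup tables.

```python
# Toy gazetteer-based NER: identify, extract, and categorize known names.
# The entity lists are illustrative assumptions, not a real knowledge base.
GAZETTEER = {
    "PERSON": {"Alan Turing", "Ada Lovelace"},
    "ORGANIZATION": {"Google", "NASA"},
    "LOCATION": {"Paris", "New York"},
}

def find_entities(text):
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            if name in text:                 # step 1+2: identify and extract
                found.append((name, label))  # step 3: categorize into a tag
    return sorted(found)

print(find_entities("Alan Turing visited NASA before moving to Paris."))
# → [('Alan Turing', 'PERSON'), ('NASA', 'ORGANIZATION'), ('Paris', 'LOCATION')]
```

A lookup approach cannot handle unseen names or ambiguous mentions ("Paris" the person vs. the city), which is exactly why trained sequence models are used in practice.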

Understanding N-Grams

An N-gram language model aims to find probability distributions over word sequences; an N-gram is simply a contiguous sequence of N words. Consider the simple example sentence, “This is Big Data AI Book,” whose unigrams, bigrams, and trigrams are shown below.

Illustrating N-grams (Image Source)
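Extracting N-grams takes only a few lines of plain Python. A minimal sketch using the example sentence above:

```python
# Slide a window of size n over the token list to collect N-grams.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is Big Data AI Book".split()
print(ngrams(tokens, 1))  # 6 unigrams, e.g. ('This',)
print(ngrams(tokens, 2))  # 5 bigrams, e.g. ('This', 'is')
print(ngrams(tokens, 3))  # 4 trigrams, e.g. ('This', 'is', 'Big')
```

In general, a sentence of length L yields L − N + 1 N-grams.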

Understanding the Math

  • P(w|h): Probability of word w, given some history h
  • Example: P(the| today the sky is so clear that)
  • w: the
  • h: today the sky is so clear that

Approach 1: Relative Frequency Count

Step 1: Take a text corpus.
Step 2: Count the number of times 'today the sky is so clear that' appears.
Step 3: Count the number of times it is followed by 'the'.

P(the | today the sky is so clear that) =
    Count(today the sky is so clear that the) / Count(today the sky is so clear that)

In essence, we ask: out of the N times we saw the history h, how many times did the word w follow it?
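The three steps above can be sketched directly in Python. The tiny two-sentence corpus is made up for illustration:

```python
# Relative-frequency estimate of P(w | h): scan the corpus, count how often
# the history h occurs, and how often it is immediately followed by w.
def relative_frequency(corpus_tokens, history, word):
    h = len(history)
    count_h = count_hw = 0
    for i in range(len(corpus_tokens) - h + 1):
        if corpus_tokens[i:i + h] == history:
            count_h += 1
            if i + h < len(corpus_tokens) and corpus_tokens[i + h] == word:
                count_hw += 1
    return count_hw / count_h if count_h else 0.0

corpus = ("today the sky is so clear that the birds sing . "
          "today the sky is so clear that we walked .").split()
h = "today the sky is so clear that".split()
print(relative_frequency(corpus, h, "the"))  # → 0.5 (h seen twice, 'the' follows once)
```

The full scan over the corpus for every estimate is exactly the cost that makes this approach impractical at scale, as noted below.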

Disadvantages of the approach:

  • When the text corpus is large, this approach has to traverse the entire corpus for each probability estimate.
  • It is therefore not scalable and is clearly suboptimal in performance.

Approach 2: Bigram Model

The bigram model approximates the probability of a word given all the previous words by using only the conditional probability of the preceding word. In the example we considered above, w_(n-1) = that.

Assuming Markov Model (Image Source)

The assumption that the probability of a word depends only on the preceding word (the Markov assumption) is quite strong. In general, an N-gram model assumes dependence on the preceding (N-1) words. In practice, N is a hyperparameter we can play around with to check which value optimizes model performance on the specific task, say sentiment analysis, text classification, etc.😊
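Under the Markov assumption, the bigram estimate reduces to a ratio of counts over adjacent word pairs: P(w_n | w_(n-1)) = Count(w_(n-1), w_n) / Count(w_(n-1)). A small sketch (the toy corpus is an illustrative assumption):

```python
from collections import Counter

# Maximum-likelihood bigram probability under the Markov assumption:
# P(word | prev) = count(prev, word) / count(prev).
def bigram_prob(corpus_tokens, prev, word):
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

corpus = "so clear that the sky is so clear that we smiled".split()
print(bigram_prob(corpus, "that", "the"))  # → 0.5 ('that' is followed by 'the' once, 'we' once)
print(bigram_prob(corpus, "the", "sky"))   # → 1.0 ('the' is always followed by 'sky')
```

Unlike the relative-frequency approach, the pair counts can be built once and reused, which is what makes the model practical on large corpora.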

Putting it all together, we’ve covered the differences between syntactic and semantic analysis, the importance of POS tagging, Named Entity Recognition (NER), and chunking in text analysis, and briefly looked at the concept of N-grams for language modeling.

References

Below is the link to the Google Colab notebook that explains the implementation of POS tagging, parsing, and Named Entity Recognition (NER) on the ‘All the News’ dataset from Kaggle, which contains 143,000 articles from 15 publications.


Feature Engineering for NLP was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI


