Demystifying AI for everyone: Part 1 - NLP Basics
Last Updated on July 17, 2023 by Editorial Team
Author(s): Himanshu Joshi
Originally published on Towards AI.
In the age of ChatGPT, let's start with the basics
Over the years, we humans have devised ways to communicate effectively with each other. One of these ways, and the most widely used, is speech. We speak with each other in various languages. Ex: English, German, French, Hindi, etc.
Natural Language Processing (NLP) is the part of Artificial Intelligence (AI) that helps computers understand and process human language.
Just as we use languages to understand each other, we use NLP to build language models so that machines can understand human language. Ex:- GPT-3 is the third generation of OpenAI's Generative Pre-trained Transformer language models.
But hey, why do we even care about learning NLP??
That's because, knowingly or unknowingly, we all use NLP in our day-to-day lives.
Have you ever wondered how we get those auto-correction suggestions while typing messages, or how Google Lens reads the words written in an image?
NLP plays a part in both. So let's see a few use cases.
Natural Language Processing (NLP) use cases:
Sentiment Analysis: This is the process of understanding the sentiment of the person speaking/writing.
Ex:- Analysis of tweets/reviews of customers to understand what they feel about a company's products.
Document Summarization: This is used to summarize huge blocks of text.
Ex:- Book summary or summary of customer feedback, etc.
Language Translation: Translate text from one language to another.
Ex:- English to Japanese or vice versa.
Speech-to-text & Text-to-speech: These are used to transcribe audio into text, or to turn text into audio. The transcribed text can then be fed to computers for further processing.
Ex:- Amazon Alexa
There are many other use cases; I hope these few give you the gist.
So in this article, let's touch upon how machines understand text data:-
Computers understand only binary information, 1s and 0s. In short, numerical information.
Hence, we need to first convert text data to numerical format so that we can feed it into various NLP machine learning models for the above-mentioned use cases.
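To make the "text is numbers underneath" idea concrete, here is a minimal sketch in Python. It only shows how characters are stored; the variable names are mine and purely illustrative.

```python
# Every character is already stored as a number; the bits are that number in base 2.
text = "NLP"

code_points = [ord(ch) for ch in text]            # Unicode code point of each character
bits = [format(cp, "08b") for cp in code_points]  # the same numbers written as 8 binary digits

print(code_points)  # [78, 76, 80]
print(bits)         # ['01001110', '01001100', '01010000']
```

These codes only identify which characters were typed. They say nothing about what a word means, which is why NLP needs the extra conversion steps described in the rest of this article.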
But even before we convert text to numbers, we need to clean the text data and structure it in the proper format.
Following are the steps that are generally used in the text preprocessing pipeline (some steps can be omitted based on the context of the problem):-
- Remove white spaces (extra spaces in the text, these are present due to formatting issues)
- Remove punctuation
- Remove numbers
- Remove stop words (common words which won't give much information as they are present in all documents Ex:- a, an, of, the, etc.)
- Remove symbols (Ex:- @, <, $, %, etc.)
- Lowercase all words
- Perform stemming/lemmatization on all words (Ex:- Runs, Running, Run all become run)
As I mentioned earlier, this is just an example of a standard preprocessing pipeline; it should be customized on a project-by-project basis.
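The steps above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the stop-word list is a tiny made-up one, and `toy_stem` is a hypothetical, very crude stand-in for a real stemmer or lemmatizer.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real projects use a much longer one.
STOP_WORDS = {"a", "an", "of", "the", "and", "are", "is", "in", "to"}

def toy_stem(word: str) -> str:
    """Very naive suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ning", "ing", "s"):
        # Only strip the suffix if at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    text = text.lower()                                        # lowercase all words
    text = "".join(" " if ch.isdigit() else ch for ch in text)  # remove numbers
    # Replace every punctuation mark and symbol (@, <, $, % ...) with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    words = text.split()                                       # split() also drops extra white space
    words = [w for w in words if w not in STOP_WORDS]          # remove stop words
    return [toy_stem(w) for w in words]                        # crude stemming

print(preprocess("The  dog RUNS and the 2 cats are running @ home!!"))
# ['dog', 'run', 'cat', 'run', 'home']
```

Note that the order differs slightly from the list: lowercasing comes first here so that "The" matches the stop word "the". That is one of the choices you would adjust per project.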
Post this, we need to tokenise the documents. Tokenisation is the process of breaking up text documents into smaller chunks, typically individual words (called tokens).
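As a rough sketch of what tokenisation looks like, here are two simple approaches using only Python's standard library. The example sentence is made up, and the regular expression is just one reasonable choice, not a standard.

```python
import re

sentence = "Computers can't read text, so we turn it into numbers."

# Simplest approach: split on white space (punctuation stays attached to words).
whitespace_tokens = sentence.split()

# Slightly better: keep runs of letters, digits and apostrophes as tokens.
regex_tokens = re.findall(r"[A-Za-z0-9']+", sentence)

print(whitespace_tokens)
# ['Computers', "can't", 'read', 'text,', 'so', 'we', 'turn', 'it', 'into', 'numbers.']
print(regex_tokens)
# ['Computers', "can't", 'read', 'text', 'so', 'we', 'turn', 'it', 'into', 'numbers']
```

Notice how the first version leaves "text," and "numbers." with punctuation attached, which is one reason cleaning usually happens before or during tokenisation.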
So now our input data would look something like this: every word becomes one column, and every document (sentence) is a row.
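Here is a small sketch of such a table built by hand. The three documents are made up, and filling each cell with the number of times the word occurs is my assumption about one common way to do it.

```python
docs = [
    "dog chases cat",
    "cat chases mouse",
    "dog likes dog",
]

# Tokenise each document (already clean, so a plain split is enough here).
tokenised = [doc.split() for doc in docs]

# Columns: every distinct word across all documents, in alphabetical order.
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# Rows: one per document; each cell counts how often that word occurs in it.
table = [[tokens.count(word) for word in vocabulary] for tokens in tokenised]

print(vocabulary)  # ['cat', 'chases', 'dog', 'likes', 'mouse']
for row in table:
    print(row)
# [1, 1, 1, 0, 0]
# [1, 1, 0, 0, 1]
# [0, 0, 2, 1, 0]
```

Each row is now a list of numbers describing one document, which is the kind of input a machine learning model can work with.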
This input is then used for vectorization.
Vectorization is nothing but converting words into vectors (lists of numbers) so that computers can work with them.
And voilà, you have understood the basics or, I might say, the core of NLP.
There are many Vectorization techniques:-
- Bag of Words (BOW)
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Word Embeddings
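As a small taste of the second technique, here is a rough sketch of one common TF-IDF variant: term frequency (how often a word occurs in a document) multiplied by the log of the inverse document frequency (how rare the word is across documents). The function name `tf_idf` and the toy documents are mine, and libraries often use slightly different formulas.

```python
import math

docs = [
    ["dog", "chases", "cat"],
    ["cat", "chases", "mouse"],
    ["dog", "likes", "dog"],
]

def tf_idf(word, doc, docs):
    """One simple TF-IDF variant: (count / doc length) * log(N / docs containing word)."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)  # number of documents containing the word
    if df == 0:
        return 0.0                          # word never seen: no weight
    return tf * math.log(len(docs) / df)

# "mouse" appears in one document, "cat" in two, so "mouse" gets the higher weight.
print(round(tf_idf("mouse", docs[1], docs), 3))  # 0.366
print(round(tf_idf("cat", docs[1], docs), 3))    # 0.135
```

The idea is that a word found in almost every document tells us little about any one of them, so it gets a weight close to zero.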
This topic needs a whole article of its own, so I will cover it in the next one.
Hope you enjoyed this post; I have tried to explain it in a very simple manner.
Most of the above-mentioned steps are taken care of by libraries such as NLTK, spaCy, and scikit-learn, so you rarely need to code them from scratch.
I remember when I first started learning NLP, I had a fear of everything. But when I actually started taking an interest, it was very easy.
Just try to keep learning and take small steps towards NLP. I promise nothing is difficult if you are willing to apply yourself.
All the best in your journey. Onwards and upwards, people…
Published via Towards AI