Leveling Up Your Travel Agent Skills Through NLP (Part I)
Last Updated on January 7, 2023 by Editorial Team
Author(s): Navish
Natural Language Processing
A Beginner’s Guide to NLP Preprocessing & Topic Modelling with a Use Case
Have you seen the movie Arrival? It stars Amy Adams as a linguist who is chosen (by the US Govt.) to help decode the language of aliens that have “arrived” on Earth. Amongst many other potent themes that the movie takes us through, I absolutely love how it depicts language as one of the fundamental underpinnings that shape a civilization.
Why am I bringing up this movie? Because when I took my first stab at unsupervised machine learning via Natural Language Processing, I was quite frustrated with the lack of direction you feel with it. During this period, I happened to re-watch this movie (one of my all-time favorites), which created a highly positive association in my mind with my current Data Science project.
Q: Umm… What Are You Talking About?
Let’s backtrack a bit. In Fall 2020, I was taking a project-oriented Data Science bootcamp. So far, we had been dealing with supervised learning, which has specific targets and directions to help you see if you are headed somewhere sensible because, you know, it’s “supervised”. We had now ventured into the land of unsupervised learning, with a focus on Natural Language Processing.
As part of the project for learning this module, I decided to create a travel preference-based recommendation system, using NLP Topic Modeling. Specifically, I focused on a single destination, Yosemite National Park, and used the reviews that users left for the park on Trip Advisor.
Q: Why Does This Project Matter?
If you’re a travel aficionado like me, you will completely relate here: figuring out HOW you should spend your time at your travel destination is a laborious undertaking! It’s very easy to get a huge list of “Must-see things” for any place you are planning to visit. But many additional hours of research are required to understand which must-see points fit in with your personal preferences.
Do you want to be able to take “panoramic photographs”? Or is “shuttle bus” accessibility important to you? Perhaps you would like to couple these with an “easy hike”?
My aim for this project is to give travellers a rank-ordered list of “Must-see Places,” based on the sort of soft preferences mentioned above.
Broadly speaking, the project consists of 4 parts: acquiring data, NLP preprocessing, topic modeling, and making a recommendation system. This article focuses primarily on the middle two but mentions all 4.
Q: Okay, What Data Are You Using for this Data Science Project?
The screenshot above sums it up in a nutshell. I acquired 10,000+ reviews by scraping Trip Advisor’s page for Yosemite National Park. They have a list of “Top Attractions” under “Things to Do”. Each attraction has reviews posted against it, which is what I scraped from their site. Web scraping is a laborious process to get going correctly. But this isn’t an article on learning how to scrape, so I am going to move on from here. If you are curious, my Jupyter notebook and functions module show how I went about it.
Q: Do We Really Need to Preprocess?
Yes! But keep in mind: NLP preprocessing is like staring into an abyss.
Okay, honestly, it’s an iterative and not-so-complex process. But it sure did not feel like that when I started.
So, you know how language contains text data, right? Computers aren’t fans of that. They want numbers. To help computers understand and make sense of the text, it somehow needs to be converted into a numerical representation.
The simplest way to do that is to take every word individually in your text data and make a column for it in a giant table. If just one word is used per column, it’s known as a unigram token. But you can also take each pair of words that occur together and make columns out of those. These are known as bigram tokens. We can tokenize in umpteen ways: 3+ words that occur together, sentences, etc. Then, each Trip Advisor review (known as a “document” in NLP) makes up a row. The resulting table is known as a Bag of Words.
For values in the table, we can fill in the frequency of occurrence of each token (column) in each document (row). If the token does not occur in that review, you fill in a zero.
This is the simplest text-to-numbers conversion process that exists and is known as a Count Vector.
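To make the table concrete, here is a minimal pure-Python sketch of a count vector. The two reviews are made up for illustration (they are not from the scraped data), and the helper names `tokenize` and `count_vectors` are my own, not part of any library.

```python
from collections import Counter


def tokenize(text, n=1):
    """Split on whitespace and return n-gram tokens (n=1 unigrams, n=2 bigrams)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_vectors(documents, n=1):
    """Build a bag-of-words table: one row per document, one column per token."""
    counts = [Counter(tokenize(doc, n)) for doc in documents]
    vocabulary = sorted(set().union(*counts))  # every unique token becomes a column
    # A Counter returns 0 for missing tokens, which fills in the zeros for us.
    table = [[doc_counts[token] for token in vocabulary] for doc_counts in counts]
    return vocabulary, table


# Made-up reviews, for illustration only.
reviews = [
    "easy hike with great views",
    "great views great photos",
]

vocabulary, table = count_vectors(reviews)
print(vocabulary)  # ['easy', 'great', 'hike', 'photos', 'views', 'with']
for row in table:
    print(row)     # [1, 1, 1, 0, 1, 1] then [0, 2, 0, 1, 1, 0]

# The same function with n=2 produces bigram columns such as 'great views'.
bigram_vocabulary, bigram_table = count_vectors(reviews, n=2)
print(bigram_vocabulary)
```

In practice you would likely reach for a library implementation such as scikit-learn's `CountVectorizer`, which also supports n-grams, rather than rolling your own.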
You can probably start seeing some issues here. If every single unique token has a column, we can quickly get a LOT of columns. Plus, different tenses of a word and punctuation can proliferate this even more. For example: “running” vs “run” vs “ran”. A computer will treat these as three distinct words. In fact, it will even treat “run.” differently from “run!” because of the punctuation at the end. On top of that, certain words add no meaning towards understanding the overall text itself, like “and”, “we”, “the”, etc.
Hence, we preprocess the text to reduce the number of possible columns (aka dimensions).
Q: So, What Are the Preprocessing Steps?
This is where it gets tricky: which preprocessing steps make sense and which do not? There sure is a huge variety to choose from!
In my case, I did three things broadly:
Noise Removal using Regex and String functions to convert all characters to lowercase, remove emails/website links, separate words joined with punctuation (e.g., difficult/strenuous to difficult strenuous), and remove every character other than letters and whitespace (e.g., digits, punctuation, emojis).
Word Lemmatization is about converting words to their base form, e.g., “running” and “runs” to “run.” This was done using spaCy. spaCy lemmatization (in the 2.x releases) also changes all pronouns to “-PRON-”, which I subsequently filtered out.
Stop Word Removal involves getting rid of common words that do not add much meaning but are necessary for grammatical structure in the text. I used a starter list of stop words from NLTK but also curated a custom list of additional words.
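The three steps above can be sketched roughly as follows. This is not the code from my notebook: the regular expressions are one plausible way to do the noise removal, and `TOY_LEMMAS` and `TOY_STOP_WORDS` are tiny hand-written stand-ins (assumptions for illustration) for what spaCy's lemmatizer and the NLTK stop word list would provide.

```python
import re

# Hypothetical stand-ins so the sketch runs without extra downloads.
# The real project used spaCy for lemmas and NLTK plus a custom list for stop words.
TOY_LEMMAS = {"running": "run", "runs": "run", "ran": "run", "views": "view"}
TOY_STOP_WORDS = {"and", "we", "the", "a", "to", "at", "was", "for"}


def remove_noise(text):
    """Lowercase, drop emails and links, and keep only letters and whitespace."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)               # email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # website links
    # Replacing (not deleting) other characters separates words joined by
    # punctuation, e.g. "difficult/strenuous" becomes "difficult strenuous".
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()           # collapse extra whitespace


def preprocess(text):
    """Noise removal, then lemmatization, then stop word removal."""
    tokens = remove_noise(text).split()
    tokens = [TOY_LEMMAS.get(token, token) for token in tokens]
    return [token for token in tokens if token not in TOY_STOP_WORDS]


print(preprocess("We ran the difficult/strenuous trail!!! Details: www.example.com"))
# ['run', 'difficult', 'strenuous', 'trail', 'details']
```

Note that after noise removal, “run.” and “run!” both collapse to the same token, which is exactly the column reduction we are after.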
The preprocessing must be based on your corpus (all documents collectively are known as the corpus in NLP).
The corpus here was made up of user-written reviews. Hence, given more time, I could have taken additional preprocessing steps, such as:
Spelling corrections: These reviews are written by everyday users, possibly in a hurry. Hence, it would be prudent to check for spelling errors and correct them.
Word Normalization: Again, since this is text written on an internet forum by everyday users, some words might be represented in short form lingo, like 2mrw vs tomorrow.
Compound Term Extraction for Names: A unigram word tokenizer splits sentences into words separated by spaces. So, “Yosemite National Park” will be treated as “Yosemite”, “National”, and “Park”. It's better to link all names in the following manner “Yosemite_National_Park” so that they do not get split.
(POS) Filtering: One of the other common & potent forms of preprocessing is filtering words based on specific parts of speech (POS). Meaning, do I want to get rid of all proper nouns, verbs, adjectives, etc.? POS filtering is dictated more by the overall objectives. My objective was to apply topic modeling (more on that in a minute) to capture the reasons for recommending or not recommending a given attraction and then use these topics to build a recommendation system.
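Two of the steps listed above, word normalization and compound term extraction, can be sketched with simple lookup tables. I did not implement these in the project, so treat this as a rough idea only: `SHORT_FORMS` and `PLACE_NAMES` are hypothetical tables that you would have to curate from your own corpus.

```python
import re

# Hypothetical lookup tables, for illustration only.
SHORT_FORMS = {"2mrw": "tomorrow"}
PLACE_NAMES = ["Yosemite National Park"]


def normalize_short_forms(text):
    """Replace short-form lingo with the full word, one token at a time."""
    return " ".join(SHORT_FORMS.get(word.lower(), word) for word in text.split())


def link_compound_terms(text):
    """Join multi-word names with underscores so a tokenizer keeps them whole."""
    for name in PLACE_NAMES:
        pattern = re.compile(re.escape(name), flags=re.IGNORECASE)
        text = pattern.sub(name.replace(" ", "_"), text)
    return text


review = "Visiting Yosemite National Park 2mrw"
print(link_compound_terms(normalize_short_forms(review)))
# Visiting Yosemite_National_Park tomorrow
```

Ordering would need some care here: a letters-only noise removal step like the one described earlier would strip both the digit in “2mrw” and the underscores in the linked name, so these steps would have to run first and the cleanup would need to allow underscores.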
<spaCy picture of POS on a travel review>
Based on the above image, it was important for me to keep descriptive text because that is what I wanted to capture for my topics and recommendations. Hence, I did not filter out verbs, adjectives, or common nouns. But I did make a custom list of stop words to remove specific descriptive words from my corpus and also got rid of names of places.
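As a small illustration of what POS filtering looks like, here is a sketch that works on (token, tag) pairs. The sentence and its tags are written by hand for the example. In practice a tagger such as spaCy would supply them (spaCy exposes the coarse tag as `token.pos_`), and the tag names below follow the Universal POS tag set that spaCy uses.

```python
# Hand-tagged example review, for illustration only.
tagged_review = [
    ("Glacier", "PROPN"),
    ("Point", "PROPN"),
    ("offers", "VERB"),
    ("stunning", "ADJ"),
    ("views", "NOUN"),
    ("and", "CCONJ"),
    ("an", "DET"),
    ("easy", "ADJ"),
    ("hike", "NOUN"),
]

# Keep the descriptive parts of speech: verbs, adjectives, and common nouns.
KEEP_POS = frozenset({"VERB", "ADJ", "NOUN"})


def filter_by_pos(tagged_tokens, keep=KEEP_POS):
    """Return only the tokens whose POS tag is in the keep set."""
    return [token for token, pos in tagged_tokens if pos in keep]


print(filter_by_pos(tagged_review))
# ['offers', 'stunning', 'views', 'easy', 'hike']
```

Dropping the proper nouns here has the same effect as my getting rid of place names, although in the project I handled that through custom lists rather than a POS filter.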
I arrived at the above decisions through an iterative cycle of preprocessing and topic modeling. In my opinion, it is very hard to get all the preprocessing right on the first attempt. You need to topic model, look at the results, come back to preprocessing, and repeat.
All of this can be quite overwhelming when you start off with NLP, and each use case can be quite different, with no definite indication that you are headed in the right direction. It takes time, effort, and intuition to get there.
You’re with me this far? Awesome!
In Part II of this article, we will take a look at what I mean by this iterative process, talk more about topic modeling itself, and see how I made the recommendation system. Stay tuned!
In the meantime, feel free to check out my GitHub repo for this project. You can reach me on LinkedIn for any discussions.
Leveling Up Your Travel Agent Skills Through NLP (Part I) was originally published in Towards AI on Medium.