Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Unlock the full potential of AI with Building LLMs for Productionβ€”our 470+ page guide to mastering LLMs with practical projects and expert insights!

Publication

Predicting Genres from Movie Quotes
Latest

Predicting Genres from Movie Quotes

Last Updated on March 10, 2021 by Editorial Team

Author(s): Harry Roper

Natural Language Processing, WebΒ Scraping

Multi-label NLP Classification

Photo by Darya Kraplak onΒ Unsplash

Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against suchΒ actions.

β€œSome day, and that day may never come, I will call upon you to do a service for me. But until that day, consider this justice a gift on my daughter’s wedding day.β€β€Šβ€”β€ŠDon Vito Corleone, The Godfather (1972)

Anyone with even a mild interest in cinema would likely be able to identify the movie that spawned the above line, not least infer its genre. Such is the power of a goodΒ quote.

But does the majesty of cinematic dialogue also resonate in the ears of aΒ machine?

This article aims to employ the features of Natural Language Processing (NLP) to build a classification model to predict movies’ genres based on quotes from their dialogue.

The model produced will be an example of a multi-label classifier, in that each instance in the data set can be assigned a positive class for zero or more labels simultaneously. Note that this differs from a multi-class classifier, since the vector of possible classes is stillΒ binary.

Predicting movie genres based on synopses is a relatively common example within the area of multi-label NLP models. There appears, however, to be little to no work using movie quotes as input. The motivation behind this article was hence to explore whether text patterns could be uncovered in movies’ dialogue to act as indicators of theirΒ genres.

The process of construction will fall under three mainΒ stages:

  1. Compiling the dataΒ set
  2. Cleaning, processing and exploring the trainingΒ data
  3. Building and evaluating the classification model

Part I: Compiling the DataΒ Set

Despite the abundance of movie-related data sets available online, I couldn’t find one that specifically contained movie quotes. With this in mind, we will need to construct our own set of training data for the building of theΒ model.

Thankfully, there’s a host of sources around the internet in which one can find information on movies, with perhaps the most-widely used being IMDb. To create our data set, we can use BeautifulSoup (a Python library for web scraping) to retrieve information from the IMDbΒ website.

The site is structured such that a movie’s quotes are displayed in a subpage of its main page. A starting point for the scrape should therefore be a list of links to movie pages that can be iterated through to pull out some details of the movie (such as the title and genres), navigate to its quote page, and retrieve the quotes fromΒ there.

I’ve chosen to use the IMDb Top 250 as the list of pages to iterate through, since these movies are likely to have received an adequate amount of user traction and should therefore offer us plenty ofΒ quotes.

To retrieve the list of page links to iterate through from the top 250 page, we first need to use the Requests library to get the page’s HTML code. From there, we can use BeautifulSoup to parse the code and extract theΒ links.

I was also able to ascertain that the links for movies’ quote subpages were simply their page links with the query β€œtrivia?tab=qt” added to the URL. With this in mind, we can take the page link and quote link for each movie and store it in a pandas DataFrame:

def get_links():
r = requests.get('https://www.imdb.com/chart/top/?ref_=nv_mv_250')
bs = BeautifulSoup(r.text, 'html.parser')
elements = bs.findAll('td', class_='titleColumn')
links = []
quote_links = []
for element in elements:
link = 'https://www.imdb.com' + element.find('a').get('href')
quote_link = link + 'trivia?tab=qt'
links.append(link)
quote_links.append(quote_link)
links_df = pd.DataFrame({'link': links, 'quote_link': quote_links})
return links_df

Now that we have our list of movie links, we need to run two iterations: one to retrieve movies’ titles and genres from their main pages, and another to retrieve the quotes from their quotes subpages.

The first iteration will result in a DataFrame where each row represents a single movie, with a column for its link, quote link, title, andΒ genres:

def get_details(links):
titles = []
genres = []
for link in links['link'].tolist():
r = requests.get(link)
bs = BeautifulSoup(r.text, 'html.parser')
wrapper = bs.find('div', class_='title_wrapper')
title = wrapper.find('h1').contents[0]
title_clean = title.replace('\xa0', '')
subtext = bs.find('div', class_='subtext')
elements = subtext.findAll('a')
genre_list = []
for element in elements:
genre = element.getText()
genre_list.append(genre)
genre = ','.join(genre_list[:-1])
titles.append(title_clean)
genres.append(genre)
movies_df = pd.DataFrame({'link': links['link'], 'title': titles, 'genre': genres, 'quote_link': links['quote_link']})
return movies_df

The second iteration will return a DataFrame where each row represents a single quote, with the other details of the movie also indicated:

def get_quotes(movies):
quotes_df = pd.DataFrame(columns=['link', 'title', 'genre', 'quote'])
for i in range(len(movies)):
link = movies['link'][i]
title = movies['title'][i]
genre = movies['genre'][i]
quote_link = movies['quote_link'][i]
r = requests.get(quote_link)
bs = BeautifulSoup(r.text, 'html.parser')
elements = bs.findAll('div', class_='sodatext')
quotes = []
for element in elements:
quote_list = []
for p in element.findAll('p'):
quote_list.append(p.contents[-1][2:])
quote = ''.join(quote_list)
quotes.append(quote)
x = len(quotes)
movie_df = pd.DataFrame({'link': [link]*x, 'title': [title]*x, 'genre': [genre]*x, 'quote': quotes})
quotes_df = pd.concat([quotes_df, movie_df])
return quotes_df

We really only need the quote and genre columns to form our training data, however we’ll leave the link and title columns intact in case we want to match the data with other data sets in future projects.

Now that we’ve extracted the quotes and genres for each movie on the page, we can complete the ETL process by saving our final DataFrame to a SQLite database:

def save_data(quotes):
engine = create_engine('sqlite:///quotes.db')
quotes.to_sql('quotes', engine, index=False, if_exists='replace')

Readers interested in downloading the data set, or indeed running the whole web scraping pipeline, can do so from this repository in myΒ Github.

Part II: Exploring the TrainingΒ Data

Cleaning and Reformatting

Our web scraping process has left us with a data set comprising four columns: two of which are of interest in building the model. These are the text documents for each movie quote (which we will transform into the model’s features), and the genres of the movie from which each quote was taken (which will act as the target variable).

Since the genres are currently listed as strings separated by commas, we’ll need to rework the column before it can be passed into a machine learning algorithm.

To create a multi-label target variable, we’ll need to construct a column for each unique genre label that indicates whether or not each quote is assigned to the genre label in question (with 1 for yes and 0 forΒ no).

genres = df['genre'].tolist()
genres = ','.join(genres)
genres = genres.split(',')
genres = sorted(list(set(genres)))
for genre in genres:
df[genre] = df['genre'].apply(lambda x: 1 if genre in x else 0)

The set of binary genre columns will as a target variable matrix in which each movie can be assigned any number of 21 uniqueΒ labels:

len(genres)
>>> 21

Exploratory Analysis

Now that we’ve reworked the data into a suitable format, let’s begin some exploration to draw some insights before we construct the model. We can start by taking a look at the number of individual genre labels to which each movie is assigned:

Figure 1: Count of movies per number of genreΒ labels

Most movies in our data set are assigned two or three genre labels. When we consider that there are 21 possible labels in total, this highlights that we can expect our target variable matrix to contain far more negative classifications than positive.

This provides a valuable insight to take into the modelling stage, in that we can observe a significant class imbalance in the training data. To assess this imbalance numerically:

df[genres].mean().mean()
>>> 0.12034039375037578

The above indicates that only 12% of the data set’s labels belong to the positive class. This factor should be given particular attention when deciding upon a method of evaluating theΒ model.

Let’s also assess the number of positive instances we have for each genreΒ label:

Figure 2: Count of positive instances per genreΒ label

On top of the class imbalance identified previously, the chart above uncovers that the data also has a significant label imbalance, in that some genres (such as Drama) have many more positive instances on which to train the model than others (such asΒ Horror).

This is likely to have implications on the model’s success between different genres.

Discussing the Results of theΒ Analysis

The analysis above uncovers two key insights about our trainingΒ data:

  1. The class distribution is heavily imbalanced in favour of the negative.

In the context of this model, class imbalance is difficult to amend. A typical method for correcting class imbalance is synthetic oversampling: the creation of new instances of the minority class with feature values close to those of the genuine instances.

However, this method is generally unsuitable for a multi-label classification problem, since any credible synthetic instances would display the same issue. The class imbalance therefore reflects the reality of the situation, in that a movie is only assigned a small number of all possibleΒ genres.

We should keep this in mind when choosing the performance metric(s) with which to evaluate the model. If, for example, we judge the model based on accuracy (correct classifications as a proportion of total classifications), we could expect to achieve a score of c.88% simply by predicting every instance as a negative (considering that only 12% of training labels are positive).

Metrics such as precision (the proportion of actual positives that were classified correctly) and recall (the proportion of positive classifications made that were correct) are more suitable in thisΒ context.

2. The distribution of positive classes is imbalanced amongstΒ labels

If we’re to use the current data set to train the model, we must accept the fact that the model will likely be able to classify some genres more accurately than others, simply because of the increased availability ofΒ data.

The most effective way of dealing with this issue would probably to return to the data set compiling stage and choose a different section of the IMDb website with a view to obtaining data that was more evenly distributed across genres. This is something that could be considered when working on an improved version of theΒ model.

Part III: Building the Classification Model

Natural Language Processing (NLP)

At present, the data for our model’s features is still in the text format following the web scrape. To transform the data into a format suitable for machine learning, we’ll need to employ some NLP techniques.

The steps involved to turn a corpus of text documents into a numerical feature matrix will be asΒ follows:

  1. Clean the text to remove all punctuation and special characters
  2. Separate the individual words in each document intoΒ tokens
  3. Lemmatise the text (grouping inflected words together, such as replacing the words β€œlearning” and β€œlearnt” withΒ β€œlearn”)
  4. Remove whitespace from tokens and set them to lowerΒ case
  5. Remove all stop words (e.g. β€œthe”, β€œand”, β€œof” etc)
  6. Vectorise each document into wordΒ counts
  7. Perform a term frequency-inverse document frequency (TF-IDF) transformation on each document to smoothen counts based on the frequency of terms in theΒ corpus

We can write the text cleaning operations (steps 1–5) into a single function:

def tokenize(text):
text = re.sub('[^a-zA-Z0-9]', ' ', text)
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
clean_tokens = [lemmatizer.lemmatize(token).lower().strip() for token in tokens if token \
not in stopwords.words('english')]
return clean_tokens

that can then be passed as the tokeniser into scikit-learn’s CountVectorizer function (step 6), and finish the process with the TfidfTransformer function (stepΒ 7).

Implementing a Machine LearningΒ Pipeline

The feature variables need to undergo the NLP transformation before they can be passed into a classification algorithm. If we were to run the transformation on the entirety of the data set, it would technically cause data leakage, since the count vectorisation and TF-IDF transformation would be based on data from both the training and testingΒ set.

To combat this, we could split the data first and then run the transformations. However, this would mean completing the process once for the training data, again for the testing data, and a third time for any unseen data we wanted to classify, which would be somewhat cumbersome.

The most effective way to circumvent this issue is to include both the NLP transformations and classifier as steps in a single pipeline. With a decision tree classifier as the estimator, the pipeline for an initial baseline model would be asΒ follows:

pipeline = Pipeline([
('vect', CountVectorizer(tokenizer=tokenize)),
('tfidf', TfidfTransformer()),
('clf', MultiOutputClassifier(DecisionTreeClassifier()))
])

Note that we need to specify the estimator as a MultiOutputClassifier. This is to indicate that the model should return a prediction for each of the specified genre labels for each instance.

Evaluating the BaselineΒ Model

As discussed previously, the class imbalance in the training data needs to be considered when evaluating the performance of the model. To illustrate this point, let’s take a peek at the accuracy of the baselineΒ model.

As well as making considerations for class imbalance, we also need to adjust some of the evaluation metrics to cater for multi-label output since, unlike in single label classification, each predicted instance is no longer a hard right or wrong. For example, an instance for which the model classifies 20 of the 21 possible labels correctly should be considered more of a success than an instance for which none of the labels are classified correctly.

For readers interested in diving deeper into evaluation methods of multi-label classification models, I can recommend A Unified View of Multi-Label Performance Measures (Wu & Zhou,Β 2017).

One accepted measure of accuracy in multi-label classification is Hamming loss: the fraction of the total number of predicted labels that are misclassified. Subtracting the Hamming loss from one gives us an accuracyΒ score:

1 - hamming_loss(y_test, y_pred)
>>> 0.8859041290934979

An 88.6% accuracy score initially seems like a great result. However, before we pack up and consider the project a success, we need to consider that the class imbalance discussed previously likely means that this score is overly generous.

Let’s compare the Hamming loss to the model’s precision and recall. To return the average scores across labels weighted on each label’s number of positive classes, we can pass average='weighted' as arguments into the functions:

precision_score(y_test, y_pred, average='weighted')
>>> 0.516222795662189
recall_score(y_test, y_pred, average='weighted')
>>> 0.47363588667366213

The far more conservative measures for precision and recall likely paint a truer picture of the model’s capabilities, and indicate that the generosity of the accuracy measure was due to the abundance of true negatives.

With this in mind, we’ll use the F1 score (the harmonic mean between precision and recall) as the principal metric when evaluating theΒ model:

f1_score(y_test, y_pred, average='weighted')
>>> 0.4925448458613438

Comparing Performance AcrossΒ Labels

When exploring the training data, we hypothesised that the model would perform more effectively for some genres than others due to the imbalance in the distribution of positive classes across labels. Let’s ascertain whether this is the case by finding the F1 score for each genre label and plotting it against the total number of training quotes for thatΒ genre.

Figure 3: Relationship between number of training quotes and baseline F1Β score

Here we can observe a strong correlation (a Pearson’s coefficient of 0.89) between a label’s F1 score and its total number of training quotes, confirming our suspicions. As mentioned previously, the best way around this would be to collect a more balanced data set when building a second version of theΒ model.

Improving the Model: Selecting a Classification Algorithm

Let’s test out some other classification algorithms to see which produces the best results on the training data. To do this, we can loop through a list of the models equipped to deal with multi-label classification and print the weighted average F1 score for eachΒ one.

Before running the loop, let’s add an additional step to the pipeline: singular value decomposition (TruncatedSVD). This is a form of dimensionality reduction, which identifies the most meaningful properties of the feature matrix and removes what’s left over. It’s similar to principal component analysis (PCA), but can be used on sparse matrices.

I actually found that adding this step slightly hampered the model’s score. However, it vastly reduced the computational runtime, so I’d consider it a worthwhile trade-off.

We should also switch from evaluating the model over a single training and testing split to using the average score from a five-fold cross validation, since this will give a more robust measure of performance.

tree = DecisionTreeClassifier()
forest = RandomForestClassifier()
knn = KNeighborsClassifier()
log = LogisticRegression()
svc = SVC()
models = [tree, forest, knn, log, svc]
model_names = ['tree', 'forest', 'knn', 'log', 'svc']
scores = []
for model in models:
pipeline = Pipeline([
('vect', CountVectorizer(tokenizer=tokenize)),
('tfidf', TfidfTransformer()),
('svd', TruncatedSVD()),
('clf', MultiOutputClassifier(model))
])
cv_scores = cross_val_score(pipeline, X, y, scoring='f1_weighted', cv=5, n_jobs=-1)
score = round(np.mean(cv_scores), 4)
scores.append(score)
print(model_compare)
>>> model score
>>> 0 tree 0.3112
>>> 1 forest 0.2626
>>> 2 knn 0.2677
>>> 3 log 0.2183
>>> 4 svc 0.2175

Rather surprisingly, the decision tree used in the baseline model actually produced the best score of all the models tested. We’ll keep this as our estimator as we move onto the hyper-parameter tuning.

Improving the Model: Tuning Hyper-Parameters

As a final step in building the best model, we can run a cross validation grid search to find the most effective values for the parameters.

Since we’re using a pipeline to fit the model, we can define parameter values to test not only for the estimator, but also the NLP stages, such as the vectoriser.

pipeline = Pipeline([
('vect', CountVectorizer(tokenizer=tokenize)),
('tfidf', TfidfTransformer()),
('svd', TruncatedSVD()),
('clf', MultiOutputClassifier(DecisionTreeClassifier()))
])
parameters = {
'vect__ngram_range': [(1, 1), (1, 2)],
'vect__max_df': [0.75, 1.0],
'clf__estimator__criterion': ['gini', 'entropy'],
'clf__estimator__max_depth': [250, 500],
'clf__estimator__min_samples_split': [2, 6]
}
cv = GridSearchCV(pipeline, param_grid=parameters, scoring='f1_weighted', cv=5)
cv.fit(X, y)

Once the grid search is complete, we can view the parameters and score for our final, tunedΒ model:

print(cv.best_params_)
>>> {'clf__estimator__criterion': 'gini', 'clf__estimator__max_depth': 500, 'clf__estimator__min_samples_split': 2, 'vect__max_df': 1.0, 'vect__ngram_range': (1, 2)}
print(cv.best_score_)
>>> 0.3140845572765783

The hyper-parameter tuning has allowed us to very marginally improve the model’s performance byΒ .2 of a percentage point, giving a final F1 score of 31.4%. This means that we can expect the model to classify just under a third of the true positives correctly.

Closing Remarks

In summary, we were able to build a model that attempts to predict a movie’s genres from its quotesΒ by:

  1. Scraping the raw data from the IMDb website to create a trainingΒ set
  2. Employing NLP techniques to transform the text data into a matrix of feature variables
  3. Building a baseline classifier using a machine learning pipeline, and improving the model by evaluating performance metrics suitable in a multi-label classification context with a significant class imbalance

The final model can be used to generate predictions for new quotes. The following example uses a quote from Carnival of SoulsΒ (1962):

def predict_genres(text):
pred = pd.DataFrame(cv.predict([text]), columns=genres)
pred = pred.transpose().reset_index()
pred.columns = ['genre', 'prediction']
predictions = pred[pred['prediction']==1]['genre'].tolist()
return predictions
quote = "It's funny... the world is so different in the daylight. In the dark, your fantasies get so out of hand. But in the daylight everything falls back into place again."
predict_genres(quote)
>>> ['Action', 'Crime', 'Drama']

So what’s the final verdict? Can we propose to IMDb to adopt our model as a means of automating their genre categorisation? At this stage, probably not. However, the model created in this article should be a good enough starting point, with opportunities for making improvements in future versions by, for example, compiling a larger data set that’s more balanced across the different genres.

As mentioned previously, readers interested in downloading the data set, running the web scraping ETL pipeline, or checking out the code written to build the model can do so in this repository of my Github. Feedback, questions, and suggestions on improving the model are alwaysΒ welcome.


Predicting Genres from Movie Quotes was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Feedback ↓