Turning YouTube Comments into Expert Movie Critiques with Python and AI: A Step-by-Step Guide
Author(s): Edoardo De Nigris
Originally published on Towards AI.
Imagine a new kind of movie review, not written by a single critic but emerging from the collective voice of countless YouTube comments. Such a review would be a mosaic of diverse viewpoints and interpretations drawn from a broad audience, capturing sentiments and insights that a single critic’s review might miss. It’s about adding depth and breadth to film criticism, making it more inclusive and more reflective of a wider audience’s reactions and feelings.
This article aims to demonstrate how generative AI models can provide a fresh lens for aggregating and summarizing the collective voices on a single topic, like a movie.
In this project, we code in Python, using the OpenAI API and the YouTube API to categorize and synthesize the vast array of opinions found in YouTube comments on popular movies. The end product? A cohesive, AI-generated film critique that is both comprehensive and multifaceted.
Enough talk; let’s jump into the actual steps we have to take to obtain our AI-generated, “collective-voiced” movie review:
1. Choose a movie:
Decide on the movie you want to generate a critique for. I recently watched “The Boy and the Heron” by the animation master Hayao Miyazaki at the cinema, and I really liked it. Let’s see what YouTube users think about it.
2. Downloading YouTube Comments via Python API:
The project starts by extracting comments from YouTube videos related to this specific movie. We can do this through the YouTube Data API, which gives us access to user comments from Python. The data collected at this stage consists of unstructured text, providing a diverse range of viewer opinions and reactions.
If you have never used the YouTube API, don’t worry too much. It is pretty straightforward, and you can find the full documentation in the official YouTube Data API reference.
First of all, we want to download the comments on videos discussing a specific movie. We will limit ourselves to downloading all the comments of the top 10 videos returned for this movie.
Each YouTube video is identified by a unique ID; for example, the video https://www.youtube.com/watch?v=t5khm-VjEu4 has the unique ID t5khm-VjEu4.
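As a side note, if you are starting from watch URLs rather than IDs, the ID is just the v query parameter. Here is a minimal sketch that extracts it with the standard library (extract_video_id is a hypothetical helper, not used in the rest of the article):
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    # Pull the 'v' query parameter out of a standard YouTube watch URL
    query = parse_qs(urlparse(url).query)
    return query["v"][0]

print(extract_video_id("https://www.youtube.com/watch?v=t5khm-VjEu4"))  # -> t5khm-VjEu4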
First, we need a function to download the IDs of the top videos about this movie.
Example: we want to extract the IDs of the top 10 videos, in the US and in English, about the movie “The boy and the heron”.
topic = "The boy and the heron"
api_yt = "insert your API key"
# getting video IDs for the selected topic
topic_ids = get_IDs_by_Topic(topic, 10, "US", "en", api_yt)
where the function get_IDs_by_Topic is defined as follows:
from googleapiclient.discovery import build

def get_IDs_by_Topic(topic, maxResults, regionCode, relevanceLanguage, api):
    youtube = build('youtube', 'v3', developerKey=api)
    videos_ids = youtube.search().list(
        part="id",
        type='video',
        regionCode=regionCode,
        order="relevance",
        q=topic,
        maxResults=maxResults,
        relevanceLanguage=relevanceLanguage,
        fields="items(id(videoId))"
    ).execute()
    list_of_topic_id = []
    for item in videos_ids["items"]:
        id_video = item['id']['videoId']
        list_of_topic_id.append(id_video)
    return list_of_topic_id
Note: the q parameter defines the search term. The API returns results that match the specified query in the title, description, or keywords of videos, playlists, or channels, and the order="relevance" parameter sorts the results by relevance to that query.
The above function returns the 10 YouTube video IDs we are looking for:
The boy and the heron
t5khm-VjEu4
F99-lNqVc-U
JBKXgjo_rFw
f7EDFdA10pg
_K-Gtld4LlM
glDJxM8fEgo
23H3Ea1HtvE
UIabnyxTVpc
UhN1hwMeKVw
izD8KOA2YZI
Now that we have the 10 video IDs, we can download all the comments on these 10 videos:
# Get comments and likes
comments_and_likes = video_comments(topic_ids,api_yt)
The function is defined as follows:
def video_comments(list_of_video_id, api):
    """This function returns a dictionary where the key is the comment and the value is the number of likes of that comment,
    for all comments of a selected list of video ids.
    params:
        list_of_video_id: this is the list of video ids. Type: list
        api: YouTube API key
    """
    # dictionary: "comment" : "likes"
    comment_likes_dict = {}
    for video_id in list_of_video_id:
        try:
            print(video_id)
            # empty list for storing replies
            replies = []
            # creating youtube resource object
            youtube = build('youtube', 'v3', developerKey=api)
            # retrieve youtube video results
            video_response = youtube.commentThreads().list(
                part='snippet,replies',
                videoId=video_id
            ).execute()
            # iterate video response
            while video_response:
                # extracting required info from each result object
                for item in video_response['items']:
                    # extracting comments
                    comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
                    # counting comment likes
                    likecount = item['snippet']['topLevelComment']['snippet']['likeCount']
                    # save in a dictionary
                    comment_likes_dict[comment] = likecount
                    # counting number of replies to the comment
                    replycount = item['snippet']['totalReplyCount']
                    # if there are replies
                    if replycount > 0:
                        # iterate through all replies
                        for comment_replies in item['replies']['comments']:
                            # extract reply text and likes
                            reply = comment_replies['snippet']['textDisplay']
                            reply_like_count = comment_replies["snippet"]["likeCount"]
                            comment_likes_dict[reply] = reply_like_count
                            # store reply in list
                            replies.append(reply)
                    # print comment with list of replies
                    #print(comment, replies, end='\n\n')
                    # empty reply list
                    replies = []
                # fetch the next page of comments, if any
                if 'nextPageToken' in video_response:
                    video_response = youtube.commentThreads().list(
                        part='snippet,replies',
                        videoId=video_id,
                        pageToken=video_response['nextPageToken']
                    ).execute()
                else:
                    break
        except Exception as e:
            # e.g. comments disabled for this video: skip it and move on
            print(f"Skipping {video_id}: {e}")
            continue
    return comment_likes_dict
This function aggregates the comments of the 10 videos and returns a single dictionary in the form {comment : number of likes}. Note that we keep the number of likes of each comment: later in the process, we’ll focus the analysis only on the comments with the most likes, saving on OpenAI API costs.
3. Pre-Process Comments
We need to transform these comments into embeddings in order to cluster similar comments together. For those not familiar, text embedding is a way of converting words and sentences into vectors of numbers that computers can understand and work with.
When clustering embeddings of user comments, it’s crucial to recognize that not all comments should be treated as single, indivisible units. This is because some comments contain multiple distinct sentiments that can skew the clustering process if not properly segmented.
Take, for instance, a comment like this:
“The special effects were amazing, but the plot was sooo boring”
This single comment actually encompasses two contrasting sentiments: a highly positive view of the special effects and a strongly negative opinion about the plot. If we treat this comment as a whole, its embedding might incorrectly cluster it together with neutral comments due to the mixed sentiments.
Therefore, to achieve more accurate clustering, it’s advisable to split such comments into their constituent sub-comments — in this case, ‘The special effects were amazing’ and ‘but the plot was sooo boring.’ By analyzing separately the embeddings of these split comments, we can more accurately categorize the comment in either positive or negative clusters rather than incorrectly classifying it as neutral.
If we used the embedding of the full comment as a whole, without breaking it down into its two components, we would risk the comment getting stuck in the middle (i.e., it might be clustered together with neutral comments).
A small illustration of this concept might make the idea clearer.
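Purely as an illustration (this snippet is not part of the pipeline, and the two reference sentences are invented for the example), the sketch below embeds the full comment and its two halves and compares each of them, via cosine similarity, with a clearly positive and a clearly negative reference sentence. The expectation is that the full comment sits somewhere between the two references, while each half lands closer to its own pole:
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="your API-key")

def embed(text):
    # Return the text-embedding-ada-002 embedding of a piece of text as a numpy array
    response = client.embeddings.create(input=text, model="text-embedding-ada-002")
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented reference sentences for the two "poles"
positive_ref = embed("I loved this movie, it was fantastic")
negative_ref = embed("I hated this movie, it was terrible")

full_comment = embed("The special effects were amazing, but the plot was sooo boring")
positive_half = embed("The special effects were amazing")
negative_half = embed("but the plot was sooo boring")

for name, vec in [("full comment", full_comment),
                  ("positive half", positive_half),
                  ("negative half", negative_half)]:
    print(name,
          "-> similarity to positive ref:", round(cosine_similarity(vec, positive_ref), 3),
          "| to negative ref:", round(cosine_similarity(vec, negative_ref), 3))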
We use the OpenAI API to make this split: we are going to break each comment into one or more bullet points (sub-comments).
First, we import the OpenAI Python library and set our own API key.
import openai
openai_api_key = "your API-key"
openai.api_key = openai_api_key
We transform the comments-and-likes dictionary into a Pandas DataFrame and keep the top 100 comments by number of likes for now (we assume comments with more likes are more relevant):
import pandas as pd
# Transform the dictionary into a comments DataFrame
comments_df = pd.DataFrame.from_dict(comments_and_likes, orient='index').reset_index()
comments_df.columns = ["Comment","Likes"]
comments_df = comments_df.sort_values(by="Likes", ascending=False).head(100)  # keep the top 100 comments by likes
Then, we create a function to split each comment into sub-comments. This task is quite simple, and we follow the rule of “use a simple LLM for simple tasks and a complex model for complex tasks”: we are going to use gpt-3.5-turbo, which is less capable but also cheaper than gpt-4.
import time
from openai import OpenAI

def generate_summary(comment):
    while True:
        try:
            prompt = """
            Given the following comment from a YouTube user, create a concise and meaningful summary in one or more bullet points.
            Each bullet point maximum 20 words. Do not exceed this limit.
            Each bullet point should not contain information already contained in other bullet points.
            example of comment: "This car is too expensive, especially when compared to the Tesla Model 3, which at the same price offers standard titanium rims"
            reply: - user emphasizes that the price of the car is too high for its value
            - user claims that Tesla offers a better product at the same price
            Comment:
            '%s'
            """ % (comment)
            client = OpenAI(api_key=openai_api_key)
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": f"{prompt}"}],
                stream=False,
                temperature=0.0
            )
            output = completion.choices[0].message.content.strip()
            bullet_points = output.split('\n')
            bullet_points = [point.strip() for point in bullet_points]  # remove leading and trailing whitespace
            return bullet_points
        except Exception as e:  # catch all errors and retry
            print(f"Error occurred: {e}. Sleeping for 5 seconds...")
            time.sleep(5)  # wait for 5 seconds before retrying
We can then apply this function to the DataFrame column containing the YouTube comments to get the split comments:
# Create new column 'summary' by applying generate_summary function to each comment
comments_df['summary'] = comments_df['Comment'].apply(generate_summary)
comments_df = comments_df.explode('summary')
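Each call to generate_summary is one API request, so applying it to 100 comments takes a while. If you want to track progress, a minimal optional sketch using tqdm’s pandas integration (tqdm is an extra dependency, not required by the rest of the pipeline) could replace the plain apply:
from tqdm import tqdm

tqdm.pandas(desc="Splitting comments")

# progress_apply behaves like apply but displays a progress bar
comments_df['summary'] = comments_df['Comment'].progress_apply(generate_summary)
comments_df = comments_df.explode('summary')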
4. Clustering similar comments together
Now that we have created all the sub-comments, we can finally cluster similar comments together. To achieve this, we will:
1. Use the OpenAI API, specifically the ‘text-embedding-ada-002’ model, to convert each sub-comment into a text embedding. These embeddings represent the textual content in a numerical form that’s suitable for analysis.
2. Cluster similar embeddings together using a clustering algorithm (here for simplicity a simple K-Means). The idea is to have similar comments with the same label.
3. Embeddings are inherently multidimensional, making them challenging to visualize in their original form. To overcome this, we apply a dimensionality reduction technique. In this case, we use T-SNE (t-Distributed Stochastic Neighbor Embedding), which helps us plot these embeddings in a more comprehensible two-dimensional space.
The following piece of code is inspired by the OpenAI Cookbook on embedding clustering.
First, we need to create an additional embedding column in our dataframe in order to have a numerical representation of our comments:
import numpy as np
client = OpenAI(api_key=openai_api_key)
# Create an 'Embeddings' column, one embedding for each sub-comment.
embeddings = []
for i, comment in enumerate(comments_df["summary"], start=1):
    embedding = client.embeddings.create(input=comment, model='text-embedding-ada-002').data[0].embedding
    embeddings.append(embedding)
    print(i)
comments_df["Embeddings"] = embeddings
comments_df["Embeddings"] = comments_df.Embeddings.apply(np.array)  # convert list to numpy array
# Transform embeddings into matrix notation
matrix = np.vstack(comments_df.Embeddings.values)
matrix.shape
Now that we have obtained an embedding for each sub-comment (text-embedding-ada-002 embeddings have 1,536 dimensions, so matrix has shape (number of sub-comments, 1536)), we can apply a K-Means algorithm to cluster similar comments. For simplicity, I chose 10 clusters, but for more advanced applications this number can be tuned; a sketch of one way to do that follows the clustering code below:
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
import random
# Choose the number of clusters
n_clusters = 10
# run K-Means clustering with the chosen number of clusters
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=43)
kmeans.fit(matrix)
labels = kmeans.labels_
comments_df["Cluster"] = labels
Plotting results:
color_names = ["Red", "Blue", "Green", "Yellow", "Purple", "Orange", "Black", "violet", "Pink", "Brown", "Gray", "Cyan", "Magenta", "Lime", "Maroon", "Navy", "Olive", "Teal", "Silver", "Indigo"]

def generate_color_list(n):
    return random.sample(color_names, n)

tsne = TSNE(n_components=2, perplexity=15, random_state=44, init="random", learning_rate=200)
vis_dims2 = tsne.fit_transform(matrix)
x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]
plt.figure(figsize=(16, 9))
for category, color in enumerate(generate_color_list(n_clusters)):
    xs = np.array(x)[comments_df.Cluster == category]
    ys = np.array(y)[comments_df.Cluster == category]
    plt.scatter(xs, ys, color=color, alpha=0.3)
    avg_x = xs.mean()
    avg_y = ys.mean()
    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100, label=f"Cluster {category}")
plt.title("Clusters visualized in 2D using t-SNE")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")  # place the legend outside the plot
plt.show()
5. Create a single movie review for each cluster of user comments
We’re almost there. At this stage, we produce one film review for each cluster of comments. It makes sense to write a review per cluster because similar comments should contain similar information and a “single point of view.” Later, we can aggregate the 10 generated movie reviews into a single, comprehensive final review.
Interestingly, we can also find a title for each cluster of user comments with this code:
cluster_title_dict = {}
# We find a cluster title for each cluster
for cluster in range(n_clusters):
    cluster_title = find_cluster_title(cluster, comments_df)
    cluster_title_dict[cluster] = cluster_title
print(cluster_title_dict)
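The helper find_cluster_title is not shown in the article. Here is a minimal sketch of how it could be implemented (define it before running the loop above), assuming it asks gpt-3.5-turbo to name a cluster from a sample of its sub-comments; the prompt wording and the sample_size parameter are my assumptions:
def find_cluster_title(cluster, df, sample_size=20):
    # Sketch of a helper: ask gpt-3.5-turbo for a short title describing one cluster of sub-comments.
    # 'client' is the OpenAI client created earlier for the embeddings.
    sample = "\n".join(
        df[df.Cluster == cluster]["summary"].drop_duplicates().head(sample_size).values)
    prompt = (
        "These are YouTube sub-comments about the same movie, grouped by topic.\n"
        "Give a short title (max 12 words) that captures what this group of comments is about.\n\n"
        f"{sample}\n\nTitle:")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0)
    return response.choices[0].message.content.strip()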
We obtain these 10 cluster titles, each grasping a different viewpoint or nuance on the movie:
{0: '"Rave Reviews for The Boy and the Heron: A Masterpiece of Animation and Emotion"',
1: '"The Symbolism and Character Growth in The Boy and the Heron"',
2: '"The Boy and the Heron: A Masterpiece of Animation and Emotion"',
3: '"The Boy and the Heron: A Love Letter to Miyazaki\'s Legacy"',
4: 'Glowing Reviews for "The Boy and the Heron" - A Nostalgic Masterpiece in the Style of Hayao Miyazaki and Studio Ghibli',
5: '"Enduring Love and Nostalgia for Studio Ghibli"',
6: '"The Boy and the Heron: A Mind-Altering, Surreal, and Beautiful Nightmare"',
7: "Miyazaki's Farewell and Legacy",
8: '"Rob Pattinson\'s Voice Acting and the Heron in The Boy and the Heron"',
9: '"The Enduring Magic of Miyazaki\'s Films"'}
To summarize each cluster in a single movie review, we create a simple function, again leveraging the OpenAI API. This time, I believe that creating a seemingly human, high-quality movie review is quite a complex task, so I will use gpt-4, which is a more powerful model:
def clusterized_comments_summary(topic, cluster_reviews):
    while True:
        try:
            prompt = """
            This is a writing contest. You win if you write the maximum possible amount of words.
            You are a movie critic. You have to write your critique of this movie: %s.
            You have not watched the movie, so your critique must be based on these YouTube comments about it:
            %s
            Movie Review:
            """ % (topic, cluster_reviews)
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "You're an assistant."},
                    {"role": "user", "content": f"{prompt}"},
                ],
                stream=False,
                temperature=0.0
            )
            output = response.choices[0].message.content.strip()
            return output
        except Exception as e:  # catch all errors and retry
            print(f"Error occurred: {e}. Sleeping for 5 seconds...")
            time.sleep(5)  # wait for 5 seconds before retrying
We can now loop over each cluster and apply this function. The result will be 10 movie reviews, one per cluster.
reviews_summary_dict = {}
for cluster in range(n_clusters):
    # collect all cluster reviews
    cluster_reviews = "\n".join(
        comments_df[comments_df.Cluster == cluster]["Comment"].drop_duplicates().values)
    # summarize cluster reviews
    reviews_summary_dict[cluster] = clusterized_comments_summary(topic, cluster_reviews)
    print("cluster %s summarized" % (cluster))
The result is a single dictionary, reviews_summary_dict, in the form {cluster : movie_review}.
6. Reorganize the movie reviews into a final one
Finally, it’s time to take these 10 movie reviews and organize them into a single, well-structured final review.
First, we join all 10 reviews together into a single string. This massive string will later be reorganized into a more structured review:
comment_str = ""
for k, v in reviews_summary_dict.items():
    comment_str += v + "\n\n"  # separate the individual reviews with blank lines
I create a function that takes this massive string as input and returns a final, well-organized review that makes sense of the ideas contained in each of the 10 individual reviews.
This is again a complex task, so I chose gpt-4 as the go-to model. Here the temperature parameter has been set to 1, so every run generates a different final movie review. I set it to 1 because I wanted to try different versions of the same review; you can set it to whatever value you like.
def reorganize_summaries(comment_str):
    while True:
        try:
            prompt = """
            You are a movie critic. You have to write the review of a movie for a movie magazine.
            You have not watched the movie, but you are given these texts, which you must use as input:
            %s
            Write a comprehensive, well-organized article. Use as many words as you can.
            Magazine Article:
            """ % (comment_str)
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "You're an assistant."},
                    {"role": "user", "content": f"{prompt}"},
                ],
                stream=False,
                temperature=1
            )
            output = response.choices[0].message.content.strip()
            print('Summary created')
            return output
        except Exception as e:  # catch all errors and retry
            print(f"Error occurred: {e}. Sleeping for 5 seconds...")
            time.sleep(5)  # wait for 5 seconds before retrying
So we can now use this function to generate our final movie review:
movie_review = reorganize_summaries(comment_str)
And here it is: our final movie review!
Title: "The Boy and the Heron" – Miyazaki's Cinematic Love Letter
In the landscape of modern cinema, it is often difficult to find a film that truly stands out – that dares to tread where no other has gone, to challenge its viewers while also enchanting them with its timeless beauty. Hayao Miyazaki's latest film, "The Boy and the Heron," does just that and more, reaffirming his position as a masterful storyteller and animation auteur.
"The Boy and the Heron" is a masterpiece seven years in the making. Its meticulous attention to detail is evident in the animation quality that surpasses even the high bar set by prior Studio Ghibli productions. Viewers have described the film as next-level beautiful, soulful, and clean – with the fire scenes being singled out for their sheer aesthetic brilliance.
Miyazaki's latest creation feels refreshing even as it borrows shades of Spirited Away, Princess Mononoke, Totoro, and Howl's Moving Castle. The genius of this film lies in its ability to feel both new and nostalgic, offering a fresh take on Miyazaki's previous masterpieces.
"The Boy and the Heron" is a narrative that does not believe in hand-holding its audience. Its plot is oblique, inviting viewers to engage actively with the film, trying to figure out how the whimsical, the horrific, the marvelous, and the grotesque stitch together. It is a testament of respect for its audience, trusting them to navigate a plot that is not always clear-cut.
The characters emerge as interesting and well-developed entities. The protagonist, Mahito, impresses with his nuanced portrayal. His journey from discomfort with the concept of new life to accepting his new sibling is a poignant exploration of grief and acceptance.
The film is a visual love letter to animation. The trailer intentionally does not give away the plot but instead focuses on showcasing the stunning animation and fantastic music. It reminds viewers of a critical maxim of storytelling – show, not tell.
In many ways, the film is a summary of Miyazaki's life's work and potentially his final masterpiece. It is deeply personal, with viewers interpreting elements of the film as reflections of Miyazaki's life and his anti-war stance. The birds in the film serve as metaphors for fighter planes and his exploration of the true purpose of flying. It beautifully explores themes of loss, acceptance, and self-reliance.
However, it is not without its critics. Some viewers found the film's storytelling overly complex, with scenes they found puzzling. Despite these criticisms, the film was widely recognized as intense, exciting, humorous, and horrifying. The experience of watching a Ghibli film on a big screen was seen as worthwhile, with viewers expressing a yearning for a rematch.
Viewers have also praised the return of Mark Hamill, Willem Dafoe, and Christian Bale for the Ghibli dub. Their participation adds to the anticipation and excitement surrounding this film. It is a strong contender for the Oscar award for Best Animated Feature Film. This film could mark an end to an amazing era, and a fitting tribute to a man who has transformed the world of animation – there will never be another like Miyazaki.
In conclusion, "The Boy and the Heron" is a film that is rich in visual appeal and narrative depth. It is a must-watch for all fans of Miyazaki and Studio Ghibli. It is a cinematic experience that serves as a bittersweet farewell to the genius that is Hayao Miyazaki, and a beautiful testament to the magic of cinema. Despite its seemingly complex narrative, it is a film that resonates deeply with its viewers, a film that commemorates and appreciates the power of animation in storytelling. It is a film that deserves recognition, perhaps even an Oscar for Best Animated Feature Film. "The Boy and the Heron" is truly a masterpiece.
It actually reads quite human-like and provides some nice insights.
Some final considerations:
- The idea of combining user comments with LLMs is quite fascinating; this is just an experiment that can be extended to other domains like product reviews, hotel reviews, and all kinds of social media comments.
- OpenAI is the most popular option right now, but it would be interesting to try the same approach with increasingly popular open-source LLMs.
- This analysis is limited to 100 comments; for more insightful reviews, you can increase the number of comments.
Thank you for your time, and feel free to add any feedback; it’s more than welcome!
Published via Towards AI