Country Recognition and Geolocated Sentiment Analysis Using the RoBERTa Model
Last Updated on February 10, 2025 by Editorial Team
Author(s): Pedro Markovicz
Originally published on Towards AI.
Have you ever wondered how public opinion about a country shapes its global image? From travel reviews to political debates on social media, people's opinions often carry an emotional tone that can reveal intriguing regional patterns. What if we could map these sentiments across the globe and uncover insights that go beyond just words?
That's where Geolocated Sentiment Analysis comes in. Sentiments aren't just personal; they're influenced by culture, region, and national identity. By building an up-to-date dataset from social media comments and mapping each comment to the countries it mentions and the sentiment it expresses, we can gain deeper insights into these emotions.
In essence, this study combines Country Recognition with Sentiment Analysis, leveraging the RoBERTa NLP model for Named Entity Recognition (NER) and Sentiment Classification to explore how sentiments vary across different geographical regions.
Data Extraction
This project explores how data from Reddit, a widely used platform for discussions and content sharing, can be utilized to analyze global sentiment trends.
The data collection was performed using PRAW (Python Reddit API Wrapper), which enabled the extraction of relevant content from communities focused on Travel, News, Continents and Countries. This approach allows for an assessment of global sentiment trends as explored in this project.
The PRAW API enables the extraction of the following data:
- Title: The title of the post provided by the original poster;
- Comment: A specific comment made by a user on the post;
- Flair: The category or tag selected by the original poster for the post;
- Date: The exact year, month, and day the post was created.
# Import the necessary libraries
import praw
import datetime
import pandas as pd

# Initialize Reddit API connection
reddit = praw.Reddit(
    client_id="your_client_id",
    client_secret="your_client_secret",
    user_agent="your_user_agent",
)

# Define subreddit and fetch top posts
subreddit_name = 'travel'
subreddit = reddit.subreddit(subreddit_name)
posts = subreddit.top(limit=None)

# Set criteria for collecting posts and comments
min_comments_per_post = 2
max_rows = 50000
data = []

# Iterate over the top posts
for post in posts:
    if post.num_comments >= min_comments_per_post:
        post_data = {
            'title': f"{post.title} / {post.selftext}",
            'date': datetime.datetime.utcfromtimestamp(post.created_utc),
            'flair': post.link_flair_text,
        }
        # Extract top-level comments, skipping "load more comments" placeholders
        comments = [comment.body for comment in post.comments if isinstance(comment, praw.models.Comment)]
        # Store each comment along with the post data
        for comment in comments:
            data.append({
                'title': post_data['title'],
                'comment': comment,
                'date': post_data['date'],
                'flair': post_data['flair'],
            })
            # Stop collecting data once the row limit is reached
            if len(data) >= max_rows:
                break
    # Break out of the main loop if the row limit is reached
    if len(data) >= max_rows:
        break

# Create a DataFrame from the collected data
df = pd.DataFrame(data, columns=['title', 'comment', 'date', 'flair'])
The figure below presents an example, showcasing the first five unique entries extracted from the dataset.
The Challenges of NLP: NER and Sentiment Analysis
Extracting the emotional tone of a text and associating it with specific regions isn't straightforward. It involves tackling two main NLP challenges:
- Named Entity Recognition (NER): Identifying and linking mentions of countries within text.
- Sentiment Analysis: Determining whether the emotional tone of the text is positive, negative, or neutral.
Both challenges will be addressed using RoBERTa, a state-of-the-art Transformer-based machine learning model. RoBERTa is an optimized variant of BERT that keeps the architecture but retunes the pretraining procedure and its hyperparameters, which improves performance across a wide range of natural language processing tasks [6].
Named Entity Recognition (NER)
To address the NER challenge, a pre-trained RoBERTa model designed for Named Entity Recognition tasks was used. The model, sourced from Hugging Face, is Jean-Baptiste/roberta-large-ner-english [3].
To reduce the number of texts sent to the model, the algorithm first searches the title and comment columns against lists and dictionaries of countries, cities, capitals, demonyms, and political leaders associated with each country. Only texts in which this lookup finds no country are then passed to the NER model for further processing.
# Import the necessary libraries
import re
import pandas as pd
from transformers import pipeline

def extract_countries(row, demonym_mapping, city_to_country, countries_list, ner_pipeline):
    # Helper function to search for countries/cities/demonyms in the text
    def find_matches_in_text(text, search_dict):
        found_countries = set()
        for key, country in search_dict.items():
            # Word boundaries prevent matches inside longer words
            pattern = r'\b' + re.escape(key.lower()) + r'\b'
            if re.search(pattern, text):
                found_countries.add(country)
        return found_countries

    # Prepare the text (column names match the DataFrame built during extraction)
    text = row['title'].lower() + ' ' + row['comment'].lower()
    # Search in the static list of countries
    found_countries = find_matches_in_text(text, {country: country for country in countries_list})
    # Search for demonyms and political leaders
    found_countries.update(find_matches_in_text(text, demonym_mapping))
    # Search for cities
    found_countries.update(find_matches_in_text(text, city_to_country))
    # If no country is found, use the RoBERTa model
    if not found_countries:
        found_countries.update(extract_countries_with_roberta(text, ner_pipeline, countries_list))
    return list(found_countries) if found_countries else None

def extract_countries_with_roberta(text, ner_pipeline, countries_list):
    # Apply the NER pipeline to the text
    found_countries = set()
    ner_results = ner_pipeline(text)
    # Iterate over the entities recognized by the NER model
    for entity in ner_results:
        # Check if the recognized entity is a location
        if 'LOC' in entity.get('entity', ''):
            entity_text = entity['word'].lower()
            # Iterate over the list of countries and check if any of them is contained in the recognized entity
            for country in countries_list:
                if country.lower() in entity_text:
                    found_countries.add(country)
    return found_countries

# Load the NER pipeline with the RoBERTa model
ner_pipeline = pipeline("ner", model="Jean-Baptiste/roberta-large-ner-english")

# Apply the extract_countries function to the DataFrame
df['countries'] = df.apply(lambda row: extract_countries(
    row=row,
    demonym_mapping=demonym_mapping,
    city_to_country=city_to_country,
    countries_list=countries_list,
    ner_pipeline=ner_pipeline
), axis=1)
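The listing above relies on `countries_list`, `demonym_mapping`, and `city_to_country`, which are not shown in the article. As a minimal sketch of what such lookup tables might look like, the example below uses a handful of made-up entries and a hypothetical helper, `lookup_countries`, that applies the same whole-word matching; the real lists are presumably far larger.

```python
import re

# Hypothetical sample data; these entries are assumptions for illustration only.
countries_list = ["France", "Japan", "Brazil"]
demonym_mapping = {"french": "France", "japanese": "Japan", "brazilian": "Brazil"}
city_to_country = {"paris": "France", "tokyo": "Japan", "rio de janeiro": "Brazil"}

def lookup_countries(text):
    """Return the sorted countries found by whole-word dictionary lookup."""
    text = text.lower()
    # Merge all lookup tables into one mapping from search key to country
    search = {country.lower(): country for country in countries_list}
    search.update(demonym_mapping)
    search.update(city_to_country)
    found = set()
    for key, country in search.items():
        # \b anchors keep "french" from matching inside "frenchman"
        if re.search(r"\b" + re.escape(key) + r"\b", text):
            found.add(country)
    return sorted(found)

print(lookup_countries("The Japanese rail pass was great, then we flew to Paris."))
# ['France', 'Japan']
```

Because the keys are matched as whole words, a demonym such as "japanese" resolves to Japan through `demonym_mapping` rather than through a partial match on "japan".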
Before the text data is fed into the NER model, it is cleaned thoroughly. The cleaning process removes special characters, converts all text to lowercase, and eliminates stopwords, all of which make tokenization more efficient.
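The cleaning code itself is not shown in the article. The sketch below is one possible version of the three steps described, assuming a tiny hand-written stopword set and a hypothetical function name, `clean_text`; a real pipeline would likely use a fuller stopword list.

```python
import re

# Small illustrative stopword set; an assumption, not the author's list.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "was", "it"}

def clean_text(text):
    """Lowercase, strip special characters, and drop stopwords."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace and filter out stopwords
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("The food in Lisbon was AMAZING!!! (and cheap)"))
# food lisbon amazing cheap
```

Note that this character filter also removes accented letters and splits contractions, so place names such as "São Paulo" would need a more permissive pattern.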
Once the data is cleaned, it passes through the country recognition pipeline, where a new column titled countries is added to the dataset. This column stores a list of all the countries mentioned in the text.
Sentiment Analysis
For the Sentiment Analysis challenge, a pre-trained RoBERTa model specifically designed for sentiment classification was used. The model, sourced from Hugging Face, is cardiffnlp/twitter-roberta-base-sentiment-latest [4].
# Import the necessary libraries
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

# Load the tokenizer and pre-trained RoBERTa model for sentiment analysis
tokenizer = RobertaTokenizerFast.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")
model = RobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")

# Function to calculate sentiment
def calculate_sentiment(comment_text):
    # Tokenize and prepare the input data (texts longer than 128 tokens are truncated)
    encoding = tokenizer(comment_text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']
    # Make prediction without gradients
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Pick the class with the highest logit
    _, preds = torch.max(outputs.logits, dim=1)
    return preds.item()

# Apply the function to the comments in the DataFrame
df['sentiment'] = df['comment'].apply(calculate_sentiment)
By integrating this pre-trained model into the pipeline, the sentiment score is computed for the text in the comment column. A score of 0 indicates negative sentiment, 1 indicates neutral sentiment, and 2 indicates positive sentiment.
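To make the scoring step concrete, the sketch below shows in plain Python how a vector of three logits could be turned into a label and a confidence value. The logits are made up, and `softmax` and `label_from_logits` are hypothetical helpers written for this illustration; the article's own code only takes the index of the largest logit.

```python
import math

# Label order described in the article: 0 negative, 1 neutral, 2 positive
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def label_from_logits(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[index], probs[index]

# Made-up logits for a clearly positive comment
label, confidence = label_from_logits([-1.2, 0.3, 2.1])
print(label, round(confidence, 2))
```

Keeping the probability alongside the label would make it possible to filter out low-confidence predictions, although the article does not say whether this was done.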
Dataset Creation
Once all the steps are completed, a new pipeline is established, designed to generate a dataset that includes sentiment scores and the countries identified from the text of Reddit posts and comments.
By applying this pipeline to other subreddits, such as those focused on Traveling Tips, World News, Continents and Countries, user sentiment can be captured. This approach enables a structured analysis of sentiment towards the mentioned countries, providing valuable insights into users' perspectives across various discussions.
At the time of this publication, the pipeline had collected 444,059 up-to-date, unique comments, each tagged with the countries it mentions and classified by sentiment. After expanding the countries list so that each row pairs a comment with a single country, the dataset grew to 870,066 rows.
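The expansion of the countries list is read here as turning each comment that mentions several countries into one row per country, which would explain why the row count grows. The sketch below illustrates that reading on made-up rows with a hypothetical helper, `expand_countries`; in pandas, `DataFrame.explode` performs a similar reshaping.

```python
# Made-up rows in the shape produced by the pipeline (assumed for illustration)
rows = [
    {"comment": "Loved Kyoto, skipped Seoul.", "countries": ["Japan", "South Korea"], "sentiment": 2},
    {"comment": "Flights were delayed.", "countries": None, "sentiment": 0},
    {"comment": "Lisbon beats Madrid.", "countries": ["Portugal", "Spain"], "sentiment": 2},
]

def expand_countries(rows):
    """Emit one row per (comment, country) pair, dropping rows without a country."""
    expanded = []
    for row in rows:
        for country in row["countries"] or []:
            new_row = {key: value for key, value in row.items() if key != "countries"}
            new_row["country"] = country
            expanded.append(new_row)
    return expanded

expanded = expand_countries(rows)
print(len(rows), "->", len(expanded))
# 3 -> 4
```

Each expanded row carries the sentiment of the whole comment, so a comment that praises one country and criticizes another contributes the same score to both.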
Data Visualization
Now that the dataset is cleaned and prepared, it's time to visualize the results to uncover insights and trends. In this section, the Plotly library will be used to explore the data further.
Let's start by analyzing the volume of comments for each country.
Sentiment Over Time
In this section, let's explore how sentiments have evolved over time. Analyzing sentiment scores alongside the date of each post shows how the tone towards different countries has changed.
Visualizing sentiments in relation to the timeline reveals significant changes in tone and emotional context. Specific periods can also be explored where sentiment shifts may correlate with historical events.
For deeper insights, the data can be filtered by individual countries, allowing for the uncovering of unique sentiment trends and potentially connecting these shifts to major geopolitical events that occurred during those times.
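One way to prepare such a timeline is to average the sentiment scores per month for a single country before plotting. The sketch below does this on made-up rows; the function name `monthly_mean_sentiment` and the monthly granularity are assumptions, since the article does not state how the scores were aggregated.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows: one (country, date, sentiment) record per comment
rows = [
    {"country": "Japan", "date": datetime(2023, 1, 5), "sentiment": 2},
    {"country": "Japan", "date": datetime(2023, 1, 20), "sentiment": 1},
    {"country": "Japan", "date": datetime(2023, 2, 3), "sentiment": 0},
    {"country": "Brazil", "date": datetime(2023, 1, 9), "sentiment": 2},
]

def monthly_mean_sentiment(rows, country):
    """Average the 0-2 sentiment score per calendar month for one country."""
    buckets = defaultdict(list)
    for row in rows:
        if row["country"] == country:
            buckets[row["date"].strftime("%Y-%m")].append(row["sentiment"])
    # Sort by month so the result can be plotted as a time series
    return {month: sum(scores) / len(scores) for month, scores in sorted(buckets.items())}

print(monthly_mean_sentiment(rows, "Japan"))
# {'2023-01': 1.5, '2023-02': 0.0}
```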
Global Sentiment Overview
Next, let's analyze the sentiment tone for each country. We'll start by examining the broader sentiment trends before diving into more specific details.
By selecting the column for flair and associating it with the sentiment score, the overall sentiment associated with each post's tag can be analyzed.
The comment volume graphs make it clear that the number of comments varies significantly across countries. To account for this, each sentiment value is calculated relative to the total number of comments for that country. This ensures that countries with fewer comments are not overshadowed by those with larger volumes.
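The normalization described above can be sketched as follows: count the comments per sentiment class for each country, then divide by that country's total. The data and the helper name `sentiment_shares` are illustrative assumptions.

```python
from collections import Counter, defaultdict

LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def sentiment_shares(pairs):
    """Return the percentage of each sentiment class per country."""
    counts = defaultdict(Counter)
    for country, sentiment in pairs:
        counts[country][LABELS[sentiment]] += 1
    shares = {}
    for country, counter in counts.items():
        total = sum(counter.values())
        # Dividing by the country's own total makes large and small countries comparable
        shares[country] = {label: 100 * counter[label] / total for label in LABELS.values()}
    return shares

# Made-up (country, sentiment) pairs: country A has twice as many comments as B
pairs = [("A", 2), ("A", 2), ("A", 0), ("A", 1), ("B", 0), ("B", 2)]
print(sentiment_shares(pairs))
# {'A': {'negative': 25.0, 'neutral': 25.0, 'positive': 50.0}, 'B': {'negative': 50.0, 'neutral': 0.0, 'positive': 50.0}}
```

Both countries end up with a 50% positive share even though A has twice as many positive comments in absolute terms.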
For a deeper exploration, you can interact with the sentiment maps below. These maps provide a detailed visual representation of how sentiment varies globally, highlighting the percentages of positive, negative, and neutral sentiment across different countries.
Additionally, by calculating the difference between positive and negative sentiments, the sentiment balance can be visualized on a single map.
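As a small worked example of that balance, the sketch below subtracts the negative share from the positive share for each country. The percentages are made up and `sentiment_balance` is a hypothetical helper.

```python
# Made-up percentage shares per country (each row sums to 100)
shares = {
    "Country A": {"positive": 50.0, "neutral": 25.0, "negative": 25.0},
    "Country B": {"positive": 20.0, "neutral": 30.0, "negative": 50.0},
}

def sentiment_balance(shares):
    """Positive share minus negative share, in percentage points."""
    return {country: s["positive"] - s["negative"] for country, s in shares.items()}

print(sentiment_balance(shares))
# {'Country A': 25.0, 'Country B': -30.0}
```

A value above zero means positive comments outnumber negative ones; the neutral share does not enter the balance.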
To complement the analysis, the focus shifts to countries with the highest ratios of positive, negative, and neutral sentiment values. To ensure accuracy and relevance, only countries with a substantial number of comments are considered.
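The filtering step could look like the sketch below, which drops countries under a minimum comment count before ranking by ratio. The threshold, the counts, and the helper name `top_countries` are assumptions; the article does not state the cutoff it used.

```python
# Made-up comment counts per country
stats = {
    "A": {"total": 1000, "positive": 600, "negative": 150},
    "B": {"total": 5, "positive": 5, "negative": 0},
    "C": {"total": 800, "positive": 200, "negative": 400},
}

def top_countries(stats, label, min_comments, n=3):
    """Rank countries by the ratio of `label` comments, ignoring small samples."""
    eligible = {c: s for c, s in stats.items() if s["total"] >= min_comments}
    ranked = sorted(
        eligible,
        key=lambda c: eligible[c][label] / eligible[c]["total"],
        reverse=True,
    )
    return ranked[:n]

print(top_countries(stats, "positive", min_comments=100))
# ['A', 'C']
```

Country B has a 100% positive ratio but only five comments, so it is excluded rather than ranked first.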
Interestingly, countries with the most positive sentiment often share characteristics like a high quality of life, natural beauty, political stability, a discreet presence in international politics, or popularity as tourist destinations. These nations tend to enjoy a more favorable global perception.
On the other hand, countries with the most negative sentiment face significant challenges, such as the Russia-Ukraine and Israel-Palestine conflicts, which have intensified political and economic instability. These, along with internal conflicts, restrictions on civil liberties, humanitarian crises, or recent political developments, can contribute to a widespread negative perception.
Meanwhile, countries with the most neutral sentiment tend to have a balanced distribution, with scores clustered around the middle. This is because most countries evoke moderate, mixed opinions rather than extreme ones, making neutral sentiments more common, while positive and negative scores are less frequent and more spread out.
Conclusion
The results of this study, grounded in Geopolitical Analysis, partially validate the accuracy of the data generated by the RoBERTa model for Sentiment Analysis. These results emphasize the model's effectiveness in identifying the sentiment tone embedded within unstructured data, providing valuable insights into a country's global reputation and overall sentiment.
By adopting an interdisciplinary approach, the project also integrates Sentiment Analysis with Geopolitical Analysis, opening the door to more profound studies across disciplines such as International Relations.
Another key aspect of this project was the creation of a database containing 444,059 up-to-date, unique Reddit comments, each labeled with corresponding countries and sentiment values by the RoBERTa model.
I hope this project sparks your interest or proves helpful in some way! If you have any questions or thoughts, feel free to leave a comment, or connect with me on LinkedIn.
Sources
[1] Hugging Face, Transformers: Token classification
[2] Hugging Face, NLP Course: Token classification
[3] Hugging Face, Jean-Baptiste/roberta-large-ner-english
[4] Hugging Face, cardiffnlp/twitter-roberta-base-sentiment-latest
[5] PRAW, The Python Reddit API Wrapper (PRAW)
[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer and V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)
Published via Towards AI