Country Recognition and Geolocated Sentiment Analysis Using the RoBERTa Model

Last Updated on February 10, 2025 by Editorial Team

Author(s): Pedro Markovicz

Originally published on Towards AI.

Have you ever wondered how public opinion about a country shapes its global image? From travel reviews to political debates on social media, people’s opinions often carry an emotional tone that can reveal intriguing regional patterns. What if we could map these sentiments across the globe and uncover insights that go beyond just words?

That’s where Geolocated Sentiment Analysis comes in. Sentiments aren’t just personal; they’re influenced by culture, region, and national identity. By building an up-to-date dataset of social media comments and mapping each one to the countries it mentions and the sentiment it expresses, we can gain deeper insight into these emotions.

In essence, this study combines Country Recognition with Sentiment Analysis, leveraging the RoBERTa NLP model for Named Entity Recognition (NER) and Sentiment Classification to explore how sentiments vary across different geographical regions.

Data Extraction

This project explores how data from Reddit, a widely used platform for discussions and content sharing, can be utilized to analyze global sentiment trends.

The data was collected with PRAW (Python Reddit API Wrapper), which enabled the extraction of relevant content from communities focused on Travel, News, Continents and Countries.

The PRAW API enables the extraction of the following data:

Title: The title of the post provided by the original poster;
Comment: A specific comment made by a user on the post;
Flair: The category or tag selected by the original poster for the post;
Date: The exact year, month, and day the post was created.

# Import the necessary libraries
import datetime

import pandas as pd
import praw

# Initialize Reddit API connection
reddit = praw.Reddit(
    client_id="your_client_id",
    client_secret="your_client_secret",
    user_agent="your_user_agent",
)

# Define subreddit and fetch top posts
subreddit_name = 'travel'
subreddit = reddit.subreddit(subreddit_name)
posts = subreddit.top(limit=None)

# Set criteria for collecting posts and comments
min_comments_per_post = 2
max_rows = 50000
data = []

# Iterate over the top posts
for post in posts:
    if post.num_comments >= min_comments_per_post:
        post_data = {
            'title': f"{post.title} / {post.selftext}",
            # Timezone-aware UTC timestamp (utcfromtimestamp is deprecated in recent Python versions)
            'date': datetime.datetime.fromtimestamp(post.created_utc, tz=datetime.timezone.utc),
            'flair': post.link_flair_text,
        }

        # Extract top-level comments, skipping "load more comments" placeholders
        comments = [comment.body for comment in post.comments if isinstance(comment, praw.models.Comment)]

        # Store each comment along with the post data
        for comment in comments:
            data.append({
                'title': post_data['title'],
                'comment': comment,
                'date': post_data['date'],
                'flair': post_data['flair'],
            })

            # Stop collecting data once the row limit is reached
            if len(data) >= max_rows:
                break

    # Break out of the main loop if the row limit is reached
    if len(data) >= max_rows:
        break

# Create a DataFrame from the collected data
df = pd.DataFrame(data, columns=['title', 'comment', 'date', 'flair'])

The figure below presents an example, showcasing the first five unique entries extracted from the dataset.

The Challenges of NLP: NER and Sentiment Analysis

Extracting the emotional tone of a text and associating it with specific regions isn’t straightforward. It involves tackling two main NLP challenges:

Named Entity Recognition (NER): Identifying and linking mentions of countries within text.
Sentiment Analysis: Determining whether the emotional tone of the text is positive, negative, or neutral.

Both challenges are addressed with RoBERTa, a widely used Transformer-based language model. RoBERTa is an optimized variant of BERT that keeps the same architecture but revises the pretraining procedure and its hyperparameters, which improves performance across a wide range of natural language processing tasks.

Named Entity Recognition (NER)

To address the NER challenge, a pre-trained RoBERTa model fine-tuned for Named Entity Recognition was used. The model, sourced from Hugging Face, is:

Jean-Baptiste/roberta-large-ner-english

To reduce the number of calls to the model, the algorithm first scans the title and comment columns against lists and dictionaries of country names, cities, capitals, demonyms, and political leaders associated with each country. Only texts in which this lookup finds no country are passed to the NER model for further processing.

# Import the necessary libraries
import re
import pandas as pd
from transformers import pipeline

# countries_list, demonym_mapping and city_to_country are lookup structures
# that must be defined beforehand (they are not shown here)

def extract_countries(row, demonym_mapping, city_to_country, countries_list, ner_pipeline):
    # Helper function to search for countries/cities/demonyms in the text
    def find_matches_in_text(text, search_dict):
        found_countries = set()
        for key, country in search_dict.items():
            # Word boundaries prevent matches inside longer words
            pattern = r'\b' + re.escape(key.lower()) + r'\b'
            if re.search(pattern, text):
                found_countries.add(country)
        return found_countries

    # Prepare the text (column names match the DataFrame built during data extraction)
    text = row['title'].lower() + ' ' + row['comment'].lower()

    # Search in the static list of countries
    found_countries = find_matches_in_text(text, {country: country for country in countries_list})

    # Search for demonyms and political leaders
    found_countries.update(find_matches_in_text(text, demonym_mapping))

    # Search for cities
    found_countries.update(find_matches_in_text(text, city_to_country))

    # If no country is found, use the RoBERTa model
    if not found_countries:
        found_countries.update(extract_countries_with_roberta(text, ner_pipeline, countries_list))

    return list(found_countries) if found_countries else None

def extract_countries_with_roberta(text, ner_pipeline, countries_list):
    # Apply the NER pipeline to the text
    found_countries = set()
    ner_results = ner_pipeline(text)

    # Iterate over the entities recognized by the NER model
    for entity in ner_results:
        # Aggregated pipelines report the label under 'entity_group',
        # token-level pipelines under 'entity'
        label = entity.get('entity_group') or entity.get('entity', '')

        # Check if the recognized entity is a location
        if 'LOC' in label:
            entity_text = entity['word'].lower()

            # Check whether any known country name is contained in the recognized entity
            for country in countries_list:
                if country.lower() in entity_text:
                    found_countries.add(country)

    return found_countries

# Load the NER pipeline with the RoBERTa model; aggregation merges subword tokens
# so that multi-word names such as "South Africa" arrive as a single entity
ner_pipeline = pipeline(
    "ner",
    model="Jean-Baptiste/roberta-large-ner-english",
    aggregation_strategy="simple",
)

# Apply the extract_countries function to the DataFrame
df['countries'] = df.apply(lambda row: extract_countries(
    row=row,
    demonym_mapping=demonym_mapping,
    city_to_country=city_to_country,
    countries_list=countries_list,
    ner_pipeline=ner_pipeline
), axis=1)
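The lookup structures passed to extract_countries are not shown in the article. As a rough illustration only, the sketch below uses tiny, hypothetical versions of countries_list, demonym_mapping, and city_to_country to show how the word-boundary matching behaves on a single sentence. The real lists would be far larger.

```python
import re

# Hypothetical, tiny lookup structures; the entries are illustrative assumptions
countries_list = ["France", "Japan", "Brazil"]
demonym_mapping = {"French": "France", "Japanese": "Japan", "Brazilian": "Brazil"}
city_to_country = {"Paris": "France", "Tokyo": "Japan", "Rio de Janeiro": "Brazil"}


def find_matches_in_text(text, search_dict):
    """Return the countries whose keys occur in `text` as whole words."""
    lowered = text.lower()
    found = set()
    for key, country in search_dict.items():
        # \b keeps "Oman" from matching inside "woman" or "Roman"
        pattern = r"\b" + re.escape(key.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.add(country)
    return found


text = "We flew from Tokyo to Paris and loved the Brazilian restaurant there."
matches = set()
for lookup in ({c: c for c in countries_list}, demonym_mapping, city_to_country):
    matches |= find_matches_in_text(text, lookup)

print(sorted(matches))
```

Because every key is wrapped in word boundaries, short country names do not trigger on longer words that merely contain them.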

Before the text is fed into the NER model, it is cleaned thoroughly. The cleaning process removes special characters, converts all text to lowercase, and eliminates stopwords, all of which make the tokenization step more efficient.
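The cleaning code itself is not included in the article. A minimal sketch of such a step might look like the following, assuming a small hypothetical stopword set (a full list from a library such as NLTK or spaCy would normally be used) and an extra URL-removal step that the article does not mention.

```python
import re

# Hypothetical stopword subset, for illustration only
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "in", "and", "i"}


def clean_text(text):
    """Lowercase, strip URLs and special characters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs (an added assumption)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters (also drops accented letters)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("I LOVED the food in Lisbon!!! See https://example.com :)"))
```

Note that the ASCII-only character filter would also strip accented letters in place names, so a production version would likely need a more permissive pattern.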

Once the data is cleaned, it passes through the country recognition pipeline, where a new column titled countries is added to the dataset. This column stores a list of all the countries mentioned in the text.

Sentiment Analysis

For the Sentiment Analysis challenge, a pre-trained RoBERTa model specifically designed for sentiment classification was used. The model, sourced from Hugging Face, is:

cardiffnlp/twitter-roberta-base-sentiment-latest

# Import the necessary libraries
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

# Load the tokenizer and pre-trained RoBERTa model for sentiment analysis
tokenizer = RobertaTokenizerFast.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")
model = RobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")

# Function to calculate sentiment
def calculate_sentiment(comment_text):
    # Tokenize and prepare the input data (long comments are truncated to 128 tokens)
    encoding = tokenizer(comment_text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Make prediction without gradients
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs.logits, dim=1)

    # Index of the highest-scoring class
    return preds.item()

# Apply the function to the comments in the DataFrame
df['sentiment'] = df['comment'].apply(calculate_sentiment)

By integrating this pre-trained model into the pipeline, the sentiment score is computed for the text in the comment column. A score of 0 indicates negative sentiment, 1 indicates neutral sentiment, and 2 indicates positive sentiment.
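To make the score easier to read, the integer class can be mapped back to a label, and the raw logits can be turned into a probability. The sketch below is illustrative only: it uses the label order described above, hypothetical logits instead of real model output, and helper names (softmax, describe_prediction) that are not part of the original pipeline.

```python
import math

# Label order as described in the text: 0 = negative, 1 = neutral, 2 = positive
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits):
    """Return the predicted label and its probability for one comment."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[pred], probs[pred]


# Hypothetical logits for a single comment
label, confidence = describe_prediction([-1.2, 0.3, 2.1])
print(label, round(confidence, 3))
```

Keeping the probability alongside the label would make it possible to filter out low-confidence predictions later, although the article itself stores only the class index.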

Dataset Creation

Once all the steps are completed, a new pipeline is established, designed to generate a dataset that includes sentiment scores and the countries identified from the text of Reddit posts and comments.

By applying this pipeline to other subreddits, such as those focused on Traveling Tips, World News, Continents and Countries, user sentiment can be captured. This approach enables a structured analysis of sentiment towards the mentioned countries, providing valuable insights into users’ perspectives across various discussions.

At the time of this publication, a total of 444,059 up-to-date unique comments had been collected, categorized by country, and classified by sentiment using the pipeline. After expanding the countries list so that each comment is counted once for every country it mentions, the dataset grew to 870,066 rows.
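One plausible reading of expanding the countries list is that each comment is repeated once per country it mentions, which is why the row count exceeds the number of unique comments. The sketch below illustrates that idea on hypothetical rows; the function name expand_countries is an assumption, not the author's code.

```python
# Hypothetical rows shaped like the dataset: one row per comment, countries as a list
rows = [
    {"comment": "Loved Kyoto and Seoul", "countries": ["Japan", "South Korea"], "sentiment": 2},
    {"comment": "The flight was delayed", "countries": None, "sentiment": 0},
    {"comment": "Lisbon beats Madrid", "countries": ["Portugal", "Spain"], "sentiment": 1},
]


def expand_countries(rows):
    """Emit one row per (comment, country) pair, skipping rows without a country."""
    expanded = []
    for row in rows:
        for country in row["countries"] or []:
            expanded.append({
                "comment": row["comment"],
                "country": country,
                "sentiment": row["sentiment"],
            })
    return expanded


expanded = expand_countries(rows)
print(len(rows), "comments ->", len(expanded), "rows")
```

On a pandas DataFrame, DataFrame.explode on the countries column, followed by dropping rows without a country, should give a comparable result.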

Data Visualization

Now that the dataset is cleaned and prepared, it’s time to visualize the results to uncover insights and trends. In this section, the Plotly library will be used to explore the data further.

Let’s start by analyzing the volume of comments for each country.

Top 25 countries with the highest number of comments
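The ranking behind a chart like this can be sketched with a simple counter. The data below is hypothetical and only illustrates the counting step; the resulting list could then be passed to a Plotly bar chart.

```python
from collections import Counter

# Hypothetical expanded rows: one (country, sentiment) pair per comment mention
mentions = [
    ("Japan", 2), ("Japan", 1), ("France", 0),
    ("Japan", 2), ("France", 2), ("Brazil", 1),
]

comment_counts = Counter(country for country, _ in mentions)
top_countries = comment_counts.most_common(25)  # at most 25 entries, highest volume first

for country, count in top_countries:
    print(f"{country}: {count}")
```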

Sentiment Over Time

In this section, let’s explore how sentiments have evolved over time. Analyzing sentiment scores alongside post dates makes it possible to trace these changes across different countries.

Visualizing sentiments in relation to the timeline reveals significant changes in tone and emotional context. Specific periods can also be explored where sentiment shifts may correlate with historical events.

For deeper insights, the data can be filtered by individual country to uncover country-specific sentiment trends and potentially connect these shifts to major geopolitical events of the period.
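As a rough sketch of the per-country filtering described above, the comments can be grouped by month and averaged. The rows and the function name monthly_mean_sentiment are hypothetical, and the article does not state which time granularity was actually used.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (country, post date, sentiment score 0/1/2)
rows = [
    ("Japan", date(2023, 1, 5), 2),
    ("Japan", date(2023, 1, 20), 1),
    ("Japan", date(2023, 2, 3), 0),
    ("France", date(2023, 1, 9), 2),
]


def monthly_mean_sentiment(rows, country):
    """Average sentiment score per (year, month) for one country."""
    buckets = defaultdict(list)
    for row_country, day, score in rows:
        if row_country == country:
            buckets[(day.year, day.month)].append(score)
    return {month: sum(scores) / len(scores) for month, scores in sorted(buckets.items())}


print(monthly_mean_sentiment(rows, "Japan"))
```

Months with very few comments produce noisy averages, so a minimum count per month may be worth applying before plotting.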

Global Sentiment Overview

Next, let’s analyze the sentiment tone for each country. We’ll start by examining the broader sentiment trends before diving into more specific details.

By selecting the column for flair and associating it with the sentiment score, the overall sentiment associated with each post’s tag can be analyzed.

The comment-volume charts make it clear that the number of comments varies significantly across countries. To account for this, each sentiment count is expressed as a share of the total number of comments for that country. This ensures that countries with fewer comments are not overshadowed by those with larger volumes.
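The normalization can be sketched as follows, using hypothetical (country, sentiment) pairs. The helper name sentiment_shares is an assumption made for this illustration.

```python
from collections import Counter, defaultdict

LABELS = {0: "negative", 1: "neutral", 2: "positive"}

# Hypothetical (country, sentiment) pairs
mentions = [
    ("Japan", 2), ("Japan", 2), ("Japan", 1), ("Japan", 0),
    ("France", 0), ("France", 2),
]


def sentiment_shares(mentions):
    """Percentage of negative/neutral/positive comments within each country."""
    counts = defaultdict(Counter)
    for country, score in mentions:
        counts[country][LABELS[score]] += 1
    shares = {}
    for country, counter in counts.items():
        total = sum(counter.values())
        shares[country] = {label: 100 * counter[label] / total for label in LABELS.values()}
    return shares


shares = sentiment_shares(mentions)
print(shares["Japan"])
```

Because each country's shares sum to 100, countries with very different comment volumes can be compared on the same scale.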

For a deeper exploration, you can interact with the sentiment maps below. These maps provide a detailed visual representation of how sentiment varies globally, highlighting the percentages of positive, negative, and neutral sentiment across different countries.

Additionally, by calculating the difference between positive and negative sentiments, the sentiment balance can be visualized on a single map.
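The balance is a simple subtraction of the two shares. A minimal sketch with hypothetical percentages:

```python
def sentiment_balance(shares):
    """Positive share minus negative share, in percentage points, per country."""
    return {country: s["positive"] - s["negative"] for country, s in shares.items()}


# Hypothetical per-country percentages (each row sums to 100)
shares = {
    "Japan": {"positive": 50.0, "neutral": 25.0, "negative": 25.0},
    "France": {"positive": 30.0, "neutral": 30.0, "negative": 40.0},
}

balance = sentiment_balance(shares)
print(balance)
```

A positive value means favorable comments outweigh unfavorable ones for that country, and a negative value means the opposite.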

To complement the analysis, the focus shifts to countries with the highest ratios of positive, negative, and neutral sentiment values. To ensure accuracy and relevance, only countries with a substantial number of comments are considered.
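The article does not state the threshold used to decide what counts as a substantial number of comments. The sketch below assumes a hypothetical cutoff of 1,000 comments and ranks the remaining countries by one sentiment share.

```python
# Hypothetical per-country summary: total comments and positive share (%)
summary = {
    "Japan": {"comments": 12000, "positive": 48.0},
    "France": {"comments": 9500, "positive": 41.5},
    "Tuvalu": {"comments": 12, "positive": 75.0},
}

MIN_COMMENTS = 1000  # assumed threshold; the article does not state the value used


def top_by_share(summary, label, min_comments=MIN_COMMENTS, n=10):
    """Rank countries by a sentiment share, ignoring low-volume countries."""
    eligible = {c: s for c, s in summary.items() if s["comments"] >= min_comments}
    return sorted(eligible, key=lambda c: eligible[c][label], reverse=True)[:n]


ranking = top_by_share(summary, "positive")
print(ranking)
```

Without the cutoff, a country with only a handful of comments could top the ranking purely by chance.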

Interestingly, countries with the most positive sentiment often share characteristics like a high quality of life, natural beauty, political stability, a discreet presence in international politics, or popularity as tourist destinations. These nations tend to enjoy a more favorable global perception.

On the other hand, countries with the most negative sentiment face significant challenges, such as the Russia-Ukraine and Israel-Palestine conflicts, which have intensified political and economic instability. These, along with internal conflicts, restrictions on civil liberties, humanitarian crises, or recent political developments, can contribute to a widespread negative perception.

Meanwhile, countries with the most neutral sentiment tend to have a balanced distribution, with scores clustered around the middle. This is because most countries evoke moderate, mixed opinions rather than extreme ones, making neutral sentiments more common, while positive and negative scores are less frequent and more spread out.

Conclusion

The results of this study, grounded in Geopolitical Analysis, partially validate the accuracy of the data generated by the RoBERTa model for Sentiment Analysis. These results emphasize the model’s effectiveness in identifying the sentiment tone embedded within unstructured data, providing valuable insights into a country’s global reputation and overall sentiment.

By adopting an interdisciplinary approach, the project also integrates Sentiment Analysis with Geopolitical Analysis, opening the door to more profound studies across disciplines such as International Relations.

Another key aspect of this project was the creation of a database containing 444,059 up-to-date, unique Reddit comments, each labeled with corresponding countries and sentiment values by the RoBERTa model.

I hope this project sparks your interest or proves helpful in some way! If you have any questions or thoughts, feel free to leave a comment, or connect with me on LinkedIn.

Sources

[1] Hugging Face, Transformers — Token classification
[2] Hugging Face, NLP Course — Token classification
[3] Hugging Face, Jean-Baptiste/roberta-large-ner-english
[4] Hugging Face, cardiffnlp/twitter-roberta-base-sentiment-latest
[5] PRAW, The Python Reddit API Wrapper (PRAW)
[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer and V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)


Published via Towards AI
