Predicting the 2024 U.S. Presidential Election Winner Using Machine Learning
Author(s): Sanjay Nandakumar
Originally published on Towards AI.
Table of Contents
- Introduction
- Methodology Overview
- Exploratory Data Analysis (EDA)
- Data Preparation (Synthetic Data)
- Feature Engineering and Selection
- Model Selection and Training
- Model Evaluation and Metrics
- Model Interpretability and Feature Importance
- Conclusion
- Future Directions
Introduction
The world's attention is on the 2024 U.S. presidential election, contested between the two major parties, Republican and Democratic. While traditional opinion polls provide a useful snapshot, machine learning offers a deeper, data-driven perspective. In this article, we explore how elections can be predicted with machine learning, using synthetic data to emulate socioeconomic and political factors, and discuss the techniques needed to apply these methods to complex real-world situations.
Machine learning is reshaping data-driven political analysis. With the goal of predicting the U.S. presidential election outcome, we seek to model the complex relationships among socioeconomic factors, political leanings, and media consumption. Election prediction, however, presents unique challenges, such as the dynamic nature of voter preferences, non-linear interactions, and latent biases in the data.
The points to cover in this article are as follows:
- Generating synthetic data to illustrate ML modelling for election outcomes.
- Discussing technical concepts such as feature engineering, model validation, and interpretability.
- Providing some insights into how data scientists might approach real-life election predictions.
Methodology Overview
In our work, we follow these steps:
- Data Generation: Create a synthetic dataset containing factors that influence voter behaviour.
- Exploratory Data Analysis: Examine the features' distributions, relationships, and correlations.
- Feature Engineering and Selection: Engineer features that capture important patterns and select only the most informative ones.
- Model Selection and Training: Train several ML models to learn patterns in the data.
- Evaluation and Interpretation: Use performance metrics and interpretability tools to make sense of the models' predictions.
Data Preparation (Synthetic Data)
Generating a Dataset
Synthetic data covering age, education, income, political alignment, media consumption, and the target variable, party affiliation, will be generated to loosely mimic real-world voting behaviour.
import pandas as pd
import numpy as np
# Set a random seed for reproducibility
# Ensures that the randomly generated data will be the same each time the code runs
np.random.seed(42)
# Define the number of samples (voters) to generate
n_samples = 1000
# Generate synthetic data for 1000 voters
data = {
    'age': np.random.randint(18, 90, n_samples),  # Random ages between 18 and 89
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),  # Random educational levels
    'income': np.random.normal(60000, 15000, n_samples),  # Normally distributed income around 60,000 with std. dev. of 15,000
    'political_alignment': np.random.choice(['Conservative', 'Liberal', 'Moderate'], n_samples),  # Random political alignment
    'media_consumption': np.random.choice(['Conservative', 'Liberal', 'Mixed'], n_samples),  # Preferred media type
    'party_affiliation': np.random.choice(['Republican', 'Democrat'], n_samples)  # Randomly assigned political party affiliation
}
# Create a DataFrame from the generated data
df = pd.DataFrame(data)
# Display the first few rows of the DataFrame to inspect the generated data
print(df.head())
Descriptions of the Features:
- Age: Voters in different age groups may hold different political orientations; younger voters tend to be more progressive, while older voters tend to be more conservative.
- Education: Education correlates with political orientation and influences voting behaviour.
- Income: Economic considerations are a major driver of political behaviour; people often support parties based on their economic agendas.
- Political Alignment: Self-reported alignment (Conservative, Liberal, Moderate) is a strong indicator of expected voting behaviour.
- Media Consumption: Media choices can reinforce ideological beliefs and shape opinion formation.
Target Variable:
- Party Affiliation: Indicates whether a respondent supports the Republican or Democratic party.
Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) helps us understand distributions, relationships, and potential problems in the data. We start by visualizing the distributions of the features and their relationships with the target variable.
Important Steps of EDA:
- Distribution analysis: Plot the distribution of continuous variables such as age and income.
- Bivariate analysis: Investigate how features relate to the target variable (a sketch follows the distribution plots below).
import seaborn as sns
import matplotlib.pyplot as plt
# Plot the distribution of voter ages
plt.figure(figsize=(10, 5)) # Set the figure size
sns.histplot(df['age'], kde=True) # Plot a histogram with a kernel density estimate (kde) for age
plt.title('Age Distribution of Voters') # Set the title of the plot
plt.show() # Display the plot
# Plot the distribution of voter incomes
plt.figure(figsize=(10, 5)) # Set the figure size
sns.histplot(df['income'], kde=True) # Plot a histogram with kde for income
plt.title('Income Distribution of Voters') # Set the title of the plot
plt.show() # Display the plot
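To cover the bivariate step listed above, here is a minimal sketch, assuming the df DataFrame and the seaborn/matplotlib imports from earlier, that compares age and income across party affiliations and checks the Democrat share per education level:
# Bivariate view: compare age and income distributions across party affiliation
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='party_affiliation', y='age')  # Age distribution per party
plt.title('Age by Party Affiliation')
plt.show()
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='party_affiliation', y='income')  # Income distribution per party
plt.title('Income by Party Affiliation')
plt.show()
# Share of Democrat-affiliated voters within each education level
print(df.groupby('education')['party_affiliation'].apply(lambda s: (s == 'Democrat').mean()))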
Analysis
This preliminary analysis helps identify patterns in age, income, and their impact on party affiliation. For example, if younger voters lean Democrat and older voters lean Republican, this pattern can guide feature engineering and model selection.
Feature Engineering and Selection
Encoding Categorical Variables
Since machine learning models work with numerical data, categorical variables like education and political alignment need to be encoded. We'll use one-hot encoding to capture categorical distinctions effectively.
# Encode categorical variables using one-hot encoding
# This converts each category into a binary column (0 or 1), excluding the first category (drop_first=True) to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=['education', 'political_alignment', 'media_consumption'], drop_first=True)
# Convert 'party_affiliation' to a binary variable
# Use 1 for 'Democrat' and 0 for 'Republican' to simplify analysis
df_encoded['party_affiliation'] = df_encoded['party_affiliation'].apply(lambda x: 1 if x == 'Democrat' else 0)
# Display the first few rows of the encoded DataFrame to verify the transformations
print(df_encoded.head())
Use correlation matrices to identify highly correlated features after encoding:
# Perform and visualize correlation analysis
# Note: This requires 'df_encoded' to be a version of 'df' where categorical data has been encoded numerically
sns.heatmap(df_encoded.corr(), annot=True, cmap="coolwarm") # Plot the heatmap of correlations with annotations
plt.title("Correlation Matrix") # Set the title for the correlation matrix plot
plt.show() # Display the plot
Feature Selection
Features are selected based on correlation analysis, domain knowledge, and potential interactions with other features.
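As a rough sketch of the correlation-based part of that screening, reusing df_encoded and NumPy from earlier (the 0.9 threshold is an illustrative choice, not part of the original analysis):
# Flag features that are highly correlated with another feature (illustrative threshold of 0.9)
corr = df_encoded.drop('party_affiliation', axis=1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # Keep only the upper triangle of the matrix
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]  # Columns correlated above the threshold
print("Candidate features to drop:", to_drop)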
Model Selection and Training
For this task, we'll use three models, each offering unique advantages:
- Logistic Regression: Useful for baseline performance with interpretability in binary classification.
- Random Forest: Captures non-linear relationships through ensemble decision trees.
- Gradient Boosting: Sequentially builds trees, focusing on residual errors to improve accuracy.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Split data into features and target
X = df_encoded.drop('party_affiliation', axis=1)
y = df_encoded['party_affiliation']
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize models
log_reg = LogisticRegression(random_state=42, max_iter=500)
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)
# Train and evaluate each model
models = {'Logistic Regression': log_reg, 'Random Forest': rf, 'Gradient Boosting': gb}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(classification_report(y_test, y_pred))
Model Explanations in Detail:
- Logistic Regression: A baseline model that is interpretable and straightforward, capturing linear relationships between features and the target.
- Random Forest: A powerful ensemble method that reduces variance by averaging over multiple decision trees, which helps prevent overfitting.
- Gradient Boosting: Builds trees sequentially, with each tree correcting the errors of the previous ones, making it especially strong on datasets with complex interactions.
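Because a single train/test split can be noisy, a minimal cross-validation sketch, reusing X, y, and the models dictionary defined above, gives a more stable estimate of each model's accuracy:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: mean CV accuracy {scores.mean():.2f} (+/- {scores.std():.2f})")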
Model Evaluation and Metrics
Evaluation metrics are essential for scoring model performance. The focus is on the following (a short computation sketch follows this list):
- Accuracy: A measure of overall performance.
- Precision and Recall: Critical in election predictions, to avoid misclassifying party affiliations.
- F-Measure: Balances precision and recall and is useful in the case of imbalanced data.
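As a minimal sketch of computing these metrics directly, assuming y_test and the y_pred left over from the last model in the training loop above (Gradient Boosting):
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
# Precision, recall, F1, and the confusion matrix for the last fitted model's predictions
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))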
Model Interpretability and Feature Importance
SHAP for Model Interpretability
SHAP, which stands for Shapley Additive Explanations, reveals which features are driving predictions made by our model. By attributing feature contributions to predictions, SHAP helps to unlock the black box of ensemble methods.
import shap
# Initialize a SHAP explainer for the Gradient Boosting model
# This calculates the SHAP values, which explain the influence of each feature on predictions
explainer = shap.TreeExplainer(gb)
# Calculate SHAP values for the test set
# SHAP values represent the impact of each feature on the predictions for each instance in the test set
shap_values = explainer.shap_values(X_test)
# Display a SHAP summary plot
# The summary plot provides a bar chart of mean absolute SHAP values, showing feature importance
# 'plot_type="bar"' displays the average magnitude of feature impacts across the test set
shap.summary_plot(shap_values, X_test, plot_type="bar", feature_names=X.columns)
SHAP Analysis Explanation
The SHAP summary plot reveals which features contribute the most to the model's predictions. For example, if age and income show the highest SHAP values, we interpret these as key predictors for party affiliation in our model. This interpretability is especially crucial for political analysis, where understanding voter behaviour factors is as important as accurate prediction.
Conclusion
Summary
In this exercise, we simulated the prediction of voting outcomes with machine learning models trained on synthetic data. Each model offered its own insights:
Overall summary of selected models based on evaluation metrics:
Logistic Regression
Accuracy: 0.53
Insight: Logistic Regression achieved the best accuracy among the models (only slightly better than random guessing), with balanced precision and recall across classes. Although interpretable, it did not generalize well in this context, suggesting it cannot capture the non-linear and complex relationships the data may hold.
Random Forest
Accuracy: 0.50
Insight: Random Forest achieved an accuracy equivalent to random guessing (0.50), with balanced precision and recall, and could not surpass Logistic Regression. This suggests either overfitting or that the synthetic features simply carry little predictive power.
Gradient Boosting
Accuracy: 0.48
Insight: Gradient Boosting performed the worst, with an accuracy of 0.48, indicating that it could not learn meaningful patterns from the synthetic dataset, which lacks sophisticated features for it to exploit. Precision and recall for each class were nearly equal to those of Logistic Regression, indicating limited discriminative power.
All models exhibited weak predictive performance, most likely due to the overly simplistic nature of the synthetic data (the features and target were generated independently of one another) and the small dataset size. Accuracy scores near random guessing indicate ample room for richer, more informative features to improve predictive performance in a real-world setting. This experiment underscores the importance of quality data collection and robust feature engineering when modelling complex events such as elections.
Key Takeaways
- Feature Importance: With the models performing weakly, feature importance analysis can still show which features contributed, however marginally, to the predictions. SHAP values help identify the most influential features, such as age, income, and political alignment, even though none were strong predictors for this synthetic dataset. This underlines the need for informative, influential features in political prediction models; variables like income and media consumption alone may not capture the complexities of voting behaviour.
- Model Performance: All three models, Logistic Regression, Random Forest, and Gradient Boosting, demonstrated very little predictive ability, with accuracies near random guessing (0.50). Even Logistic Regression's top accuracy of 0.53 indicates that the models struggled to find meaningful patterns in the simulated data.
- Model Selection: Logistic Regression had the best overall performance, although the differences among models were not significant. This suggests that, for a simple dataset, a simpler model is a reasonable choice. Random Forest and Gradient Boosting did not deliver the expected improvement, likely due to overfitting or a lack of high-quality features.
- Interpretability and Metrics: All models showed similar precision and recall across classes, but the low F1 scores suggest that none succeeded in distinguishing them. The poor performance points to the need for further feature engineering or more informative data to increase interpretability and predictive power in election forecasts.
Future Directions
For real-world applications, several improvements are needed:
- Real-World Data: Use real-world voters' demographic data and polling data for greater accuracy.
- Advanced Techniques: Consider more advanced models, such as XGBoost or neural networks, to capture more subtle interactions (a minimal sketch follows this list).
- Time-Series Analysis: Use historical voting data and time trends in polling to capture the development of attitudes among voters.
- Incorporate NLP: Gauge public opinion from social media or news articles to include sentiment among the predictive features.
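As a minimal sketch of the advanced-model direction, assuming the xgboost package is installed and reusing the train/test split from earlier (hyperparameters are illustrative, not tuned):
from xgboost import XGBClassifier  # Requires: pip install xgboost
# Fit an XGBoost classifier on the same split and report test accuracy
xgb_model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
print(f"XGBoost Accuracy: {accuracy_score(y_test, xgb_model.predict(X_test)):.2f}")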
This exercise shows how data science and machine learning can reveal insights into election outcomes. Although we used synthetic data for demonstration, the same techniques serve as valuable aids for understanding and forecasting voting patterns and for generating predictive insight in high-stakes decision-making.
Note: This analysis is educational. In practice, predicting an election outcome depends on comprehensive datasets, rigorous validation, and many external contextual factors.
Published via Towards AI