Can Machine Learning Predict Air Quality Before It Gets Dangerous?
Last Updated on April 26, 2025 by Editorial Team
Author(s): S Aishwarya
Originally published on Towards AI.
In recent years, air pollution has moved from being an abstract environmental issue to a daily reality for millions around the globe. From smog-filled cities to alarming health advisories, poor air quality affects how we live, breathe, and move.
But what if we could see it coming?
What if we could predict tomorrow’s Air Quality Index (AQI) just like we predict the weather — and take action before the air turns toxic?
That’s where machine learning steps in.
🌍 Why Should We Predict Air Quality?
Air pollution isn’t just an inconvenience — it’s a health crisis. According to the WHO, it’s responsible for nearly 7 million premature deaths each year. High AQI levels are associated with asthma attacks, respiratory diseases, cardiovascular issues, and even cognitive decline.
If we can predict AQI accurately:
- 🚨 Communities can receive early warnings.
- 🏙️ Urban planners can design smarter, cleaner cities.
- 🧍♂️ Individuals can decide when it’s safe to go for a jog or send kids to play outside.
So the million-dollar question is:
Can machine learning help us predict air quality in time to protect our health?
Let’s find out.
Dataset Overview
🌫️ Air Quality Data in India (2015–2020)
📌 Link: DATASET
📝 Overview
This dataset contains daily air quality data from major cities across India, collected between 2015 and 2020. It includes concentrations of various pollutants, meteorological parameters, and calculated AQI values.
📁 Files Included
- city_day.csv: Daily air quality data per city
- station_day.csv: Daily air quality data per station
- stations.csv: Metadata for each monitoring station
We are going to use city_day.csv.
Each row in this file corresponds to a single day of air quality data for a specific city.

🧠 Our Approach: Classical ML vs Ensemble Learning
Let’s walk through a complete ML workflow using Python to forecast AQI, comparing two models: Linear Regression and Random Forest Regressor.
🔹 Linear Regression — simple, fast, and interpretable.
🔹 Random Forest Regressor — powerful, robust, and accurate.
Step 1: Loading the Data
We’re using the Air Quality in India dataset from Kaggle, which tracks pollutant levels and AQI values across Indian cities from 2015 to 2020.
import pandas as pd
df = pd.read_csv("/content/city_day.csv")
This dataset includes measurements for PM2.5, PM10, NO₂, CO, and other major pollutants — basically, the stuff that’s floating around in the air we breathe.
🧹 Step 2: Cleaning Things Up
Like most real-world data, this one’s a bit messy. So, let’s tidy it up:
df = df.dropna(subset=['AQI']) # Drop rows with missing target
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.fillna(df.median(numeric_only=True), inplace=True)
df = df.drop(columns=['Date', 'City', 'AQI_Bucket'])
- Dropping rows with missing AQI values
- Filling remaining gaps with column medians
- Removing columns we won’t use (like city names or categories)
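To make those three steps concrete, here is a minimal sketch on a tiny hand-made frame. The values are invented for illustration; only the column names come from the code above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for city_day.csv; all values are made up for illustration.
toy = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "Date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"],
    "PM2.5": [180.0, np.nan, 40.0, 60.0],
    "AQI": [350.0, 300.0, np.nan, 110.0],
    "AQI_Bucket": ["Very Poor", "Very Poor", None, "Moderate"],
})

# 1. Drop rows where the target (AQI) is missing.
cleaned = toy.dropna(subset=["AQI"]).copy()

# 2. Fill remaining numeric gaps with each column's median.
medians = cleaned.median(numeric_only=True)
cleaned = cleaned.fillna(medians)

# 3. Remove columns that will not be used as features.
cleaned = cleaned.drop(columns=["Date", "City", "AQI_Bucket"])

print(cleaned)
```

In this toy frame the missing PM2.5 value is replaced by 120.0, the median of the two remaining readings (180 and 60).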
🔍 Step 3: Finding What Really Affects AQI
Now it’s time to explore — what’s actually influencing air quality?
import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap with AQI")
plt.show()

🧪 Spoiler alert: PM2.5, PM10, NO₂, NO, CO — these are some of the biggest troublemakers.
🎯 Step 4: Feature Selection
We’ll focus on the top pollutants that have a high correlation with AQI.
selected_features = ['CO', 'PM2.5', 'NO2', 'SO2', 'NOx', 'PM10', 'NO']
X = df[selected_features]
y = df['AQI']
This helps keep the model lean and focused.
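If you would rather derive the feature list from the data than type it by hand, one option is to rank columns by their absolute correlation with AQI and keep those above a cut-off. The sketch below uses synthetic data, and the `threshold` value is a hypothetical choice, not something taken from this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic pollutants; the relationship to AQI is invented for illustration.
pm25 = rng.uniform(10, 300, n)
no2 = rng.uniform(5, 100, n)
unrelated = rng.uniform(0, 1, n)  # independent of AQI by construction
aqi = 1.2 * pm25 + 2.0 * no2 + rng.normal(0, 10, n)

toy = pd.DataFrame(
    {"PM2.5": pm25, "NO2": no2, "Unrelated": unrelated, "AQI": aqi}
)

# Absolute Pearson correlation of each feature with the target, strongest first.
corr_with_aqi = (
    toy.corr()["AQI"].drop("AQI").abs().sort_values(ascending=False)
)

threshold = 0.3  # hypothetical cut-off
selected = corr_with_aqi[corr_with_aqi >= threshold].index.tolist()

print(corr_with_aqi.round(2))
print("Selected features:", selected)
```

Keep in mind that Pearson correlation only measures linear association, so a pollutant with a strongly non-linear effect on AQI could be ranked low by this method.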
🧪 Step 5: Train-Test Split & Scaling
Let’s prep our data for the ML models:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
🔄 We split our data into training and testing sets
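📏 We scale the features for Linear Regression so its coefficients are comparable across pollutants measured on very different ranges. Random Forest is trained on the unscaled features, since tree splits do not depend on feature scale.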
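One thing worth knowing: the scaler here is fitted on the training set only, but the median fill in Step 2 was computed on the full dataset, so a little test-set information reaches the training data. A possible alternative, sketched below on synthetic data, is to put imputation and scaling inside a scikit-learn pipeline so both are learned from the training split alone. This is an assumption about how one might restructure the workflow, not the approach used in this article.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and target; shapes and coefficients are illustrative only.
X = rng.uniform(0, 200, size=(300, 3))
y = 1.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 5, 300)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Imputer and scaler are fitted on the training split only,
# then applied unchanged to the test split at prediction time.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X_train, y_train)

print(f"Test R2: {model.score(X_test, y_test):.2f}")
```

The same pipeline object can be passed to cross-validation utilities, which then repeat the fit-on-train-only logic for every fold.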
🤖 Step 6: Time to Train Our Models!
Let’s see what our two models can do.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
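Once both models are fitted, you can peek at what each one considers important. The sketch below uses synthetic data with made-up feature effects. It reads `coef_` from the linear model (on standardized features, so each value is the AQI change per one standard deviation) and `feature_importances_` from the forest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = ["PM2.5", "NO2", "Unrelated"]

# Synthetic data: AQI depends strongly on PM2.5 and weakly on NO2.
X = pd.DataFrame(rng.uniform(0, 200, size=(500, 3)), columns=features)
y = 1.5 * X["PM2.5"] + 0.3 * X["NO2"] + rng.normal(0, 5, 500)

X_scaled = StandardScaler().fit_transform(X)
lr = LinearRegression().fit(X_scaled, y)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

summary = pd.DataFrame(
    {
        # Change in predicted AQI per one standard deviation of the feature.
        "lr_coef_per_std": lr.coef_,
        # Impurity-based importances; they sum to 1 across features.
        "rf_importance": rf.feature_importances_,
    },
    index=features,
)
print(summary.round(3))
```

The two columns are on different scales and answer slightly different questions, so compare the rankings rather than the raw numbers.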
📊 Step 7: How Well Did They Do?
Let’s measure how close the predictions are to the actual AQI values.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
def evaluate_model(name, y_true, y_pred):
    print(f"\n{name} Results:")
    print(f"MAE: {mean_absolute_error(y_true, y_pred):.2f}")
    mse = mean_squared_error(y_true, y_pred)
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {np.sqrt(mse):.2f}")
    print(f"R2 Score: {r2_score(y_true, y_pred):.2f}")
    plt.figure(figsize=(8, 5))
    sns.scatterplot(x=y_true, y=y_pred, alpha=0.4)
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')
    plt.xlabel("Actual AQI")
    plt.ylabel("Predicted AQI")
    plt.title(f"{name} - Actual vs Predicted AQI")
    plt.grid(True)
    plt.show()
evaluate_model("Linear Regression", y_test, lr_preds)
evaluate_model("Random Forest", y_test, rf_preds)
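If the four metrics feel abstract, it can help to compute them by hand once. The sketch below uses three made-up AQI readings whose predictions are each off by exactly 10 and compares the manual results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Three made-up AQI readings; every prediction is off by exactly 10.
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 190.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                   # (10 + 10 + 10) / 3 = 10
mse = np.mean(errors ** 2)                      # (100 + 100 + 100) / 3 = 100
rmse = np.sqrt(mse)                             # sqrt(100) = 10
ss_res = np.sum(errors ** 2)                    # 300
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 2500 + 0 + 2500 = 5000
r2 = 1 - ss_res / ss_tot                        # 1 - 300 / 5000 = 0.94

# Side-by-side check against scikit-learn.
print("MAE :", mae, mean_absolute_error(y_true, y_pred))
print("MSE :", mse, mean_squared_error(y_true, y_pred))
print("RMSE:", rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
print("R2  :", round(r2, 2), round(r2_score(y_true, y_pred), 2))
```

MAE and RMSE are in AQI units, so they read as "typical error in AQI points", while R2 describes the share of variance in AQI that the model explains.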


Comparative Analysis: Linear Regression vs Random Forest

⚖️ When to Use What?
Choose Linear Regression if:
- You want interpretable results
- Your dataset is small or clean
- You want a fast, lightweight model
Choose Random Forest if:
- You need high accuracy
- Your data has non-linear relationships
- You’re okay with a black-box approach
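To see the "non-linear relationships" point in action, here is a small sketch that cross-validates both models on a synthetic target that depends on the square of its input. The data is deliberately artificial and not related to the AQI dataset; it only illustrates why a straight-line fit struggles where a tree ensemble does not.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic, deliberately non-linear target: y depends on x squared.
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, 300)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Mean R2 over 5 folds for each model on the same data.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean R2 = {score:.2f}")
```

On real data the gap is usually less dramatic, so running both models through the same cross-validation is a reasonable way to decide whether the extra complexity pays off.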
💡 Final Thoughts
Machine learning isn’t just about numbers on a screen — it’s about unlocking insights that can change lives.
By harnessing environmental data, we’re not only predicting air quality — we’re empowering people to act before it becomes dangerous. Whether it’s helping parents decide if it’s safe for their kids to play outside or aiding city planners in reducing pollution hotspots, every insight brings us one step closer to healthier communities.
From simple, transparent models like Linear Regression to the high-performing Random Forests, we’ve seen how different algorithms can serve different needs. But in the end, the goal is the same:
Better data → Smarter decisions → Cleaner air.
As we look ahead, imagine combining this with real-time sensor feeds, satellite data, or weather forecasts. The possibilities are vast — and they start with projects just like this.
Let’s keep innovating — one prediction, one breath, and one line of code at a time. 🌿💻🌍