Can Machine Learning Predict Air Quality Before It Gets Dangerous?
Last Updated on April 26, 2025 by Editorial Team
Author(s): S Aishwarya
Originally published on Towards AI.
In recent years, air pollution has moved from being an abstract environmental issue to a daily reality for millions around the globe. From smog-filled cities to alarming health advisories, poor air quality affects how we live, breathe, and move.
But what if we could see it coming?
What if we could predict tomorrow's Air Quality Index (AQI) just like we predict the weather, and take action before the air turns toxic?
That's where machine learning steps in.
🌍 Why Should We Predict Air Quality?
Air pollution isn't just an inconvenience; it's a health crisis. According to the WHO, it's responsible for nearly 7 million premature deaths each year. High AQI levels are associated with asthma attacks, respiratory diseases, cardiovascular issues, and even cognitive decline.
If we can predict AQI accurately:
- 🚨 Communities can receive early warnings.
- 🏙️ Urban planners can design smarter, cleaner cities.
- 🧍‍♂️ Individuals can decide when it's safe to go for a jog or send kids to play outside.
So the million-dollar question is:
Can machine learning help us predict air quality in time to protect our health?
Let's find out.
Dataset Overview
🌫️ Air Quality Data in India (2015–2020)
📌 Link: DATASET
📝 Overview
This dataset contains daily air quality data from major cities across India, collected between 2015 and 2020. It includes concentrations of various pollutants, meteorological parameters, and calculated AQI values.
📁 Files Included
- city_day.csv: Daily air quality data per city
- station_day.csv: Daily air quality data per station
- stations.csv: Metadata for each monitoring station
We are going to use city_day.csv.
Each row in this file corresponds to a single day of air quality data for a specific city.
🧠 Our Approach: Classical ML vs Ensemble Learning
Let's walk through a complete ML workflow using Python to predict AQI, comparing two models: Linear Regression and Random Forest Regressor.
🔹 Linear Regression: simple, fast, and interpretable.
🔹 Random Forest Regressor: powerful, robust, and often more accurate.
Step 1: Loading the Data
We're using the Air Quality in India dataset from Kaggle, which tracks pollutant levels and AQI values across Indian cities from 2015 to 2020.
import pandas as pd
df = pd.read_csv("/content/city_day.csv")
This dataset includes measurements for PM2.5, PM10, NO₂, CO, and other major pollutants: basically, the stuff that's floating around in the air we breathe.
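Before cleaning, it can help to check how large the table is and how much of each column is missing. The sketch below uses a tiny made-up frame that only mimics the layout of city_day.csv; the values and the names `sample` and `missing_share` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the layout of city_day.csv (values are illustrative only)
sample = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "Date": ["2019-01-01", "2019-01-02", "2019-01-01", "2019-01-02"],
    "PM2.5": [250.0, np.nan, 60.0, 55.0],
    "PM10": [400.0, 380.0, np.nan, 110.0],
    "AQI": [420.0, 390.0, np.nan, 120.0],
})

# Number of rows and columns
print(sample.shape)

# Share of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on the real `df` would show which pollutant columns have the most gaps before any imputation is applied.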
🧹 Step 2: Cleaning Things Up
Like most real-world data, this one's a bit messy. So, let's tidy it up:
df = df.dropna(subset=['AQI']) # Drop rows with missing target
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.fillna(df.median(numeric_only=True), inplace=True)
df = df.drop(columns=['Date', 'City', 'AQI_Bucket'])
- Dropping rows with missing AQI values
- Filling remaining gaps with column medians
- Removing columns we won't use (like the date, city names, or AQI categories)
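The median fill above uses one value per column across all cities. A possible alternative, shown here only as a sketch on made-up numbers, is to fill each gap with the median of its own city. The frame `toy` and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values: one heavily polluted city and one cleaner city
toy = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Mumbai"],
    "PM2.5": [200.0, np.nan, 300.0, 40.0, 60.0, np.nan],
    "AQI": [350.0, 380.0, 410.0, 90.0, 110.0, 100.0],
})

# Global median, as in the cleaning step: one fill value for every city
global_fill = toy["PM2.5"].fillna(toy["PM2.5"].median())

# Per-city median: each gap is filled from its own city's observed values
city_median = toy.groupby("City")["PM2.5"].transform("median")
per_city_fill = toy["PM2.5"].fillna(city_median)

print(pd.DataFrame({
    "City": toy["City"],
    "global": global_fill,
    "per_city": per_city_fill,
}))
```

In this toy case the global median sits between the two cities, so it understates the gap in one and overstates it in the other. Whether that matters on the real data would need to be checked.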
🔍 Step 3: Finding What Really Affects AQI
Now it's time to explore: what's actually influencing air quality?
import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap with AQI")
plt.show()
🧪 Spoiler alert: PM2.5, PM10, NO₂, NO, CO. These are some of the biggest troublemakers.
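A heatmap can be hard to read with many columns, so a sorted list of correlations with AQI is a handy companion. The sketch below runs on synthetic data (the columns and coefficients are invented for illustration); on the real frame the same line would be `df.corr()["AQI"]`.

```python
import numpy as np
import pandas as pd

# Synthetic data: AQI is built mostly from PM2.5, a little from NO2, and not at all from "Unrelated"
rng = np.random.default_rng(0)
n = 200
pm25 = rng.uniform(10, 300, n)
no2 = rng.uniform(5, 80, n)
demo = pd.DataFrame({
    "PM2.5": pm25,
    "NO2": no2,
    "Unrelated": rng.uniform(0, 1, n),
    "AQI": 1.2 * pm25 + 0.5 * no2 + rng.normal(0, 5, n),
})

# Correlation of every feature with the target, strongest first
aqi_corr = demo.corr()["AQI"].drop("AQI").sort_values(ascending=False)
print(aqi_corr.round(2))
```

Pearson correlation only captures linear association, so a low value here does not rule out a non-linear effect.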
🎯 Step 4: Feature Selection
We'll focus on the top pollutants that have a high correlation with AQI.
selected_features = ['CO', 'PM2.5', 'NO2', 'SO2', 'NOx', 'PM10', 'NO']
X = df[selected_features]
y = df['AQI']
This helps keep the model lean and focused.
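Note that these features describe the same day as the AQI target. If the aim is to predict tomorrow's AQI, one option is to shift the target back by one day within each city. This is only a sketch on made-up values: it assumes a frame that still has the City and Date columns (the cleaning step drops them) and that the days are consecutive. The column name `AQI_next_day` is hypothetical.

```python
import pandas as pd

# Made-up daily rows for two cities
daily = pd.DataFrame({
    "City": ["Delhi"] * 3 + ["Mumbai"] * 3,
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"] * 2),
    "PM2.5": [250.0, 230.0, 210.0, 60.0, 55.0, 70.0],
    "AQI": [420.0, 390.0, 360.0, 120.0, 110.0, 130.0],
})

# Sort so that shifting moves along time within each city
daily = daily.sort_values(["City", "Date"])

# Target: the AQI recorded on the following row (next day) of the same city
daily["AQI_next_day"] = daily.groupby("City")["AQI"].shift(-1)

# The last day of each city has no "tomorrow" and is dropped
forecast_rows = daily.dropna(subset=["AQI_next_day"])
print(forecast_rows)
```

With a target built this way, a chronological train-test split would be a more natural fit than a random one, since shuffled rows let the model see days after the ones it is tested on.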
🧪 Step 5: Train-Test Split & Scaling
Let's prep our data for the ML models:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
🔄 We split our data into training and testing sets
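📏 We scale the features so they share a common scale, which makes Linear Regression coefficients comparable across pollutants (the Random Forest is trained on the unscaled values, since trees don't need scaling)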
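One way to keep the scaler tied to the training split is to wrap it in a scikit-learn pipeline, so it is always fitted on training data only. The sketch below uses synthetic data and hypothetical names (`X_demo`, `scaled_lr`, `plain_lr`). It also compares against an unscaled fit: for plain least squares the two score the same, so the practical gain from scaling is comparable coefficients rather than accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 300, size=(300, 3))
y_demo = 1.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 10, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaler and model in one object: fit() learns the scaling from X_tr only
scaled_lr = make_pipeline(StandardScaler(), LinearRegression())
scaled_lr.fit(X_tr, y_tr)
r2_scaled = scaled_lr.score(X_te, y_te)

# Same model without scaling, for comparison
plain_lr = LinearRegression().fit(X_tr, y_tr)
r2_plain = plain_lr.score(X_te, y_te)

print(f"R2 with scaling: {r2_scaled:.4f}, without: {r2_plain:.4f}")
```

Scaling becomes more important for regularized models such as Ridge or Lasso, where the penalty depends on coefficient size.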
🤖 Step 6: Time to Train Our Models!
Let's see what our two models can do.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
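A fitted Random Forest also reports how much each feature contributed to its splits, which gives a rough view inside the model. The sketch below trains on synthetic data with invented columns; on the real model the equivalent would be `rf.feature_importances_` paired with `selected_features`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on PM2.5, weakly on NO2, not on "Unrelated"
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "PM2.5": rng.uniform(10, 300, 400),
    "NO2": rng.uniform(5, 80, 400),
    "Unrelated": rng.uniform(0, 1, 400),
})
target = 1.2 * features["PM2.5"] + 0.5 * features["NO2"] + rng.normal(0, 5, 400)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(features, target)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can be skewed when features are strongly correlated with each other, as pollutants often are, so they are best read as a rough ranking.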
📊 Step 7: How Well Did They Do?
Let's measure how close the predictions are to the actual AQI values.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
def evaluate_model(name, y_true, y_pred):
    # Print the error metrics for one model
    print(f"\n{name} Results:")
    print(f"MAE: {mean_absolute_error(y_true, y_pred):.2f}")
    mse = mean_squared_error(y_true, y_pred)
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {np.sqrt(mse):.2f}")
    print(f"R2 Score: {r2_score(y_true, y_pred):.2f}")
    # Scatter actual vs predicted AQI; the dashed red line marks a perfect prediction
    plt.figure(figsize=(8, 5))
    sns.scatterplot(x=y_true, y=y_pred, alpha=0.4)
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')
    plt.xlabel("Actual AQI")
    plt.ylabel("Predicted AQI")
    plt.title(f"{name} - Actual vs Predicted AQI")
    plt.grid(True)
    plt.show()
evaluate_model("Linear Regression", y_test, lr_preds)
evaluate_model("Random Forest", y_test, rf_preds)
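To make the four metrics concrete, here is a small worked example that computes them by hand with NumPy and checks them against scikit-learn. The four AQI values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted AQI values
actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 230.0])

errors = actual - predicted             # [-10, 10, -10, 20]

mae = np.mean(np.abs(errors))           # average size of the error: 12.5
mse = np.mean(errors ** 2)              # average squared error: 175.0
rmse = np.sqrt(mse)                     # back in AQI units

# R2 compares the model's squared error with that of always predicting the mean
ss_res = np.sum(errors ** 2)                     # 700
ss_tot = np.sum((actual - actual.mean()) ** 2)   # 12500
r2 = 1 - ss_res / ss_tot                         # 0.944

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f}")

# The hand-computed values agree with scikit-learn
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(mse, mean_squared_error(actual, predicted))
assert np.isclose(r2, r2_score(actual, predicted))
```

MAE and RMSE are in AQI units, so they are the easiest to relate to health thresholds. RMSE penalizes large misses more heavily than MAE does.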
Comparative Analysis: Linear Regression vs Random Forest
⚖️ When to Use What?
Choose Linear Regression if:
- You want interpretable results
- Your dataset is small or clean
- You want a fast, lightweight model
Choose Random Forest if:
- You need high accuracy
- Your data has non-linear relationships
- You're okay with a black-box approach
💡 Final Thoughts
Machine learning isn't just about numbers on a screen; it's about unlocking insights that can change lives.
By harnessing environmental data, we're not only predicting air quality, we're empowering people to act before it becomes dangerous. Whether it's helping parents decide if it's safe for their kids to play outside or aiding city planners in reducing pollution hotspots, every insight brings us one step closer to healthier communities.
From simple, transparent models like Linear Regression to the high-performing Random Forests, we've seen how different algorithms can serve different needs. But in the end, the goal is the same:
Better data → Smarter decisions → Cleaner air.
As we look ahead, imagine combining this with real-time sensor feeds, satellite data, or weather forecasts. The possibilities are vast, and they start with projects just like this.
Let's keep innovating: one prediction, one breath, and one line of code at a time. 🌿💻🌍
Published via Towards AI