Can Machine Learning Predict Air Quality Before It Gets Dangerous?

Last Updated on April 26, 2025 by Editorial Team

Author(s): S Aishwarya

Originally published on Towards AI.

In recent years, air pollution has moved from being an abstract environmental issue to a daily reality for millions around the globe. From smog-filled cities to alarming health advisories, poor air quality affects how we live, breathe, and move.

But what if we could see it coming?

What if we could predict tomorrow’s Air Quality Index (AQI) just like we predict the weather — and take action before the air turns toxic?

That’s where machine learning steps in.

Can Machine Learning Predict Air Quality Before It Gets Dangerous? — Photo by Amir Hosseini on Unsplash

🌍 Why Should We Predict Air Quality?

Air pollution isn’t just an inconvenience — it’s a health crisis. According to the WHO, it’s responsible for nearly 7 million premature deaths each year. High AQI levels are associated with asthma attacks, respiratory diseases, cardiovascular issues, and even cognitive decline.

If we can predict AQI accurately:

🚨 Communities can receive early warnings.
🏙️ Urban planners can design smarter, cleaner cities.
🧍‍♂️ Individuals can decide when it’s safe to go for a jog or send kids to play outside.

So the million-dollar question is:

Can machine learning help us predict air quality in time to protect our health?

Let’s find out.

Dataset Overview

🌫️ Air Quality Data in India (2015–2020)

📌 Link: DATASET

📝 Overview

This dataset contains daily air quality data from major cities across India, collected between 2015 and 2020. It includes concentrations of various pollutants and calculated AQI values.

📁 Files Included

city_day.csv: Daily air quality data per city
station_day.csv: Daily air quality data per station
stations.csv: Metadata for each monitoring station

We are going to use city_day.csv.

Each row in this file corresponds to a single day of air quality data for a specific city.

🧠 Our Approach: Classical ML vs Ensemble Learning

Let’s walk through a complete ML workflow using Python to forecast AQI, comparing two models: Linear Regression and Random Forest Regressor.

🔹 Linear Regression — simple, fast, and interpretable.
🔹 Random Forest Regressor — powerful, robust, and accurate.

Step 1: Loading the Data

We’re using the Air Quality in India dataset from Kaggle, which tracks pollutant levels and AQI values across Indian cities from 2015 to 2020.

import pandas as pd
df = pd.read_csv("/content/city_day.csv")

This dataset includes measurements for PM2.5, PM10, NO₂, CO, and other major pollutants — basically, the stuff that’s floating around in the air we breathe.
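Before cleaning anything, it can help to see how much of each column is actually missing. The sketch below uses a hypothetical helper, `missing_share`, on a tiny made-up frame that borrows a few column names from the dataset; the values are invented for illustration.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Made-up sample rows for illustration only (not real measurements).
sample = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "PM2.5": [120.0, np.nan, 45.0, 50.0],
    "PM10": [np.nan, np.nan, 80.0, np.nan],
    "AQI": [250.0, 240.0, np.nan, 95.0],
})

report = missing_share(sample)
print(report)
```

With the real file loaded, the same call would be `missing_share(df)`.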

🧹 Step 2: Cleaning Things Up

Like most real-world data, this one’s a bit messy. So, let’s tidy it up:

df = df.dropna(subset=['AQI']) # Drop rows with missing target
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.fillna(df.median(numeric_only=True), inplace=True)
df = df.drop(columns=['Date', 'City', 'AQI_Bucket'])

Dropping missing AQI values
Filling gaps with column medians
Removing columns we won’t use (the date, city name, and AQI category)
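One caveat: the `fillna` call above computes medians over every row, including rows that later land in the test set. A possible alternative, sketched here as an assumption rather than the article's method, is to split first and learn the medians from the training rows only with scikit-learn's `SimpleImputer`. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pollutant columns (values are invented).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "PM2.5": rng.uniform(10, 300, size=50),
    "NO2": rng.uniform(5, 120, size=50),
})
X.loc[::7, "PM2.5"] = np.nan  # knock out some values to mimic gaps
y = pd.Series(rng.uniform(50, 400, size=50), name="AQI")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the medians from the training rows only, then reuse them on the test rows.
imputer = SimpleImputer(strategy="median")
X_train_filled = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_filled = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)
```

The test rows are filled with values they had no part in computing, which keeps the evaluation honest.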

🔍 Step 3: Finding What Really Affects AQI

Now it’s time to explore — what’s actually influencing air quality?

import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap with AQI")
plt.show()

🧪 Spoiler alert: PM2.5, PM10, NO₂, NO, and CO show some of the strongest correlations with AQI.
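A heatmap is easy to scan, but a sorted list makes the ranking explicit. Here is a minimal sketch on synthetic data, where AQI is deliberately built to depend strongly on PM2.5 and only weakly on O3. The variable names are illustrative, and correlation only measures linear association, not cause.

```python
import numpy as np
import pandas as pd

# Synthetic data: AQI is built to depend strongly on PM2.5 and weakly on O3.
rng = np.random.default_rng(1)
n = 200
pm25 = rng.uniform(10, 300, size=n)
o3 = rng.uniform(5, 80, size=n)
noise = rng.normal(0, 10, size=n)
demo = pd.DataFrame({
    "PM2.5": pm25,
    "O3": o3,
    "AQI": 1.2 * pm25 + 0.1 * o3 + noise,
})

# Correlation of every feature with AQI, strongest absolute value first.
aqi_corr = demo.corr()["AQI"].drop("AQI")
ranking = aqi_corr.reindex(aqi_corr.abs().sort_values(ascending=False).index)
print(ranking)
```

On the real data, replacing `demo` with the cleaned `df` would give the same kind of ranking for all pollutants.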

🎯 Step 4: Feature Selection

We’ll focus on the top pollutants that have a high correlation with AQI.

selected_features = ['CO', 'PM2.5', 'NO2', 'SO2', 'NOx', 'PM10', 'NO']
X = df[selected_features]
y = df['AQI']

This helps keep the model lean and focused.
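Note that these features and the target come from the same day, so the model estimates today's AQI from today's pollutant readings. To forecast tomorrow's value, one option (an assumption here, not something the article does) is to shift the target by one day within each city. That needs the City and Date columns, so it would have to run before they are dropped in Step 2. The rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration; the real file has many more columns and days.
daily = pd.DataFrame({
    "City": ["Delhi"] * 3 + ["Mumbai"] * 3,
    "Date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03"] * 2
    ),
    "PM2.5": [150.0, 180.0, 90.0, 40.0, 55.0, 60.0],
    "AQI": [300.0, 340.0, 200.0, 80.0, 100.0, 110.0],
})

# Sort so that shifting moves along time within each city.
daily = daily.sort_values(["City", "Date"]).reset_index(drop=True)

# Target for each row is the AQI recorded on the following row of the same city.
daily["AQI_next_day"] = daily.groupby("City")["AQI"].shift(-1)

# The last day of each city has no "tomorrow", so it is dropped.
forecast_frame = daily.dropna(subset=["AQI_next_day"])
X_forecast = forecast_frame[["PM2.5", "AQI"]]
y_forecast = forecast_frame["AQI_next_day"]
```

This sketch assumes each city has one row per consecutive day; if days are missing, the "next row" is not always the next calendar day and would need an extra check.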

🧪 Step 5: Train-Test Split & Scaling

Let’s prep our data for the ML models:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

🔄 We split our data into training and testing sets
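📏 We scale the features so they share a common range. Plain least-squares predictions do not change with scaling, but the coefficients become easier to compare, and regularized or distance-based models depend on it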
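One way to keep the scaler and the model together is a scikit-learn pipeline, which fits the scaler on the training rows only and reuses it at prediction time. The sketch below uses synthetic features on very different scales. It also checks that, for ordinary least squares, standardizing leaves the predictions unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (values are invented).
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.uniform(0, 5, size=100),      # CO-like magnitudes
    rng.uniform(10, 400, size=100),   # PM10-like magnitudes
])
y = 30 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 5, size=100)
X_train, X_test = X[:80], X[80:]
y_train = y[:80]

# The pipeline fits the scaler on the training rows only and reuses it at predict time.
scaled_lr = make_pipeline(StandardScaler(), LinearRegression())
scaled_lr.fit(X_train, y_train)

# Plain least squares on the raw features, for comparison.
raw_lr = LinearRegression().fit(X_train, y_train)

same_predictions = np.allclose(
    scaled_lr.predict(X_test), raw_lr.predict(X_test)
)
print(same_predictions)
```

The practical gain from scaling here is comparable coefficients and a setup that carries over unchanged to models such as Ridge or k-nearest neighbors, where scale does matter.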

🤖 Step 6: Time to Train Our Models!

Let’s see what our two models can do.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Linear Regression is trained on the scaled features
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)

# Random Forest is trained on the raw features; tree splits do not depend on feature scale
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
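Once both models are trained, it is worth looking at what each one relies on. The sketch below, on synthetic stand-in data, reads the linear model's coefficients (on standardized features) and the forest's impurity-based importances. The names `lr_demo` and `rf_demo` are illustrative and separate from the models trained above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: AQI depends on PM2.5 and, more weakly, on NO2.
rng = np.random.default_rng(3)
features = pd.DataFrame({
    "PM2.5": rng.uniform(10, 300, size=300),
    "NO2": rng.uniform(5, 120, size=300),
    "SO2": rng.uniform(2, 60, size=300),  # deliberately unrelated to the target
})
target = (
    1.1 * features["PM2.5"]
    + 0.5 * features["NO2"]
    + rng.normal(0, 8, size=300)
)

# Coefficients on standardized features are comparable across columns.
scaled = StandardScaler().fit_transform(features)
lr_demo = LinearRegression().fit(scaled, target)
coefficients = pd.Series(lr_demo.coef_, index=features.columns)

# Impurity-based importances from the forest; they sum to 1.
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(features, target)
importances = pd.Series(rf_demo.feature_importances_, index=features.columns)

print(coefficients.sort_values(ascending=False))
print(importances.sort_values(ascending=False))
```

For the article's models, the equivalent would be `lr.coef_` and `rf.feature_importances_`, indexed by `selected_features`.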

📊 Step 7: How Well Did They Do?

Let’s measure how close the predictions are to the actual AQI values.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def evaluate_model(name, y_true, y_pred):
    # Print the error metrics for one model
    print(f"\n{name} Results:")
    print(f"MAE: {mean_absolute_error(y_true, y_pred):.2f}")
    mse = mean_squared_error(y_true, y_pred)
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {np.sqrt(mse):.2f}")
    print(f"R2 Score: {r2_score(y_true, y_pred):.2f}")

    # Scatter actual vs predicted; the red dashed line marks a perfect prediction
    plt.figure(figsize=(8, 5))
    sns.scatterplot(x=y_true, y=y_pred, alpha=0.4)
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')
    plt.xlabel("Actual AQI")
    plt.ylabel("Predicted AQI")
    plt.title(f"{name} - Actual vs Predicted AQI")
    plt.grid(True)
    plt.show()

evaluate_model("Linear Regression", y_test, lr_preds)
evaluate_model("Random Forest", y_test, rf_preds)
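Since the goal is an early warning, it can also help to ask how often the model flags dangerous days correctly. The sketch below defines a hypothetical helper, `alert_metrics`, that treats an AQI above 200 (the start of the "Poor" band on India's national AQI scale) as the alert level. Both the threshold choice and the sample numbers are assumptions for illustration.

```python
import numpy as np


def alert_metrics(y_true, y_pred, threshold=200.0):
    """Precision and recall for flagging days whose AQI exceeds a threshold."""
    actual = np.asarray(y_true) > threshold
    flagged = np.asarray(y_pred) > threshold
    true_pos = np.sum(actual & flagged)
    precision = true_pos / flagged.sum() if flagged.sum() else 0.0
    recall = true_pos / actual.sum() if actual.sum() else 0.0
    return float(precision), float(recall)


# Invented numbers, purely to show the calculation.
actual_aqi = [90, 210, 350, 180, 260]
predicted_aqi = [100, 190, 330, 220, 240]

precision, recall = alert_metrics(actual_aqi, predicted_aqi)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

With the article's results, the call would be `alert_metrics(y_test, rf_preds)`. Recall is the number to watch if missing a dangerous day is worse than raising a false alarm.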

Comparative Analysis: Linear Regression vs Random Forest

⚖️ When to Use What?

Choose Linear Regression if:

You want interpretable results
Your dataset is small or clean
You want a fast, lightweight model

Choose Random Forest if:

You need high accuracy
Your data has non-linear relationships
You’re okay with a black-box approach
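A single train-test split can flatter one model by chance. To make the choice on your own data, one option is to compare both models with k-fold cross-validation. This is a minimal sketch on synthetic data that includes a non-linear term; the data and the dictionary names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a non-linear term, so the two models can differ.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(300, 3))
y = 5 * X[:, 0] + 3 * np.sin(2 * X[:, 1]) * X[:, 2] + rng.normal(0, 1, size=300)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# 5-fold cross-validated R2 for each candidate model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2")
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are close, the simpler and more interpretable model is usually the easier one to justify.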

💡 Final Thoughts

Machine learning isn’t just about numbers on a screen — it’s about unlocking insights that can change lives.

By harnessing environmental data, we’re not only predicting air quality — we’re empowering people to act before it becomes dangerous. Whether it’s helping parents decide if it’s safe for their kids to play outside or aiding city planners in reducing pollution hotspots, every insight brings us one step closer to healthier communities.

From simple, transparent models like Linear Regression to the high-performing Random Forests, we’ve seen how different algorithms can serve different needs. But in the end, the goal is the same:

Better data → Smarter decisions → Cleaner air.

As we look ahead, imagine combining this with real-time sensor feeds, satellite data, or weather forecasts. The possibilities are vast — and they start with projects just like this.

Let’s keep innovating — one prediction, one breath, and one line of code at a time. 🌿💻🌍

Published via Towards AI
