Can Machine Learning Predict Air Quality Before It Gets Dangerous?

Last Updated on April 26, 2025 by Editorial Team

Author(s): S Aishwarya

Originally published on Towards AI.

In recent years, air pollution has moved from being an abstract environmental issue to a daily reality for millions around the globe. From smog-filled cities to alarming health advisories, poor air quality affects how we live, breathe, and move.

But what if we could see it coming?

What if we could predict tomorrow’s Air Quality Index (AQI) just like we predict the weather, and take action before the air turns toxic?

That’s where machine learning steps in.

Photo by Amir Hosseini on Unsplash

🌍 Why Should We Predict Air Quality?

Air pollution isn’t just an inconvenience; it’s a health crisis. According to the WHO, it’s responsible for nearly 7 million premature deaths each year. High AQI levels are associated with asthma attacks, respiratory diseases, cardiovascular issues, and even cognitive decline.

If we can predict AQI accurately:

  • 🚨 Communities can receive early warnings.
  • 🏙️ Urban planners can design smarter, cleaner cities.
  • 🧍‍♂️ Individuals can decide when it’s safe to go for a jog or send kids to play outside.

So the million-dollar question is:

Can machine learning help us predict air quality in time to protect our health?

Let’s find out.

Dataset Overview

🌫️ Air Quality Data in India (2015–2020)

📌 Link: DATASET

📝 Overview

This dataset contains daily air quality data from major cities across India, collected between 2015 and 2020. It includes concentrations of various pollutants, meteorological parameters, and calculated AQI values.

📁 Files Included

  • city_day.csv: Daily air quality data per city
  • station_day.csv: Daily air quality data per station
  • stations.csv: Metadata for each monitoring station

We are going to use city_day.csv.

Each row in this file corresponds to a single day of air quality data for a specific city.

Image by Author

🧠 Our Approach: Classical ML vs Ensemble Learning

Let’s walk through a complete ML workflow using Python to forecast AQI, comparing two models: Linear Regression and Random Forest Regressor.

🔹 Linear Regression: simple, fast, and interpretable.
🔹 Random Forest Regressor: powerful, robust, and accurate.

Step 1: Loading the Data

We’re using the Air Quality in India dataset from Kaggle, which tracks pollutant levels and AQI values across Indian cities from 2015 to 2020.

import pandas as pd
df = pd.read_csv("/content/city_day.csv")

This dataset includes measurements for PM2.5, PM10, NO₂, CO, and other major pollutants: basically, the stuff that’s floating around in the air we breathe.
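Before cleaning anything, it helps to look at column types and count the gaps. The sketch below is only an illustration: it reads a few made-up rows from an in-memory string rather than the real file, and it uses just a subset of the column names mentioned in this article. The variable names (`sample`, `missing_per_column`, `rows_per_city`) are hypothetical.

```python
import io

import pandas as pd

# Hypothetical rows mimicking a few city_day.csv columns; all values are made up.
sample_csv = io.StringIO(
    "City,Date,PM2.5,PM10,NO2,CO,AQI,AQI_Bucket\n"
    "Delhi,2019-01-01,250.0,400.0,60.0,2.1,420.0,Severe\n"
    "Delhi,2019-01-02,,380.0,55.0,1.9,390.0,Very Poor\n"
    "Mumbai,2019-01-01,45.0,,30.0,0.8,,\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["Date"])

# Count missing values per column and rows per city.
missing_per_column = sample.isna().sum()
rows_per_city = sample.groupby("City").size()

print(sample.dtypes)
print(missing_per_column)
print(rows_per_city)
```

On the real file, the same two calls (`isna().sum()` and `groupby("City").size()`) give a quick picture of how much imputation the next step will have to do.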

🧹 Step 2: Cleaning Things Up

Like most real-world data, this one’s a bit messy. So, let’s tidy it up:

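df = df.dropna(subset=['AQI'])  # Drop rows with a missing target
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # Parse dates (the column is dropped below)
df.fillna(df.median(numeric_only=True), inplace=True)  # Fill numeric gaps with column medians
df = df.drop(columns=['Date', 'City', 'AQI_Bucket'])  # Remove columns we won't use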

  • Dropping rows with missing AQI values
  • Filling gaps with column medians
  • Removing columns we won’t use (like city names or categories)
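To see what these steps do, here is a minimal sketch on a four-row toy frame. The data and the name `toy` are invented for illustration; only the sequence of operations mirrors the cleaning code above.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing pollutant reading and one missing AQI value.
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "PM2.5": [10.0, np.nan, 30.0, 50.0],
    "AQI": [40.0, 80.0, np.nan, 160.0],
})

# 1. Drop rows where the target is missing (removes the third row).
toy = toy.dropna(subset=["AQI"])

# 2. Fill remaining numeric gaps with each column's median.
#    The PM2.5 median of the surviving values [10, 50] is 30.
medians = toy.median(numeric_only=True)
toy = toy.fillna(medians)

# 3. Drop the identifier column that the models will not use.
toy = toy.drop(columns=["City"])

print(toy)
```

Note that the median is computed after the rows with missing AQI are removed, so the dropped row's PM2.5 value of 30 plays no part in the imputation.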

🔍 Step 3: Finding What Really Affects AQI

Now it’s time to explore: what’s actually influencing air quality?

import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap with AQI")
plt.show()
Image by Author

🧪 Spoiler alert: PM2.5, PM10, NO₂, NO, and CO are some of the biggest troublemakers.
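A heatmap is nice to look at, but a sorted list is often easier to read. The sketch below ranks features by the absolute value of their correlation with AQI. It runs on synthetic data, where the "PM2.5" column drives the target and the "O3" column is pure noise by construction, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: AQI depends on pm25 only; o3 is unrelated noise.
pm25 = rng.uniform(10, 300, n)
o3 = rng.uniform(5, 100, n)
demo = pd.DataFrame({
    "PM2.5": pm25,
    "O3": o3,
    "AQI": 1.5 * pm25 + rng.normal(0, 20, n),
})

# Take the AQI column of the correlation matrix, drop AQI's correlation
# with itself, and sort by strength regardless of sign.
aqi_corr = demo.corr()["AQI"].drop("AQI").abs().sort_values(ascending=False)
print(aqi_corr)
```

Applied to the cleaned `df`, the same one-liner would list the pollutants in order of how strongly they move with AQI.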

🎯 Step 4: Feature Selection

We’ll focus on the top pollutants that have a high correlation with AQI.

selected_features = ['CO', 'PM2.5', 'NO2', 'SO2', 'NOx', 'PM10', 'NO']
X = df[selected_features]
y = df['AQI']

This helps keep the model lean and focused.

🧪 Step 5: Train-Test Split & Scaling

Let’s prep our data for the ML models:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

🔄 We split our data into training and testing sets
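📏 We standardize the features so they share a common scale. This keeps Linear Regression coefficients comparable across pollutants; Random Forest does not need scaling, so it is trained on the raw values.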
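A quick sanity check for this step, sketched on random stand-in arrays rather than the AQI data: the scaler is fitted on the training split only, so each training column ends up with a mean of about 0 and a standard deviation of about 1, while the test split is transformed with those same training statistics. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic features on very different scales, plus a synthetic target.
X_demo = np.column_stack([rng.uniform(0, 500, 200), rng.uniform(0, 5, 200)])
y_demo = rng.uniform(0, 400, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # learn mean/std from training data only
X_te_s = scaler_demo.transform(X_te)      # reuse the training mean/std

train_means = X_tr_s.mean(axis=0)
train_stds = X_tr_s.std(axis=0)
print("train means:", train_means)
print("train stds:", train_stds)
print("test means:", X_te_s.mean(axis=0))  # close to 0, but not exactly
```

Fitting the scaler on the full dataset instead would leak test-set statistics into training, which is why `fit_transform` is called on the training split alone.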

🤖 Step 6: Time to Train Our Models!

Let’s see what our two models can do.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
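lr = LinearRegression()
lr.fit(X_train_scaled, y_train)  # Linear model trained on standardized features
lr_preds = lr.predict(X_test_scaled)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)  # Tree ensembles are insensitive to feature scale, so raw features are used
rf_preds = rf.predict(X_test)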
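Once both models are fitted, it can be useful to peek at what each one learned. The sketch below does this on synthetic data with one strong and one weak feature (both invented for illustration): Linear Regression exposes `coef_`, and Random Forest exposes `feature_importances_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 400

# Synthetic target: mostly driven by column 0, only slightly by column 1.
X_syn = rng.uniform(0, 100, size=(n, 2))
y_syn = 2.0 * X_syn[:, 0] + 0.1 * X_syn[:, 1] + rng.normal(0, 1.0, n)

lr_syn = LinearRegression().fit(X_syn, y_syn)
rf_syn = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_syn, y_syn)

names = ["strong_feature", "weak_feature"]
coefs = dict(zip(names, lr_syn.coef_))                        # slope per feature
importances = dict(zip(names, rf_syn.feature_importances_))  # shares that sum to 1

print("Linear Regression coefficients:", coefs)
print("Random Forest importances:", importances)
```

On the AQI models, `lr.coef_` and `rf.feature_importances_` can be paired with `selected_features` in the same way.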

📊 Step 7: How Well Did They Do?

Let’s measure how close the predictions are to the actual AQI values.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
def evaluate_model(name, y_true, y_pred):
    print(f"\n{name} Results:")
    print(f"MAE: {mean_absolute_error(y_true, y_pred):.2f}")
    mse = mean_squared_error(y_true, y_pred)
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {np.sqrt(mse):.2f}")
    print(f"R2 Score: {r2_score(y_true, y_pred):.2f}")

    plt.figure(figsize=(8, 5))
    sns.scatterplot(x=y_true, y=y_pred, alpha=0.4)
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')
    plt.xlabel("Actual AQI")
    plt.ylabel("Predicted AQI")
    plt.title(f"{name} - Actual vs Predicted AQI")
    plt.grid(True)
    plt.show()

evaluate_model("Linear Regression", y_test, lr_preds)
evaluate_model("Random Forest", y_test, rf_preds)
Figures: Actual vs Predicted AQI scatter plots for Linear Regression and Random Forest (images by author)
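If the four metrics feel abstract, it may help to compute them by hand once. The sketch below uses four made-up AQI values and plain Python, following the standard definitions: MAE is the mean absolute error, MSE the mean squared error, RMSE its square root, and R² is one minus the ratio of residual to total sum of squares.

```python
import math

# Made-up values for illustration only.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 230.0]

n = len(actual)
errors = [p - a for a, p in zip(actual, predicted)]  # [10, -10, 10, -20]

mae = sum(abs(e) for e in errors) / n   # (10 + 10 + 10 + 20) / 4
mse = sum(e * e for e in errors) / n    # (100 + 100 + 100 + 400) / 4
rmse = math.sqrt(mse)

mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)                  # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")    # 12.50
print(f"MSE: {mse:.2f}")    # 175.00
print(f"RMSE: {rmse:.2f}")  # 13.23
print(f"R2: {r2:.3f}")      # 0.944
```

MAE and RMSE are in AQI units, so an RMSE of about 13 here means predictions are typically off by roughly 13 AQI points, with larger misses weighted more heavily than in MAE.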

Comparative Analysis: Linear Regression vs Random Forest

Image by Author

⚖️ When to Use What?

Choose Linear Regression if:

  • You want interpretable results
  • Your dataset is small or clean
  • You want a fast, lightweight model

Choose Random Forest if:

  • You need high accuracy
  • Your data has non-linear relationships
  • You’re okay with a black-box approach
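The point about non-linear relationships can be sketched on synthetic data. In the example below, the target is a noisy parabola of a single feature (an assumption chosen to make the contrast obvious, not a claim about AQI): a straight line cannot follow the curve, while a Random Forest can.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic non-linear relationship: y = x^2 plus a little noise.
x = rng.uniform(-3, 3, size=(600, 1))
y_curve = x[:, 0] ** 2 + rng.normal(0, 0.2, 600)

x_tr, x_te, y_tr, y_te = train_test_split(
    x, y_curve, test_size=0.25, random_state=1
)

lin_model = LinearRegression().fit(x_tr, y_tr)
rf_model = RandomForestRegressor(n_estimators=50, random_state=1).fit(x_tr, y_tr)

lin_r2 = r2_score(y_te, lin_model.predict(x_te))
rf_r2 = r2_score(y_te, rf_model.predict(x_te))

print(f"Linear Regression R2: {lin_r2:.2f}")
print(f"Random Forest R2: {rf_r2:.2f}")
```

When the relationship really is close to linear, the gap shrinks or disappears, and the simpler model's interpretability becomes the deciding factor.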

💡 Final Thoughts

Machine learning isn’t just about numbers on a screen; it’s about unlocking insights that can change lives.

By harnessing environmental data, we’re not only predicting air quality, we’re empowering people to act before it becomes dangerous. Whether it’s helping parents decide if it’s safe for their kids to play outside or aiding city planners in reducing pollution hotspots, every insight brings us one step closer to healthier communities.

From simple, transparent models like Linear Regression to the high-performing Random Forests, we’ve seen how different algorithms can serve different needs. But in the end, the goal is the same:

Better data → Smarter decisions → Cleaner air.

As we look ahead, imagine combining this with real-time sensor feeds, satellite data, or weather forecasts. The possibilities are vast, and they start with projects just like this.

Let’s keep innovating: one prediction, one breath, and one line of code at a time. 🌿💻🌍

Published via Towards AI
