Preparing time-series data to build a Pollution Forecasting Model with Python
Author(s): Fabiana Clemente
Originally published on Towards AI.
A step-by-step guide
This article was co-authored by Ian Spektor, Lead Machine Learning Engineer @ Tryolabs, who is passionate about data science and time-series data.
The adoption of AI and the application of Machine Learning is one of the most strategic initiatives for many organizations across different verticals. With rapid technological change and an ever-increasing influx of data, the ability to foresee and adapt to changing circumstances has become a huge advantage for businesses. For that reason, it is crucial for data practitioners not only to be able to develop and train forecasting models but also to build effective forecasting systems!
Time-series forecasting, which predicts future values from existing historical data, is an important skill to master in the data science field. Whether it's predicting stock prices, anticipating consumer behavior, optimizing supply chains, or addressing environmental concerns, time-series forecasting empowers organizations and individuals to navigate uncertainty with foresight and precision.
However, to build a successful forecasting model, it is crucial to fully understand the characteristics of our time-series data and to optimize its preparation for the forecasting model to be used and the end application. These concerns bring us to the topic of this blog post: a step-by-step tutorial on how to prepare a pollution dataset for building a forecasting model.
Analyzing and understanding your time-series data can be time-consuming, and finding and fixing its issues to get to the juicy insights can turn into a brain maze! In this blog post, we will cover how to automate and interpret the exploratory data analysis of your time-series data with ydata-profiling, and how to easily manipulate and transform your data with Temporian, your pandas alternative for time series. The dataset used in this tutorial is the Pollution Dataset (License DbCL v1.0), which encompasses measurements of four major pollutants across the United States from 2000 to 2016.
The tooling
ydata-profiling
When it comes to understanding time-series data, many questions come to mind:
- Do we have several variables that are time-dependent?
- Are there several entities? For example, do we have one time series for each city in our data?
- Is my data stationary?
... and the list goes on. Truth be told, there are simply too many questions to address when it comes to time series, and answering them can become time-consuming, especially if we have to repeat the analysis for every single iteration of our data preparation!
ydata-profiling has been used for automated data profiling, but it also offers you the capability to automate your time-series analysis. Besides providing the usual EDA statistics and visualizations (histograms, min, max, standard deviation, cardinality, etc.), it empowers data science teams with the metrics needed to better understand data that has an underlying relation with time.
With a single line of code, you're able to get a full overview of your time-series characteristics, including data distributions, existing interactions, and correlations, while also being alerted to potential data quality issues in your data, from non-stationarity to seasonality, and even missing data gaps.
Temporian
Once you've understood your data, it's time to buckle down and start preparing it to be consumed by your forecasting model.
Preparing the data involves cleaning the data (for example, by removing or filling gaps and trimming or removing outliers), transforming it if needed (for example, by normalizing it, resampling it, or detrending it), and performing feature engineering.
Feature engineering, i.e., creating new features from our raw data to be fed to our model, is one of the most important steps in any machine-learning pipeline. Through it, we can transform our raw data into something that our model can consume and make sense of, for example, by creating new features that expose non-linear relationships between existing ones or that contain exogenous or static data that helps explain the target variable.
Traditionally, the most common approach has been to make use of a generic data preprocessing library, such as pandas or polars, for these tasks. These tools and their DataFrame or table-like APIs are geared toward tabular data but generic enough to be usable with time series data, too.
However, "usable" won't cut it: when working with large time-series datasets, pandas can not only be very slow, but also hard to write, and even harder to debug! That's why in this tutorial we'll be using Temporian, a new open-source library custom-made for safe, simple, and efficient preprocessing and feature engineering of temporal data.
Temporian replaces the DataFrame with the EventSet, a data structure that supports not only time-series data but also multivariate, multi-index, and non-uniformly sampled temporal data, such as that coming from several sensors, in several locations, sampled at different rates! On top of that, Temporian's operations run orders of magnitude faster than the alternatives, and it's extremely easy to integrate into any existing ML pipeline.
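To make the EventSet idea concrete, here is a minimal, self-contained sketch (the sensor names and values are made up for illustration and are not part of the pollution dataset) showing two entities sampled at different, non-uniform times:
import temporian as tp
# Toy example (not the pollution data): two sensors, indexed by "sensor",
# each with its own, non-uniform sampling.
toy = tp.event_set(
    timestamps=["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-01", "2020-01-03"],
    features={
        "sensor": ["a", "a", "a", "b", "b"],
        "value": [1.0, 2.0, 5.0, 10.0, 30.0],
    },
    indexes=["sensor"],
)
toy.plot()  # each sensor's events are plotted and processed independently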
The tutorial
Understanding the data
Before getting hands-on with the data exploration, we first need to install the ydata-profiling package:
pip install ydata-profiling
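Since the data preparation half of this tutorial relies on Temporian, you may also want to install it now (assuming a pip-based environment):
pip install temporian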
Then, we can start hypothesizing about the data and some particular concerns: "Are there missing values in the dataset? Does the data exhibit specific trends or irregularities? Are there calendar-related features (day, week, month, holidays) that should be considered?"
To start answering these questions, we can load the data and import the ProfileReport to produce a comprehensive report of all the characteristics of our dataset. Given that our dataset includes information for multiple sites across different states, counties, and cities, we will need to generate a profiling report for each one. Below, you can see an example of how to profile the data of the site in Scottsdale, Arizona:
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv("pollution_us_2000_2016.csv", index_col=[0])
# Select data from Arizona, Maricopa, Scottsdale (Site Num: 3003)
df_site = df[df['Site Num'] == 3003].reset_index(drop=True)
# Convert 'Date Local' to datetime
df_site['Date Local'] = pd.to_datetime(df_site['Date Local'])
Note that, according to the documentation, we'll need to pass the parameter `tsmode=True` to profile time-series data:
# Create the Profile Report
report = ProfileReport(df_site, tsmode=True, sortby="Date Local")
report.to_file('report_scottsdale.html')
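If you want one report per monitoring site rather than just Scottsdale, a simple option (our own addition, not part of the original code) is to loop over the unique 'Site Num' values:
# Hypothetical: generate one profiling report per monitoring site
for site in df['Site Num'].unique():
    df_s = df[df['Site Num'] == site].reset_index(drop=True)
    df_s['Date Local'] = pd.to_datetime(df_s['Date Local'])
    ProfileReport(df_s, tsmode=True, sortby='Date Local').to_file(f'report_site_{site}.html')
For the rest of the profiling walkthrough, we will focus on the Scottsdale report.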
The report outputs several informative details of the data, starting with a detailed overview:
We can immediately spot that this dataset has 28 features, 7840 observations/records, and 3.6% missing values overall. These missing values may indicate sensor failures affecting all of the time-series measurements or only specific ones. Although they might be worth investigating to understand the cause and fix the issue (e.g., replace a sensor in the Scottsdale station), given their low percentage (< 5%), we can also consider data imputation to infer the missing values and make sure our dataset is complete for analysis. Finally, out of the 28 features, 14 are time series, which we can investigate more deeply in the "Time Series" tab:
We can now determine that the 14 time series correspond to 4 measurements (Mean, 1st Max Value, 1st Max Hour, and AQI, which stands for Air Quality Index) taken from 3 main pollutants (NO2, O3, SO2), and additionally CO Mean and CO AQI. Furthermore:
- There are 7840 recorded measurements, corresponding to 9 years of data from January 2000 to December 2009;
- For each of the observed dates in our dataset, there are 2 to 4 registered events (depending on the pollutant) with different values for the pollutant measurements. According to the US EPA, the presence of several records for the same date is due to more than one monitor measuring the same pollutant at the same location (a quick way to check this is sketched below).
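A quick way to verify this multiplicity ourselves (an extra check we added, not part of the original walkthrough) is to count how many records share each date:
# Hypothetical check: number of records per date at the Scottsdale site
records_per_date = df_site.groupby('Date Local').size()
print(records_per_date.value_counts())  # expect counts between 2 and 4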
After exploring the general data descriptors, we can zoom in on specific features to further investigate their behavior, referring to their visualizations or the interactions and correlation plots. From here, we can find insightful relationships between features that help us select the best set of predictors for a particular task:
The analysis of the identified potential data quality issues is also a fundamental task. It will help us determine the necessary adjustments during the data preparation and feature engineering process. Let's investigate the SO2 Mean as an example:
The automatic detection of data quality issues highlights four main concerns:
- High Correlation: SO2 is highly correlated with other features in the data, namely SO2 1st Max Value and SO2 AQI. Having highly correlated features in the data makes it more difficult for the forecasting model to estimate the individual effect of each feature and leads to unstable predictions. For that reason, we might want to discard some of these features;
- Non-Stationarity and Seasonality: SO2 exhibits non-stationary behavior as well as seasonal trends, which makes the forecasting task more complex. Some forecasting models assume stationarity, and violations of this assumption can lead to inaccurate predictions. The same is valid for seasonality: failing to account for these periodic fluctuations can lead to the underestimation of future values, and the model won't be able to provide reliable predictions during specific periods (e.g., peak seasons or recurring events). The combination of these characteristics requires the application of appropriate feature engineering techniques (e.g., transforms, decomposition, filtering; a small differencing sketch follows this list) or suitable models that can handle these issues internally (e.g., ARIMA models, LSTMs);
- Zeros: Zeros in such a high number of records are not expected for the pollutants being measured in the data (it is unlikely to have a zero level of pollution even in best-case scenarios). They are most likely errors happening during sensor reading or data transmission.
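As a small illustration of such feature engineering (a sketch we added for clarity under the assumption of roughly daily data; it is not part of the original analysis), first-order and seasonal differencing are common transforms to tame non-stationarity and yearly seasonality:
# Hypothetical sketch: differencing the SO2 Mean series
so2_daily = df_site.groupby('Date Local')['SO2 Mean'].mean().asfreq('D')
so2_diff = so2_daily.diff()            # removes the slow-moving trend
so2_yearly_diff = so2_daily.diff(365)  # removes yearly seasonality for daily data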
Additionally, the line plot visualization of the data shows some gaps that seem to be repetitive. Looking into the "Gap Analysis", we can confirm that there are 23 gaps that appear to be cyclic: a deeper investigation reveals that in most years, data has not been collected between May and August. This helps us further account for the seasonality patterns of the data and decide whether to keep the intervals, impute the data, or segment the stream to develop our forecasting model.
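To locate these gaps directly in the raw data (a complementary check we added, also useful for deciding whether short gaps could simply be interpolated), we can look at the spacing between consecutive observed dates:
# Hypothetical gap check: days elapsed between consecutive observed dates
dates = df_site['Date Local'].drop_duplicates().sort_values()
gap_days = dates.diff().dt.days
print(gap_days[gap_days > 1])  # dates that are preceded by a gap
# Short gaps could be filled with, e.g., df_site['SO2 Mean'].interpolate(limit=3)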
A comprehensive data profiling sets the tone for efficient data preparation, which we will explore in the following step!
Time-Series Data Preparation with Temporian
As explained in the introduction, Temporian substitutes the generic DataFrame with the EventSet, a data structure tailor-made for temporal data.
An EventSet is, as its name indicates, a set of events. Each event consists of a timestamp (i.e., the time the event happened) and a set of feature values. In our dataset, for example, the timestamp will be the value of "Date Local", and each event will have associated with it the measured values for each pollutant on that date.
Note that this allows modeling daily time series (as is the case for this dataset), but also non-uniformly sampled or even asynchronous events, since there's no frequency restriction on an EventSet's timestamps.
One other important concept to touch on before we create our EventSet is that of indexes. Time-series datasets tend to contain data for several entities, as is our case with data from several "Site Nums", which in turn belong to several cities, counties, and states. This is known as a hierarchically structured dataset, since the entities follow a (multi-level) hierarchical order.
Temporian allows specifying indexes for an EventSet: a list of columns whose values identify different entities. When operating on an indexed EventSet, each entity's data is treated independently from the others. Additionally, Temporian makes it easy to move up and down the hierarchical structure, in this case allowing us, for example, to compute the moving average of "NO2 AQI" at the "State" level and use that as a feature at the "Site Num" level.
Without further ado, let's create our EventSet, with "Date Local" as our timestamps and "Site Num", "City", "County", and "State" as our index. We'll drop all other columns that aren't pollutant measurements.
Note that we're using the full dataset, with the data from all sites, not just Scottsdale.
import temporian as tp
df_numerical = df.drop(columns=['State Code', 'County Code', 'Address', 'NO2 Units', 'SO2 Units', 'CO Units', 'O3 Units'])
evset = tp.from_pandas(df_numerical, timestamps='Date Local', indexes=['State', 'County', 'City', 'Site Num']).cast(float)
evset
Displaying our EventSet yields some information we've already learned during the profiling stage, such as the number of features and their types, and the number of events in our dataset.
Note that the data from each of the 204 index values (i.e., each Site Num and its corresponding State, County, and City) is displayed independently.
Let's plot the EventSet.
evset.plot()
The plot is a bit crowded, though we can see that there are gaps, as we had discovered during our profiling. However, these sites seem to be missing data over different time spans than the one in Scottsdale did.
The missing periods could be a problem if we were working with a framework that assumes the data is a time series (i.e., uniformly sampled), which could end up feeding tons of data from the empty periods to our model. However, this isn't the case with Temporian: computation will only happen in the periods where data exists.
We can zoom in on a specific time range to take a closer look at the data. For this, we'll define two sample dates that we'll reuse later in this tutorial whenever we need to zoom in again.
Temporian's .before() and .after() operators come in handy for this task.
from datetime import datetime
example_date_from = datetime(2015, 7, 1)
example_date_to = datetime(2015, 9, 1)
evset.after(example_date_from).before(example_date_to).plot()
There doesn't seem to be much structure to the series, at least visually. Some do seem to follow a slight trend.
Temporian automatically chooses between a line plot and a markers plot, depending on whether the data is a time series (uniformly sampled) or not. Seeing markers in the plots means we are dealing with non-uniformly sampled data.
Not only that: as we had discovered during our initial analysis, it is visible that there are repeated dates in the data, i.e., several pollutant values for the same day at the same site. Looking at the plots, we see that these repeated values tend to be close to the other ones on the same day, so we'll aggregate them by averaging them together.
Note that Temporian ignores missing values when averaging, so it's safe to do this even when some values are missing.
The only features that don't always take similar values on the same day seem to be the "X 1st Max Hour" ones, where some values for the same day can be very close to 0 and others very close to 23. This occurs because the hour of the day is a cyclical feature, which means that the values 0 and 23, which are very far apart in the integer world, are very close in reality (11:00 p.m. one day, and 00:00 a.m. of the following day)!
If we were to average 23 and 0 together, the result would be close to 12, which is the exact opposite of the real value. To handle this, we'll first encode these cyclical features using this technique, which uses sine and cosine to map the feature onto a unit circle, preserving cyclic relationships.
import numpy as np
# Encoding for cyclical features, like hour of day
def encode_hour_cos(value: float):
    return np.cos((value * 2 * np.pi) / 24)

def encode_hour_sin(value: float):
    return np.sin((value * 2 * np.pi) / 24)

# Visualize the sin/cos values for each of the 24 hours
hours = list(range(0, 24))
x = list(map(encode_hour_cos, hours))
y = list(map(encode_hour_sin, hours))
df_hours = pd.DataFrame({'x': x, 'y': y})  # new name, to avoid overwriting the original df
df_hours.plot.scatter('x', 'y')
We can see how mapping the hour of day to this new space keeps consecutive hours next to each other, while also keeping 23 and 0 next to each other.
Now that we've checked this works as expected, we can go ahead and encode one of the "1st Max Hour" features with it and see the results.
# Encode a Max Hour column
co_1st_max_hour = evset['CO 1st Max Hour']
co_1st_max_hour_cos = co_1st_max_hour.map(encode_hour_cos).prefix('Cos')
co_1st_max_hour_sin = co_1st_max_hour.map(encode_hour_sin).prefix('Sin')
sincos = tp.glue(co_1st_max_hour_cos, co_1st_max_hour_sin)
# See the result
sincos.after(example_date_from).before(example_date_to).plot()
That looks alright: all values are now between -1 and 1. Time to aggregate all the same-day records by averaging their values.
With Temporian, this can be done by computing a moving average with an almost-zero window length, sampled once per each unique timestamp in our original data.
# Aggregate same-day, same-site records by averaging their values
def aggregate_avg(evset: tp.EventSet, sampling: tp.EventSet):
    return evset.simple_moving_average(window_length=tp.duration.shortest, sampling=sampling)

sincos_avgd = aggregate_avg(sincos, sincos.unique_timestamps())
# See the result of applying the aggregation to the CO 1st Max Hour sin/cos features we just created
sincos_avgd.after(example_date_from).before(example_date_to).plot()
Note that some plots are still displayed with markers instead of lines. This can be because of missing dates in the data, which makes it non-uniform, but it won't affect any of the computation that follows.
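The original article doesn't show the code that applies this same treatment to the full EventSet, so here is a plausible reconstruction (our assumption of what `evset_avgd`, used in the steps below, could look like): sine/cosine-encode every "1st Max Hour" feature, keep all other features as-is, and average all records that share the same timestamp.
# Assumed reconstruction of `evset_avgd` (this exact code is not in the original article)
hour_features = [f for f in evset.schema.feature_names() if '1st Max Hour' in f]
encoded = []
for feat in hour_features:
    encoded.append(evset[feat].map(encode_hour_cos).prefix('Cos'))
    encoded.append(evset[feat].map(encode_hour_sin).prefix('Sin'))
other_features = [f for f in evset.schema.feature_names() if f not in hour_features]
evset_clean = tp.glue(evset[other_features], *encoded)
# Average all same-day, same-site records, as done above for the CO sin/cos features
evset_avgd = aggregate_avg(evset_clean, evset_clean.unique_timestamps())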
Now that we've cleaned our raw data, it's time to create some rich features to feed our model.
Weβll be creating the following features:
- 1-day and 7-day lagged values of each raw feature
- 1-month moving average of each feature
- 1-month moving average of each feature at the State level (i.e., aggregating values from all sites in the same State)
- Calendar information
Let's kick off with the lags, using the .lag() operator. Note that we'll be using .prefix() to distinguish these features' names from the original ones, and .glue() to concatenate the created features together.
# Generate 1- and 7-day lags for each feature
lagged = []
for days in [1, 7]:
    lagged.append(
        evset_avgd.lag(tp.duration.days(days)).resample(evset_avgd).prefix(f'Lag {days}d ')
    )
lags = tp.glue(*lagged)
# Take a look at the generated lags for a single feature
tp.glue(evset_avgd['O3 Mean'], lags[['Lag 1d O3 Mean', 'Lag 7d O3 Mean']]).after(example_date_from).before(example_date_to).plot()
We can see the series move towards the right as we lag it forward in time, as expected.
Now for the 1-month moving average, which can be generated using .simple_moving_average() with a duration of tp.duration.weeks(4).
# Generate 1-month moving average
moving_avg = evset_avgd.simple_moving_average(tp.duration.weeks(4)).prefix('MovAvg 1M ')
# Take a look at the generated movavg for a single feature
tp.glue(evset_avgd['O3 Mean'], moving_avg['MovAvg 1M O3 Mean']).plot()
Averaging the series drowns out most of the noise, yielding a much steadier signal. Note that the averaged series still shows strong monthly seasonality; if we wanted to capture the underlying yearly trend of the data, we'd probably need an average with a much larger window.
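For instance (an illustrative one-liner we added, not from the original code), a one-year window would smooth the seasonality away almost entirely:
# Hypothetical: a 52-week moving average to expose the yearly trend
yearly_avg = evset_avgd.simple_moving_average(tp.duration.weeks(52)).prefix('MovAvg 1Y ')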
The State-level average is computed in almost exactly the same way; we only need to reindex the data to the State level before computing it, and then use .propagate() to broadcast the State-level result to each Site Num in that State.
# Generate 1-month moving average on the State level
evset_state = evset_avgd.drop_index(['County', 'City', 'Site Num'], keep=False) # Keep only State as index
moving_avg_state = evset_state.simple_moving_average(tp.duration.weeks(4)).prefix('MovAvg State 1M ')
moving_avg_state_prop = moving_avg_state.propagate(evset_avgd, resample=True) # Propagate (broadcast) state-level feature to the original Site num index
# Take a look at the generated moving avg for a single feature
tp.glue(evset_avgd['O3 Mean'], moving_avg_state_prop['MovAvg State 1M O3 Mean']).plot()
The State-level average looks very close to the Site-level series, probably an indicator that sites in the same state follow a common pattern. The State-level moving average will most likely be an even steadier indicator of the current status of the pollutants.
Finally, for the model to be able to capture the monthly seasonality we see in some of the plots above, we can use the calendar functions to add the year, month, day of month, and day of week each event happens on as features.
# Generate calendar info: year, month, day of month, day of week
calendar = tp.glue(
evset_avgd.calendar_year(),
evset_avgd.calendar_month(),
evset_avgd.calendar_day_of_month(),
evset_avgd.calendar_day_of_week(),
)
# Look at the generated features
calendar.after(example_date_from).before(example_date_to).plot()
Our daily data shows some expected patterns in the calendar features, such as a constant year (in the 2 months we are zoomed in on), a step in the month, and repeating spikes in the day of the month and day of the week.
We're done with feature engineering. Time to concatenate all the features together and export them to a DataFrame that most modeling frameworks can consume.
# Concatenate all features together and export to a DataFrame
all_feats = tp.glue(
evset_avgd,
lags,
moving_avg,
moving_avg_state_prop,
calendar
)
all_feats_df = tp.to_pandas(all_feats).drop(columns='timestamp')
columns = list(all_feats_df.columns)
print("# of features:", len(columns))
print(columns)
all_feats_df
Conclusion
That wraps up this tutorial! Let's quickly recap what we did:
- Produced a comprehensive report of our time-series dataset using ydata-profiling's ProfileReport()
- Learned how to analyze the report to get a detailed overview of our data and identify potential quality issues in it
- Created an indexed Temporian EventSet and did some preprocessing to aggregate duplicate values
- Created an array of rich features that will help a model trained on the final DataFrame learn the underlying patterns in the data
Now that you've finished this tutorial, we encourage you to train a model on the resulting features and evaluate your results! You can find an example of training a model on a dataset prepared with Temporian in this tutorial in the docs. It could prove interesting, for example, to compare the performance of a model trained on only some of the basic features (e.g., only the lagged values) vs. one trained on all of them.
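As a purely illustrative starting point (our own sketch, not from the original tutorial), here is a simple scikit-learn baseline that predicts the current 'O3 Mean' from only the lagged and calendar features; it assumes Temporian's default calendar feature names ('calendar_year', etc.), and a rigorous evaluation would use a chronological, per-site split instead of this quick one:
# Illustrative baseline (not from the original tutorial): predict 'O3 Mean'
# from the lagged and calendar features with a random forest.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

target = 'O3 Mean'
feature_cols = [c for c in all_feats_df.columns
                if c.startswith('Lag ') or c.startswith('calendar_')]
data = all_feats_df.dropna(subset=feature_cols + [target])

# shuffle=False keeps the row order; a rigorous evaluation would split by date per site
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_cols], data[target], test_size=0.2, shuffle=False)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print('R^2 on held-out rows:', model.score(X_test, y_test))
Comparing this against a version that also includes the moving-average features would give a first sense of how much each feature group contributes.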
We also invite you to further explore how data profiling can help you define a better data preparation strategy and build more robust models. Furthermore, deep dive into other time-series use cases and tutorials here.
About me
Passionate about data. Striving to develop data quality solutions that help data scientists adopt a more data-centric perspective for AI development.
CDO @ YData | LinkedIn