
Enhancing The Robustness of Regression Model with Time-Series Analysis — Part 1

Last Updated on November 5, 2023 by Editorial Team

Author(s): Mirza Anandita

Originally published on Towards AI.


A case study on Singapore’s HDB resale prices. In this story, we demonstrate how to incorporate time-series analysis into linear regression, improving the model’s predictive power.

Photo by Muhammad Faiz Zulkeflee on Unsplash

Introduction

Located 1.5 hours away from home, Singapore always fascinates me. Surrounded by much larger neighbors, the tiny country has beaten the odds. From humble beginnings at independence, it has emerged as a major financial hub and international port. Despite its success, Singapore faces a major challenge — land scarcity. This issue has led to remarkably high housing prices.

Anticipating this issue, the Singaporean government has strived to provide adequate housing for its residents, establishing the Housing and Development Board (HDB) in 1960. Through the agency, the government has succeeded in providing quality, sanitary, and affordable public housing to more than 80% of the population, according to HDB’s official website.

Recognizing the importance of HDB, in this blog we will take a deep, data-driven look at Singapore’s HDB resale prices using a publicly available dataset. The dataset is intriguing because its wealth of information on resale prices and related variables makes it well suited to building a regression model.

Surprisingly, among all the excellent analyses I could find on Medium and Kaggle, few, if any, deeply address the time-series nature of this dataset. This gap in the analysis challenged me to address it here.

Therefore, in this story, our goal is to seamlessly incorporate time-series analysis and forecasting into regression. With that in mind, hopefully this perspective can also add fresh insights and improve the robustness of existing models.

Tools

The programming language I used to perform the analysis is Python, along with, but not limited to, the following libraries:

import pandas as pd # structured data manipulation
import numpy as np # scientific and numerical computing
import scipy # scientific and numerical computing
import matplotlib.pyplot as plt # data visualisation
import seaborn as sns # data visualisation
import statsmodels.api as sm # statistical analysis
import datetime as dt # manipulating date and time data

Data Extraction, Cleaning, and Preprocessing

(For the complete code behind this end-to-end analysis, from data extraction to building regression models, feel free to check my GitHub repository here.)

The HDB resale price dataset, publicly available and free to use, can be downloaded from this website. Once we have obtained the CSV file, we can load the data into a Pandas DataFrame, resulting in the following table:

# Load the dataset into dataframe 
# This is where I downloaded the dataset: https://data.gov.sg/dataset/resale-flat-prices
# I picked the time frame between Jan-2017 and Aug-2023

data = pd.read_csv('Housing_Resale_Dataset.csv')

data.head()
Raw Dataset — Image by Author

The raw dataset consists of 160,795 rows (transactions) and 11 columns (variables). Since we focus on predicting HDB resale prices, the column ‘resale_price’ becomes the dependent variable, or target, while the rest become independent variables, or features. While the dataset was already relatively neat, we performed some data scrubbing and preprocessing, including column renaming, extraction of new features, removal of duplicates, and rearrangement, to obtain a final clean dataset that looks like this:

cleandata.head()
Cleaned Dataset — Image by Author

From the image above, we see that information from the column ‘resale_date’ (previously named ‘month’ and not visible in the image) was extracted to create two new features, ‘resale_year’ and ‘resale_month’, giving us 13 columns in the clean dataset. At the same time, the number of rows decreased slightly to 160,454 as a result of duplicate removal.
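For reference, a minimal sketch of this preprocessing, assuming the raw column names from the data.gov.sg CSV (the exact steps live in the GitHub repository linked above), might look like this:

# rename the raw 'month' column to something more descriptive
cleandata = data.rename(columns={'month': 'resale_date'})

# parse the year-month strings (e.g., '2017-01') into proper datetimes
cleandata['resale_date'] = pd.to_datetime(cleandata['resale_date'])

# extract two new features from the resale date
cleandata['resale_year'] = cleandata['resale_date'].dt.year
cleandata['resale_month'] = cleandata['resale_date'].dt.month

# drop duplicated transactions
cleandata = cleandata.drop_duplicates().reset_index(drop=True)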

Exploratory Data Analysis

Next, we will create visualizations to uncover some of the most important information in our data. While I created the visuals mainly in a Jupyter Notebook using Matplotlib and Seaborn, for direct analysis and flexibility, in this part of the blog I also use images generated by Tableau for a polished, reader-friendly presentation.

Price Distribution

First things first, let’s take a look at the HDB resale prices in the histogram below:

Price Distribution — Image by Author
cleandata['resale_price'].describe()
Summary of Descriptive Statistics — Image by Author

With a bin size of 40,000, we observe that the ‘resale_price’ distribution is skewed to the right, with a mean of S$486,519 and a median of S$455,000.
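For reference, a similar histogram can be reproduced in a couple of lines with Seaborn; the bin width below is an assumption matching the figure above:

# reproduce the price distribution with a bin width of S$40,000
fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(cleandata['resale_price'], binwidth=40000, ax=ax)
ax.set_xlabel('Resale price (S$)')
ax.set_ylabel('Number of transactions')
plt.show()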

Average Price by Town

Using a bar chart, we can compare the average price of flats across towns. Purple bars highlight the five most expensive towns in terms of the average price of HDB flats. From this figure, you will also notice significant variation in average prices among towns.

Average Price by Town — Image by Author
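A quick, unpolished version of this chart can be sketched directly from the cleaned data (the Tableau figure above adds the top-five highlighting):

# average resale price by town, sorted for readability
town_price = (cleandata.groupby('town')['resale_price']
                       .mean()
                       .sort_values(ascending=False))

town_price.plot(kind='bar', figsize=(12, 5), ylabel='Average resale price (S$)')
plt.show()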

Monthly Average Price

Since our main goal in this blog is to showcase how time-series analysis can enhance regression models, it is imperative to show the time-series aspect of the data visualizations. Therefore, below is the monthly average price of HDB flats from January 2017 to August 2023. Note that the blue dashed line represents the actual monthly average price, while the purple line represents the 6-month moving average.

Monthly Average Price — Image by Author

If we look closely at the blue line, we see a zig-zag pattern, which may indicate that our dataset exhibits monthly seasonality. The purple line, on the other hand, shows the trend of the data: the series is non-stationary with a positive trend, where the monthly average price dropped in mid-2018 and stabilized at the beginning of 2019, before starting to spike in mid-2020.
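For reference, the 6-month moving average in the figure can be computed with a simple pandas rolling window; a minimal sketch:

# monthly average price and its 6-month moving average
monthly_avg = cleandata.groupby('resale_date')['resale_price'].mean()
moving_avg = monthly_avg.rolling(window=6).mean()

fig, ax = plt.subplots(figsize=(12, 5))
monthly_avg.plot(ax=ax, linestyle='--', label='Monthly average')
moving_avg.plot(ax=ax, label='6-month moving average')
ax.legend()
plt.show()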

Monthly Transactions

The image below shows the monthly transactions from January 2017 to August 2023. The graph also shows that the transaction data exhibits seasonality: around December and January, monthly transactions usually drop. We can also see an anomaly where transactions dropped precipitously in the first quarter of 2020. It is safe to assume that this sharp drop was caused by the Covid-19 lockdown.

Monthly Transactions — Image by Author
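The transaction counts behind this figure come straight from a groupby; for example:

# number of resale transactions per month
monthly_txn = cleandata.groupby('resale_date').size()
monthly_txn.plot(figsize=(12, 5), ylabel='Number of transactions')
plt.show()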

Time-Series Analysis

Photo by Jake Hills on Unsplash

As mentioned in the introduction, our focus in this story is to understand the pricing of HDB flats. Thus, as a prerequisite for time-series analysis, we first construct our time-series dataset by creating a Pandas series of monthly average prices, indexed by ‘resale_date’.

price_time = cleandata.groupby('resale_date')['resale_price'].mean()
price_time.head()
Monthly Average Price — Image by Author

Decomposition

From the monthly average price visualization, we observed the trend and, potentially, the seasonality of the data. Next, we break the series down into its constituent components, i.e., trend, seasonality, and residual, using time-series decomposition.

import statsmodels.api as sm

# We use 'additive' model as the seasonality seemingly doesn't magnify over time
# we added 6 periods of extrapolation so the trend extends to the first and last 6 months

price_decomp = sm.tsa.seasonal_decompose(price_time, model = 'additive', extrapolate_trend=6)
price_decomp.plot();
Time-series decomposition — Image by Author

Through a simple moving average, we already saw the trend of our time-series data. Meanwhile, by performing decomposition, we can also tell that the data does have seasonality within a 12-month cycle. From the graph above, it is apparent that the seasonality is small relative to the mean price, with a range between -4,000 and 2,500. To make it easier to see, we will isolate the seasonality graph and change the line plot into a bar chart:

price_season = price_decomp.seasonal # seasonality 
price_season_month = price_season.reset_index()
price_season_month['resale_month'] = price_season_month['resale_date'].dt.strftime('%b') # extract months information

months_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

price_season_month['resale_month'] = pd.Categorical(price_season_month['resale_month'], categories= months_order, ordered = True) # create months order on data
pm = price_season_month.groupby('resale_month')['seasonal'].mean()
pm.plot(kind = 'bar')
Seasonality of monthly average price — Image by Author

Data Splitting

Before building a model that can be evaluated for its reliability, we employ a method referred to as data splitting. This process involves partitioning the data into two parts — training dataset and testing dataset. We use the training dataset to build the time-series model, and then use the testing dataset to evaluate the model’s performance. Thus, in the context of time-series modeling, the training dataset acts as “the past”, while the testing dataset acts as “the future”.

# we use log transformation because the model performs better this way
transform = np.log
reverse_transform = np.exp # inverse of the log transform, needed later for evaluation
price_time = price_time.apply(transform)

# Train-Test Split
price_test = price_time['Sep 2022':] # testing dataset (last 12 months)
price_train = price_time[:'Aug 2022'] # training dataset (preceding data)

For our dataset, the timeframe spans from January 2017 to August 2023. We select the last 12 months of the data as the testing dataset and the preceding data as the training dataset. One might wonder about the reasoning behind choosing 12 months. Although this selection is somewhat arbitrary, it serves a practical purpose: a 12-month period provides a window long enough to observe the behaviour of our predictions, yet short enough that the forecast does not diverge too much from the actual data.

Time-Series Modelling

From the information we obtained previously, we could then build a time-series model. Eventually, our goal is to leverage our understanding of the model to predict how our dataset might behave in the future. This process is referred to as forecasting.

There are many different time-series modelling methods. Classical methods include moving average, exponential smoothing, and ARIMA (Autoregressive Integrated Moving Average), while more advanced techniques include Facebook Prophet and LSTM networks. In this particular context, the classical methods work fairly well, and among them, SARIMA (Seasonal Autoregressive Integrated Moving Average), an extension of ARIMA, performs best.

To build a SARIMA model, we need to determine our parameters: p, d, and q of the ARIMA component, and P, D, Q, and s of the seasonal component. An explanation of the parameters can be read here and here. Although there are more rigorous ways to determine the parameters, in this particular instance we use the pragmatic approach of a grid search and choose the model with the lowest resulting AIC (Akaike Information Criterion) score.

from statsmodels.tsa.statespace.sarimax import SARIMAX # import SARIMA function from statsmodels

s = 12 # the seasonality is made up of a 12-month cycle

# note that running this grid search can be computationally expensive and take some time,
# so turn the code off once you have obtained the desired parameters
for p in range(1, 4):
    for d in range(0, 3):
        for q in range(0, 4):
            for P in range(1, 4):
                for D in range(0, 3):
                    for Q in range(0, 4):
                        model = SARIMAX(price_train, order=(p, d, q), seasonal_order=(P, D, Q, s))
                        results = model.fit()
                        print(f'ARIMA{p},{d},{q}-SARIMA{P},{D},{Q}-{s} - AIC: {results.aic}')

For this data, the optimal parameters are 3, 1, and 3 for the ARIMA component and 1, 1, 3, and 12 for the seasonal component. In the code block below, we fit the training dataset to a model using the Statsmodels library.

p,d,q = 3,1,3
P,D,Q,s = 1,1,3,12
sarima_model = SARIMAX(price_train, order=(p, d, q), seasonal_order=(P, D, Q, s))
sarima_result = sarima_model.fit()

Forecasting

After we obtain the model, the next thing we need to do is check the residual plot. As shown in the image below, for reasons I still cannot explain, the residual values at the first and 13th indices always diverge extremely far from zero, no matter how I adjust the model’s parameters.

sarima_result.resid.plot()
Residual plot of the model — Image by Author

Nevertheless, the rest of the data behaves normally. Therefore, to address this issue, I decided to analyse the model using data beyond the 13th index. This pragmatic approach has allowed me to work with the more stable part of the dataset and avoid the impact of extreme outliers.

sarima_result.resid[13:].plot()
Residual plot of the model, without outliers — Image by Author

So, after building the SARIMA model, the next step is to create predictions on both the training and testing datasets, and then plot them together with the actual data for a visual evaluation.
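The plot below was produced from those predictions; a minimal sketch of how they might be generated (train_predict and test_predict are reused in the evaluation that follows) is:

# in-sample prediction over the training window
train_predict = sarima_result.predict(start=price_train.index[0],
                                      end=price_train.index[-1])

# out-of-sample forecast over the 12-month testing window
test_predict = sarima_result.forecast(steps=len(price_test))

# plot predictions against the actual (log-transformed) series
fig, ax = plt.subplots(figsize=(12, 5))
price_time.plot(ax=ax, linestyle='--', label='Actual')
train_predict.plot(ax=ax, label='Train prediction')
test_predict.plot(ax=ax, label='Test forecast')
ax.legend()
plt.show()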

Actual vs Prediction, Time-series analysis — Image by Author

Model Evaluation

From the visual above, we can see remarkable similarity between the predictions, on both the training and testing datasets, and the actual values. The predictions also follow the actual values closely in terms of pattern and trend. Consequently, upon visual examination, we can conclude that our SARIMA model predicts the training dataset well. Moreover, this accuracy extends to the testing dataset, which means the model generalizes reasonably well to unseen data, at least over a short period (in our case, 12 months).

In addition to visual examination, we can also use certain metrics to assess the accuracy of our time-series model. One of those metrics is the mean absolute error (MAE). We calculate MAE by averaging the absolute differences between the predicted values and the true values in our dataset. In essence, MAE measures how closely, on average, our model’s predictions align with the true values. A lower MAE suggests that our model’s predictions are close to the actual data, while a higher MAE means that our model exhibits larger errors. To calculate MAE, we can simply use the function provided by the Scikit-learn library.
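Formally, for $n$ data points with predictions $\hat{y}_i$ and true values $y_i$:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$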

from sklearn.metrics import mean_absolute_error as MAE

# do not forget to reverse our log transformation
MAE_train = MAE(price_train[13:].apply(reverse_transform),train_predict[13:].apply(reverse_transform))
MAE_test = MAE(price_test.apply(reverse_transform),test_predict.apply(reverse_transform))

print(f'MAE of training dataset = {MAE_train:.2f}')
print(f'MAE of testing dataset = {MAE_test:.2f}')
Mean Absolute Errors — Image by Author

From the calculation above, we get MAEs for the training and testing datasets of S$5,748 and S$7,549, respectively. To put these values into perspective, we can compare them with the average of our time-series data, S$481,070. The MAEs are small relative to the average monthly price, indicating that our model exhibits only minor deviations from the true values.

Furthermore, the MAE for the testing dataset is only slightly higher than that of the training dataset. This minuscule difference further confirms that our time-series model can generalize well to unseen data, again, at least in short-term periods.

To be continued…

Finally, as we have completed our time-series model, we are ready to get into the second part of this story. In the next segment, I will show you how to leverage the insights we have gained from time-series analysis and use them to improve the robustness of the regression model.

Further Reading:

[1] Housing and Development Board, HDB History and Towns, 2023. [Online]. Available: https://www.hdb.gov.sg/cs/infoweb/about-us/history [Accessed: October 30th, 2023]

[2] J. Brownlee, How to Create an ARIMA Model for Time Series Forecasting in Python, 2020. [Online]. Available: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/ [Accessed: October 30th, 2023]

[3] J. Brownlee, A Gentle Introduction to SARIMA for Time Series Forecasting in Python, 2019. [Online]. Available: https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python/ [Accessed: October 30th, 2023]

