
Enhancing The Robustness of Regression Model with Time-Series Analysis — Part 1

Last Updated on November 5, 2023 by Editorial Team

Author(s): Mirza Anandita

Originally published on Towards AI.


A case study on Singapore’s HDB resale prices. In this story, we demonstrate how to incorporate time-series analysis into linear regression, improving the model’s predictive power.

Photo by Muhammad Faiz Zulkeflee on Unsplash

Introduction

Located 1.5 hours away from home, Singapore always fascinates me. Surrounded by much larger neighbors, the tiny country has beaten the odds. From humble beginnings at independence, it has emerged as a major financial hub and international port. Despite its success, Singapore faces a major challenge — land scarcity. This issue has led to remarkably high housing prices.

Anticipating this issue, the Singaporean government has strived to provide adequate housing for its residents, establishing the Housing and Development Board (HDB) in 1960. Through the agency, the government has succeeded in providing quality, sanitary, and affordable public housing to more than 80% of the population, according to HDB’s official website.

Recognizing the importance of HDB, in this blog we will take a deep, data-driven look at Singapore’s HDB resale prices using a publicly available dataset. The dataset is intriguing because its wealth of information on resale prices and related variables makes it well suited to building a regression model.

Surprisingly, among all the excellent analyses I could find on Medium and Kaggle, few, if any, deeply address the time-series nature of this dataset. This gap in the analysis challenged me to address it here.

Therefore, in this story, our goal is to seamlessly incorporate time-series analysis and forecasting into regression. With that in mind, hopefully this perspective can also add fresh insights and improve the robustness of existing models.

Tools

The programming language I used to perform the analysis is Python, along with, but not limited to, the following libraries:

import pandas as pd # structured data manipulation
import numpy as np # scientific and numerical computing
import scipy # scientific and numerical computing
import matplotlib.pyplot as plt # data visualisation
import seaborn as sns # data visualisation
import statsmodels.api as sm # statistical analysis
import datetime as dt # manipulating date and time data

Data Extraction, Cleaning, and Preprocessing

(For the complete code behind this end-to-end analysis, from data extraction to building regression models, feel free to check my GitHub repository here.)

The HDB resale price dataset, publicly available and free to use, can be downloaded from this website. Once we have obtained the CSV file, we can load the data into a Pandas DataFrame, resulting in the following table:

# Load the dataset into dataframe 
# This is where I downloaded the dataset: https://data.gov.sg/dataset/resale-flat-prices
# I picked the time frame between Jan-2017 and Aug-2023

data = pd.read_csv('Housing_Resale_Dataset.csv')

data.head()
Raw Dataset — Image by Author

The raw dataset consists of 160,795 rows (transactions) and 11 columns (variables). Since we focus on predicting HDB resale prices, the column ‘resale_price’ becomes the dependent variable, or target, while the rest become independent variables, or features. While the dataset was already relatively neat, we performed some data scrubbing and preprocessing, including column renaming, extraction of new features, removal of duplicates, and rearrangement, to obtain a final clean dataset that looks like this:

cleandata.head()
Cleaned Dataset — Image by Author

From the image above, we see that information from the column ‘resale_date’ (previously named ‘month’ and not visible in the image) was extracted to create two new features, ‘resale_year’ and ‘resale_month’, giving us 13 columns in the clean dataset. At the same time, the number of rows decreased slightly to 160,454 as a result of duplicate removal.
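For reference, a minimal sketch of this preprocessing, assuming the raw column names from the data.gov.sg CSV (the exact steps live in the GitHub repository linked above), might look like this:

# rename the raw 'month' column to something more descriptive
cleandata = data.rename(columns={'month': 'resale_date'})

# parse the year-month strings (e.g., '2017-01') into proper datetimes
cleandata['resale_date'] = pd.to_datetime(cleandata['resale_date'])

# extract two new features from the resale date
cleandata['resale_year'] = cleandata['resale_date'].dt.year
cleandata['resale_month'] = cleandata['resale_date'].dt.month

# drop duplicated transactions
cleandata = cleandata.drop_duplicates().reset_index(drop=True)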

Exploratory Data Analysis

Next, we will create visualizations to uncover some of the most important information in our data. While I created the visuals mainly in a Jupyter Notebook using Matplotlib and Seaborn, for direct analysis and flexibility, in this part of the blog I also use images generated by Tableau for a polished, reader-friendly presentation.

Price Distribution

First things first, let’s take a look at the HDB resale prices in the histogram below:

Price Distribution — Image by Author
cleandata['resale_price'].describe()
Summary of Descriptive Statistics — Image by Author

With a bin size of 40,000, we observe that the ‘resale_price’ distribution is skewed to the right, with a mean of S$486,519 and a median of S$455,000.
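For reference, a similar histogram can be reproduced in a couple of lines with Seaborn; the bin width below is an assumption matching the figure above:

# reproduce the price distribution with a bin width of S$40,000
fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(cleandata['resale_price'], binwidth=40000, ax=ax)
ax.set_xlabel('Resale price (S$)')
ax.set_ylabel('Number of transactions')
plt.show()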

Average Price by Town

Using a bar chart, we can compare the average price of flats across towns. Purple bars highlight the five most expensive towns in terms of the average price of HDB flats. From this figure, you will also notice significant variation in average prices among towns.

Average Price by Town — Image by Author
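A quick, unpolished version of this chart can be sketched directly from the cleaned data (the Tableau figure above adds the top-five highlighting):

# average resale price by town, sorted for readability
town_price = (cleandata.groupby('town')['resale_price']
                       .mean()
                       .sort_values(ascending=False))

town_price.plot(kind='bar', figsize=(12, 5), ylabel='Average resale price (S$)')
plt.show()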

Monthly Average Price

Since our main goal in this blog is to showcase how time-series analysis can enhance regression models, it is imperative to show the time-series aspect of the data visualizations. Therefore, below is the monthly average price of HDB flats from January 2017 to August 2023. Note that the blue dashed line represents the actual monthly average price, while the purple line represents the 6-month moving average.

Monthly Average Price — Image by Author

If we look closely at the blue line, we see a zig-zag pattern, which may indicate that our dataset exhibits monthly seasonality. The purple line, on the other hand, shows the trend of the data: the series is non-stationary with a positive trend, where the monthly average price dropped in mid-2018 and stabilized at the beginning of 2019, before starting to spike in mid-2020.
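For reference, the 6-month moving average in the figure can be computed with a simple pandas rolling window; a minimal sketch:

# monthly average price and its 6-month moving average
monthly_avg = cleandata.groupby('resale_date')['resale_price'].mean()
moving_avg = monthly_avg.rolling(window=6).mean()

fig, ax = plt.subplots(figsize=(12, 5))
monthly_avg.plot(ax=ax, linestyle='--', label='Monthly average')
moving_avg.plot(ax=ax, label='6-month moving average')
ax.legend()
plt.show()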

Monthly Transactions

The image below shows the monthly transactions from January 2017 to August 2023. The graph also shows that the transaction data exhibits seasonality: around December and January, monthly transactions usually drop. We can also see an anomaly where transactions dropped precipitously in the first quarter of 2020. It is safe to assume that this sharp drop was caused by the Covid-19 lockdown.

Monthly Transactions — Image by Author
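The transaction counts behind this figure come straight from a groupby; for example:

# number of resale transactions per month
monthly_txn = cleandata.groupby('resale_date').size()
monthly_txn.plot(figsize=(12, 5), ylabel='Number of transactions')
plt.show()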

Time-Series Analysis

Photo by Jake Hills on Unsplash

As mentioned in the introduction, our focus in this story is to understand the pricing of HDB flats. Thus, as a prerequisite for time-series analysis, we first construct our time-series dataset by creating a Pandas series of monthly average prices, indexed by ‘resale_date’.

price_time = cleandata.groupby('resale_date')['resale_price'].mean()
price_time.head()
Monthly Average Price — Image by Author

Decomposition

From the monthly average price visualization, we observed the trend and, potentially, the seasonality of the data. Next, we break the series down into its constituent components, i.e., trend, seasonality, and residual, using time-series decomposition.

import statsmodels.api as sm

# We use 'additive' model as the seasonality seemingly doesn't magnify over time
# we added 6 periods of extrapolation so the trend extends to the first and last 6 months

price_decomp = sm.tsa.seasonal_decompose(price_time, model = 'additive', extrapolate_trend=6)
price_decomp.plot();
Time-series decomposition — Image by Author

Through a simple moving average, we already saw the trend of our time-series data. Meanwhile, by performing decomposition, we can also tell that the data does have seasonality within a 12-month cycle. From the graph above, it is apparent that the seasonality is small relative to the mean price, with a range between -4,000 and 2,500. To make it easier to see, we will isolate the seasonality graph and change the line plot into a bar chart:

price_season = price_decomp.seasonal # seasonality 
price_season_month = price_season.reset_index()
price_season_month['resale_month'] = price_season_month['resale_date'].dt.strftime('%b') # extract months information

months_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

price_season_month['resale_month'] = pd.Categorical(price_season_month['resale_month'], categories= months_order, ordered = True) # create months order on data
pm = price_season_month.groupby('resale_month')['seasonal'].mean()
pm.plot(kind = 'bar')
Seasonality of monthly average price — Image by Author

Data Splitting

Before building a model that can be evaluated for its reliability, we employ a method referred to as data splitting. This process involves partitioning the data into two parts — training dataset and testing dataset. We use the training dataset to build the time-series model, and then use the testing dataset to evaluate the model’s performance. Thus, in the context of time-series modeling, the training dataset acts as “the past”, while the testing dataset acts as “the future”.

# we use log transformation because the model performs better this way
transform = np.log
reverse_transform = np.exp # inverse of the log transform, needed later for evaluation
price_time = price_time.apply(transform)

# Train-Test Split
price_test = price_time['Sep 2022':] # testing dataset (last 12 months)
price_train = price_time[:'Aug 2022'] # training dataset (preceding data)

For our dataset, the timeframe spans from January 2017 to August 2023. We select the last 12 months of the data as the testing dataset and the preceding data as the training dataset. One might wonder about the reasoning behind choosing 12 months. Although this selection is somewhat arbitrary, it serves a practical purpose: a 12-month period provides a window long enough to observe the behaviour of our predictions, yet short enough that the forecast does not diverge too much from the actual data.

Time-Series Modelling

From the information we obtained previously, we could then build a time-series model. Eventually, our goal is to leverage our understanding of the model to predict how our dataset might behave in the future. This process is referred to as forecasting.

There are many different time-series modelling methods. Classical methods include moving average, exponential smoothing, and ARIMA (Autoregressive Integrated Moving Average), while more advanced techniques include Facebook Prophet and LSTM networks. In this particular context, the classical methods work fairly well, and among them, SARIMA (Seasonal Autoregressive Integrated Moving Average), an extension of ARIMA, performs best.

To build a SARIMA model, we need to determine our parameters: p, d, and q of the ARIMA component, and P, D, Q, and s of the seasonal component. An explanation of the parameters can be read here and here. Although there are more rigorous ways to determine the parameters, in this particular instance we use the pragmatic approach of a grid search and choose the model with the lowest resulting AIC (Akaike Information Criterion) score.

from statsmodels.tsa.statespace.sarimax import SARIMAX # import SARIMA function from statsmodels

s = 12 # the seasonality is made up of a 12-month cycle

# note that running this grid search can be computationally expensive and take some time,
# so turn the code off once you have obtained the desired parameters
for p in range(1, 4):
    for d in range(0, 3):
        for q in range(0, 4):
            for P in range(1, 4):
                for D in range(0, 3):
                    for Q in range(0, 4):
                        model = SARIMAX(price_train, order=(p, d, q), seasonal_order=(P, D, Q, s))
                        results = model.fit()
                        print(f'ARIMA{p},{d},{q}-SARIMA{P},{D},{Q}-{s} - AIC: {results.aic}')

For this data, the optimal parameters are 3, 1, and 3 for the ARIMA component and 1, 1, 3, and 12 for the seasonal component. In the code block below, we fit the training dataset to a model using the Statsmodels library.

p,d,q = 3,1,3
P,D,Q,s = 1,1,3,12
sarima_model = SARIMAX(price_train, order=(p, d, q), seasonal_order=(P, D, Q, s))
sarima_result = sarima_model.fit()

Forecasting

After we obtain the model, the next thing we need to do is check the residual plot. As shown in the image below, for reasons I still cannot explain, the residual values at the first and 13th indices always diverge extremely far from zero, no matter how I adjust the model’s parameters.

sarima_result.resid.plot()
Residual plot of the model — Image by Author

Nevertheless, the rest of the data behaves normally. Therefore, to address this issue, I decided to analyse the model using data beyond the 13th index. This pragmatic approach has allowed me to work with the more stable part of the dataset and avoid the impact of extreme outliers.

sarima_result.resid[13:].plot()
Residual plot of the model, without outliers — Image by Author

So, after building the SARIMA model, the next step is to create predictions on both the training and testing datasets, and then plot them together with the actual data for a visual evaluation.
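The plot below was produced from those predictions; a minimal sketch of how they might be generated (train_predict and test_predict are reused in the evaluation that follows) is:

# in-sample prediction over the training window
train_predict = sarima_result.predict(start=price_train.index[0],
                                      end=price_train.index[-1])

# out-of-sample forecast over the 12-month testing window
test_predict = sarima_result.forecast(steps=len(price_test))

# plot predictions against the actual (log-transformed) series
fig, ax = plt.subplots(figsize=(12, 5))
price_time.plot(ax=ax, linestyle='--', label='Actual')
train_predict.plot(ax=ax, label='Train prediction')
test_predict.plot(ax=ax, label='Test forecast')
ax.legend()
plt.show()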

Actual vs Prediction, Time-series analysis — Image by Author

Model Evaluation

From the visual above, we can see remarkable similarity between the predictions, on both the training and testing datasets, and the actual values. The predictions also follow the actual values closely in terms of pattern and trend. Consequently, upon visual examination, we can conclude that our SARIMA model predicts the training dataset well. Moreover, this accuracy extends to the testing dataset, which means the model generalizes reasonably well to unseen data, at least over a short period (in our case, 12 months).

In addition to visual examination, we can also use certain metrics to assess the accuracy of our time-series model. One of those metrics is the mean absolute error (MAE). We calculate MAE by averaging the absolute differences between the predicted values and the true values in our dataset. In essence, MAE measures how closely, on average, our model’s predictions align with the true values. A lower MAE suggests that our model’s predictions are close to the actual data, while a higher MAE means that our model exhibits larger errors. To calculate MAE, we can simply use the function provided by the Scikit-learn library.
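Formally, for $n$ data points with predictions $\hat{y}_i$ and true values $y_i$:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$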

from sklearn.metrics import mean_absolute_error as MAE

# do not forget to reverse our log transformation
MAE_train = MAE(price_train[13:].apply(reverse_transform),train_predict[13:].apply(reverse_transform))
MAE_test = MAE(price_test.apply(reverse_transform),test_predict.apply(reverse_transform))

print(f'MAE of training dataset = {MAE_train:.2f}')
print(f'MAE of testing dataset = {MAE_test:.2f}')
Mean Absolute Errors — Image by Author

From the calculation above, we get MAEs for the training and testing datasets of S$5,748 and S$7,549, respectively. To put these values into perspective, we can compare them with the average of our time-series data, S$481,070. The MAEs are small relative to the average monthly price, indicating that our model exhibits only minor deviations from the true values.

Furthermore, the MAE for the testing dataset is only slightly higher than that of the training dataset. This minuscule difference further confirms that our time-series model can generalize well to unseen data, again, at least in short-term periods.

To be continued…

Finally, as we have completed our time-series model, we are ready to get into the second part of this story. In the next segment, I will show you how to leverage the insights we have gained from time-series analysis and use them to improve the robustness of the regression model.

Further Reading:

[1] Housing and Development Board, HDB History and Towns, 2023. [Online]. Available: https://www.hdb.gov.sg/cs/infoweb/about-us/history [Accessed: October 30th, 2023]

[2] J. Brownlee, How to Create an ARIMA Model for Time Series Forecasting in Python, 2020. [Online]. Available: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/ [Accessed: October 30th, 2023]

[3] J. Brownlee, A Gentle Introduction to SARIMA for Time Series Forecasting in Python, 2019. [Online]. Available: https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python/ [Accessed: October 30th, 2023]

