Stock Price Change Forecasting with Time Series: SARIMAX
Author(s): Avishek Nag
High-level understanding of Time Series, stationarity, seasonality, forecasting, and modeling with SARIMAX
Time series modeling is the statistical study of sequential data (finite or infinite) that depends on time. Note that "time" here may be just a logical identifier: there may not be any physical time information in time series data. In this article, we will discuss how to model a stock price change forecasting problem with time series, covering some of the underlying concepts at a high level.
Problem Statement
We will take the Dow Jones Index dataset from the UCI Machine Learning Repository. It contains stock price information over two quarters. Let's explore the dataset first:
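A minimal loading sketch with pandas, assuming the UCI data file dow_jones_index.data has been downloaded locally (the file name and location are assumptions here):

import pandas as pd

# Load the UCI "Dow Jones Index" data file, assumed to be downloaded locally
# from https://archive.ics.uci.edu/ml/datasets/dow+jones+index
df = pd.read_csv('dow_jones_index.data')
print(df.head())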
We can see only a few attributes here, but there are others as well. One of them is 'percent_change_next_weeks_price'. This is our target: we need to forecast it for subsequent weeks, given the current week's data. The values of the 'Date' attribute indicate the presence of time-series information. Before jumping into the solution, we will discuss some concepts of time series at a high level.
Definition of Time series
There are different techniques for modeling a time series. One of them is the Autoregressive (AR) process. There, a time series problem can be expressed as a recursive regression problem where the dependent variables are values of the target variable itself at different time instances. Let's say Y_t is our target variable and there is a series of values Y_1, Y_2, … at different time instances. Then,

Y_t = µ + φ(Y_{t−1} − µ) + ε_t

for all time instances t. The parameter µ is the mean of the process. We may interpret the term

φ(Y_{t−1} − µ)

as representing "memory" or "feedback" of the past into the present value of the process. The parameter φ determines the amount of feedback, and ε_t is the new information present at time t that gets added on top. Here, by "process" we mean an infinite or finite sequence of values of a variable at different time instances. If we expand the above recurrence relation, then we get:

Y_t = µ + φ^h (Y_{t−h} − µ) + ε_t + φ ε_{t−1} + … + φ^{h−1} ε_{t−h+1}

This is called the AR(1) process. h is known as the Lag.
A Lag is a logical/abstract time unit. It could be an hour, a day, a week, a year, etc. It makes the definition more generic.
Instead of only a single previous value, if we consider p previous values, then it becomes the AR(p) process, and it can be expressed as:

Y_t = µ + φ_1(Y_{t−1} − µ) + φ_2(Y_{t−2} − µ) + … + φ_p(Y_{t−p} − µ) + ε_t
So, there are many feedback factors φ_1, φ_2, …, φ_p in the AR(p) process. It is a weighted average of all past values.
There is another type of modeling, known as the MA(q) or Moving Average process, which considers only the new information ε and can be expressed similarly as a weighted average:

Y_t = µ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + … + θ_q ε_{t−q}
Stationarity & Differencing
From the equations above, we can see that if |φ| < 1 (or |θ| < 1), the value of Y_t converges towards µ, i.e., a fixed value. It means that if we take the average Y value over any two intervals, it will always be close to µ, i.e., the closeness will be statistically significant. This type of series is known as a Stationary time series. On the other hand, |φ| > 1 gives explosive behavior, and the series becomes Non-stationary.
The basic assumption of time series modeling is stationarity. That's why we have to bring a non-stationary series down to a stationary state by differencing. It is defined as:

ΔY_t = Y_t − Y_{t−1}
Then, we can model ΔY_t again as a time series. It helps to remove the explosiveness stated above. This differencing can be done several times, as it is not guaranteed that doing it once will make the series stationary.
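As a quick illustration, differencing is a one-liner in pandas (a sketch using our target column):

# First-order differencing: ΔY_t = Y_t − Y_{t−1}
diff_y = df['percent_change_next_weeks_price'].diff().dropna()

# A second pass, in case one round of differencing is not enough
diff2_y = diff_y.diff().dropna()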
ARIMA(p,d,q) process
ARIMA is the joint process modeling with AR(p), MA(q), and d rounds of differencing. So, here Y_t contains all the terms of AR(p) and MA(q). It says that if an ARIMA(p,d,q) process is differenced d times, then it becomes stationary.
Seasonality & SARIMA
A time series can be affected by seasonal factors like a week, a few months, quarters in a year, or a few years in a decade. Within those fixed time spans, the target variable shows behavior that differs from the rest of the series, and it needs to be modeled separately. In fact, seasonal components can be extracted from the original series and modeled separately as said. Seasonal differencing is defined as:

Δ_m Y_t = Y_t − Y_{t−m}

where m is the length of the season, i.e., the degree of seasonality.
SARIMA is the process modeling where seasonality is mixed with the ARIMA model.
SARIMA is defined by (p,d,q)(P,D,Q)m, where P, D, and Q are the orders of the seasonal components.
SARIMAX & ARIMAX
So far, we have discussed modeling the series with the target variable Y only. We haven't considered the other attributes present in the dataset.
ARIMAX considers adding other feature variables also in the regression model.
Here, X stands for exogenous. It is like a vanilla regression model where the recursive target variables appear along with other features. With reference to our problem statement, we can design an ARIMAX model with the target variable percent_change_next_weeks_price at different lags, along with other features like volume, low, close, etc. But the other features are considered fixed over time and don't have lag-dependent values, unlike the target variable. The seasonal version of ARIMAX is known as SARIMAX.
Data Analysis
We will start by analyzing the data. We will also learn some other time series concepts along the way.
Let's first plot the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) using the statsmodels library:
import statsmodels.graphics.tsaplots as tsa_plots
import matplotlib.pyplot as plt

tsa_plots.plot_pacf(df['percent_change_next_weeks_price'])
plt.show()
And then the ACF:
tsa_plots.plot_acf(df['percent_change_next_weeks_price'])
plt.show()
ACF gives us the correlation between Y values at different lags. Mathematically, the covariance at lag h can be defined as:

γ(h) = E[(Y_t − µ)(Y_{t+h} − µ)]

and the ACF is its normalized version, ACF(h) = γ(h)/γ(0).
A cut-off in the ACF plot indicates that there is no significant relation between lagged values of Y beyond that point. It is also an indicator of the order q of the MA(q) process. From the ACF plot, we can see that the ACF cuts off at lag zero itself. So, q should be zero.
PACF is the partial correlation between Y values, i.e., the correlation between Y_t and Y_{t+k} conditional on Y_{t+1}, …, Y_{t+k−1}. Like the ACF, a cut-off in the PACF indicates the order p of the AR(p) process. In our use case, we can see that p is zero.
Decomposing components
We will now see how many components are present in the time series, as sketched below.
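A minimal decomposition sketch with statsmodels; the period=2 here is an assumed seasonal length for illustration, not a value fixed by the dataset:

from statsmodels.tsa.seasonal import seasonal_decompose

# Split the series into observed, trend, seasonal, and residual components
decomposition = seasonal_decompose(
    df['percent_change_next_weeks_price'], model='additive', period=2)
decomposition.plot()
plt.show()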
The first graph shows the actual plot, and the second one shows the trend. We can see that there is no specific (upward/downward) trend in the percent_change_next_weeks_price variable. But the seasonal plot reveals the existence of seasonal components, as it shows waves of ups & downs.
Stationarity check: ADF test
The characteristic equation of the AR(p) process is given by:

1 − φ_1 z − φ_2 z^2 − … − φ_p z^p = 0
From our previous discussion, we can say that an AR(1) process is stationary if |φ| < 1, and for AR(p), it should be φ_1 + φ_2 + … + φ_p < 1. So, if a solution of the characteristic equation is of the form:

|z| = 1

i.e., if it has unit roots, then the time series is not stationary.
We can formally test this with the Augmented Dickey-Fuller test, like below:
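A sketch of the test with statsmodels; the null hypothesis of the ADF test is that the series has a unit root, i.e., is non-stationary:

from statsmodels.tsa.stattools import adfuller

adf_result = adfuller(df['percent_change_next_weeks_price'])
print('ADF statistic:', adf_result[0])
print('p-value:', adf_result[1])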
As the p-value is less than 0.05, we can reject the null hypothesis of a unit root, i.e., the series is stationary.
Building the model
We will start building the model.
Pre-processing
We will do some pre-processing: converting the categorical variable stock to numerical, removing the '$' prefix from the price attributes, and filling all null values with zero.
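A pre-processing sketch; the column names follow the UCI schema, and the exact set of '$'-prefixed price columns is an assumption here:

# Categorical stock symbol -> numeric codes
df['stock'] = df['stock'].astype('category').cat.codes

# Strip the '$' prefix and cast the price columns to float
price_cols = ['open', 'high', 'low', 'close', 'next_weeks_open', 'next_weeks_close']
for col in price_cols:
    df[col] = df[col].str.replace('$', '', regex=False).astype(float)

# Fill all null values with zero
df = df.fillna(0)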
We will also separate out the target & feature variables.
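Separating the target and features might look like this (dropping the date column is an assumption made for this sketch):

Y = df['percent_change_next_weeks_price']
X = df.drop(columns=['percent_change_next_weeks_price', 'date'])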
We will split the dataset into training & test sets.
TimeSeriesSplit incrementally splits the data in a cross-validation manner. We have to use the last X_train, X_test set.
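A sketch with scikit-learn's TimeSeriesSplit; as noted, we keep the last split:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

# After the loop, the variables hold the last (and largest) train/test split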
Auto-modeling
We will use auto_arima from the pmdarima library.
It tries out different SARIMAX(p,d,q)(P,D,Q) models and chooses the best one. We used X_train as the exogenous variables and a seasonal start order m of 2 (i.e., it starts from m when trying out different seasonal orders).
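A hedged sketch of the call; note that pmdarima 1.x accepts the exogenous keyword, while newer versions renamed it to X:

import pmdarima as pm

auto_model = pm.auto_arima(
    Y_train, exogenous=X_train,   # 'X=' in pmdarima >= 2.0
    seasonal=True, m=2,           # seasonal period to start from
    trace=True, suppress_warnings=True)
print(auto_model.summary())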
We got the output as below:
So, what we found in the Data Analysis section turned out to be true.
auto_arima checks for stationarity, seasonality, and trend.
The best model has p=0 and q=0, and as the series is stationary, d=0. But, as we saw, it has some seasonal components, whose order is (2,0,1).
We will build the model with statsmodels and the training dataset.
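A minimal sketch with statsmodels' SARIMAX, plugging in the orders reported above; the seasonal period of 2 in seasonal_order is assumed to match the m we passed to auto_arima:

from statsmodels.tsa.statespace.sarimax import SARIMAX

sarimax = SARIMAX(Y_train, exog=X_train,
                  order=(0, 0, 0), seasonal_order=(2, 0, 1, 2))
model = sarimax.fit()
print(model.summary())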
Model details (clipped):
It shows all the feature variable weights.
Forecasting
Before testing the model, we need to discuss the difference between a prediction & a forecast. In a normal regression/classification problem, we use the term prediction very often. But time series is a little different. Forecasting always takes lags into account. Here, to predict the value of Y_t, we need the value of Y_{t−1}, and of course, Y_{t−1} will itself be a forecasted value. So, it is a sequential & recursive process rather than a random one.
Mathematically, for an AR(2) process,

Ŷ_{n+1} = µ + φ_1(Ŷ_n − µ) + φ_2(Ŷ_{n−1} − µ)

Ŷ_n and Ŷ_{n−1} are the previous forecasted values. This way, the chain continues. In the case of ARIMAX, feature values are not dependent on time, so when we do a forecast, we feed in previous Y values along with the same feature X values.
Now, it's time to test the model:
from sklearn.metrics import mean_squared_error

result = model.forecast(steps=len(Y_test), exog=X_test)  # forecast the test horizon
mean_squared_error(Y_test, result)
We will plot the actual vs. predicted results, as sketched below.
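A plotting sketch, assuming result holds the forecasts computed above:

import matplotlib.pyplot as plt
import numpy as np

plt.plot(Y_test.values, label='actual')
plt.plot(np.asarray(result), label='forecast')
plt.legend()
plt.show()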
We can also see the error distribution.
model.plot_diagnostics()
plt.tight_layout()
plt.show()
Errors are normally distributed with zero mean and constant variance, which is a good sign.
The Jupyter notebook can be found here:
Recently I authored a book on ML (https://twitter.com/bpbonline/status/1256146448346988546)