Deep Learning for Time Series Forecasting
Last Updated on January 18, 2025 by Editorial Team
Author(s): Sarvesh Khetan
Originally published on Towards AI.
Table of Contents:
- 1. Feed Forward Neural Network
- 2. 1D Convolution Neural Network
- 3. Hidden Markov Models (HMM)
- 4. Conditional Random Fields (CRF)
- 5. Recursive Neural Network (RvNN)
- 6.1. [1990s] Recurrent Neural Network (RNN) – Unidirectional
- 6.2. Bidirectional RNN Model
- 7.1. [1997] Long Short Term Memory (LSTM) RNN – Unidirectional
- 7.2. Bidirectional LSTM RNN
- 8.1. [2014] Gated RNN Model – Unidirectional
- 8.2. Bidirectional Gated RNN Model
- 9.1. Transformer Encoder Model – Bidirectional
- 9.2. Transformer Decoder Model – Unidirectional
Feed Forward Neural Network
If you think about it carefully, you will see that predicting the next value from the last N observations can be framed as a multivariate regression problem, and hence we can use an FFNN to solve it (a minimal sketch follows the list of issues below).
Issues with FFNNs:
- FFNNs do not take the complete history of the sequence into account when making a prediction; they only look at a fixed window of sequential information.
- FFNNs fail for variable-size inputs. What do we mean by variable-size inputs? An FFNN must be given exactly N input features to make a prediction, so if we have fewer or more than N input features, the FFNN cannot produce a prediction at all.
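To make the fixed-window idea concrete, here is a minimal sketch (not taken from the original post; the window size and layer widths are illustrative) of an FFNN that forecasts the next value of a univariate series from the last N observations:

import torch
import torch.nn as nn

N = 8  # window size: number of past time steps fed to the network

# the N lagged values act as N input features of a regression problem
ffnn_forecaster = nn.Sequential(
    nn.Linear(N, 32),
    nn.ReLU(),
    nn.Linear(32, 1),  # single regression output: the next value
)

window = torch.randn(1, N)            # one window of the last N observations
next_value = ffnn_forecaster(window)  # shape (1, 1)

Note that the input layer is hard-wired to N features, which is exactly why the two issues above (no full history, no variable-length inputs) arise.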
Recurrent Neural Network (RNN) – Unidirectional
To solve these issues with FFNNs, researchers developed RNNs; you can read more about RNNs here:
Recurrent Neural Networks (RNNs) for Sequence Classification
khetansarvesh.medium.com
Note: time series forecasting is a regression task, but in the above blog all the equations are shown assuming a classification task; you can adapt the equations to a regression task.
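Since the linked post works through classification, here is a minimal sketch (illustrative sizes, not taken from that post) of how an RNN can instead be used for forecasting: the recurrent layer consumes the whole history step by step, and a linear regression head maps the final hidden state to the next value. Because the recurrence runs over whatever sequence length it is given, variable-length inputs are no longer a problem.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)  # regression head for the forecast

x = torch.randn(1, 50, 1)    # (batch, sequence length, features); any length works
out, h_n = rnn(x)            # out: (1, 50, 16), h_n: (1, 1, 16)
forecast = head(out[:, -1])  # prediction from the final time step, shape (1, 1)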
Bidirectional Recurrent Neural Network (BiRNN)
Single Layer Architecture
Hence we can see that a BiRNN consumes twice as much memory for weights and biases as an RNN.
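A quick way to verify this claim (a minimal sketch with illustrative sizes): PyTorch's bidirectional flag adds a second, independent set of weights for the backward direction, so the parameter count exactly doubles.

import torch.nn as nn

uni = nn.RNN(input_size=10, hidden_size=10)
bi = nn.RNN(input_size=10, hidden_size=10, bidirectional=True)

n_uni = sum(p.numel() for p in uni.parameters())
n_bi = sum(p.numel() for p in bi.parameters())
print(n_uni, n_bi)  # n_bi == 2 * n_uni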
Stacked BiRNN
Long Short Term Memory RNN (LSTM RNN)
To solve the issues with RNNs, researchers developed LSTMs; you can read more about LSTMs here:
LSTM for Sequence Classification
khetansarvesh.medium.com
Note: time series forecasting is a regression task, but in the above blog all the equations are shown assuming a classification task; you can adapt the equations to a regression task.
Below, I have implemented a 4-hidden-layer stacked LSTM RNN architecture to solve the univariate time series forecasting problem of Google stock price prediction.
Time-Series-Modelling/univariate_time_series/LSTM-Stock-Price-Prediction.ipynb at main ·…
Performed literature survey on various architectures like FFNN, RNN, LSTM RNN, Gated RNN, and Transformers (SOTA Model)…
github.com
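The notebook linked above has the full implementation; as a rough orientation, a 4-layer stacked LSTM regressor along those lines could look like the following minimal sketch (the hidden size, window length, and class name are illustrative, not taken from the notebook):

import torch
import torch.nn as nn

class StackedLSTMForecaster(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        # 4 LSTM layers stacked on top of each other
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=4, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # regression head

    def forward(self, x):             # x: (batch, window, 1) of past prices
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # next-step price prediction

model = StackedLSTMForecaster()
pred = model(torch.randn(8, 60, 1))  # e.g. 8 windows of the last 60 days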
Bidirectional LSTM (BiLSTM) RNN
Single Layer Architecture:
Same as what we saw for the RNN here; just replace the recurrent unit with an LSTM unit.
Stacked Architecture:
Same as what we saw for the RNN here; just replace the recurrent unit with an LSTM unit.
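In PyTorch terms, this again comes down to a single flag (a minimal sketch with illustrative sizes):

import torch.nn as nn

# 3 stacked bidirectional LSTM layers; each layer reads the sequence in both directions
bilstm_stack = nn.LSTM(input_size=10, hidden_size=10,
                       num_layers=3, bidirectional=True)
# the output feature dimension becomes 2 * hidden_size = 20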
Gated RNN Model – Unidirectional
Single Layer Architecture
The concept here remains exactly the same as what we have seen for RNNs and LSTMs; we just change the RNN / LSTM cell to a GRU cell.
Below is the internal working of a GRU cell
import torch.nn as nn

# Create a single GRU cell
gru_cell = nn.GRUCell(input_size=10, hidden_size=10)
Stacked Architecture
# 3 GRU layers stacked on top of each other
gru_stack = nn.GRU(input_size=10, hidden_size=10, num_layers=3)
Did GRU Solve the LSTM Issue?
- GRUs are lighter than LSTMs: an LSTM recurrent unit has 3 gates, whereas a GRU reduces this to 2 gates, making it computationally lighter and hence faster to train (the parameter-count sketch below makes this concrete).
- GRUs were proposed in 2014; they reduce the computation and yet work about as well as LSTMs in most cases. Always remember that there is no guarantee that GRUs will work better than LSTMs; the main benefit of a GRU over an LSTM is that training time decreases significantly.
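A quick way to see the difference in size (a minimal sketch with illustrative dimensions): with the same input and hidden sizes, a GRU layer ends up with roughly three quarters of the LSTM layer's parameters.

import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=10)
gru = nn.GRU(input_size=10, hidden_size=10)

print(sum(p.numel() for p in lstm.parameters()))  # 880
print(sum(p.numel() for p in gru.parameters()))   # 660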
Issues with GRU:
Using Gated RNNs we reduced the training time and also handled the vanishing gradient problem, but in today's world of big data we want to train our models on multiple GPUs in parallel to reduce the training time even further. With RNNs / LSTM RNNs / Gated RNNs, such parallel training is simply impossible, because all of these models are sequential by nature. FFNNs can be trained in parallel, but these recurrent models cannot.
Hence researchers wanted to make use of this superpower of FFNNs, and they came up with a new model in which an FFNN, rather than a recurrent network like an RNN / LSTM RNN / Gated RNN or its variants, is used to model sequences.
Bidirectional Gated RNN Model
Single Layer Architecture:
Same as what we saw for the RNN here; just replace the recurrent unit with a GRU unit.
Stacked Architecture:
Same as what we saw for the RNN here; just replace the recurrent unit with a GRU unit.
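As with the BiLSTM, PyTorch exposes this through a single flag (a minimal sketch with illustrative sizes):

import torch.nn as nn

# 3 stacked bidirectional GRU layers
bigru_stack = nn.GRU(input_size=10, hidden_size=10,
                     num_layers=3, bidirectional=True)
# the output feature dimension becomes 2 * hidden_size = 20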
Transformer Encoder Model – Bidirectional
Single Layer Transformer Model
Implement Transformers (Bidirectional) from Scratch in Pytorch for Sequence Classification
khetansarvesh.medium.com
Stacked Transformer Model
Researchers have noticed that 12 to 24 hidden layers work really well in most cases!
import copy
import torch.nn as nn

def clones(module, N):
    # make N independent (deep-copied) copies of the given layer
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, N=24):
        super(TransformerEncoder, self).__init__()
        self.layers = clones(encoder_layer, N)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# in the single transformer section we saw how to create 'transformer_encoder_layer'
# now we stack that transformer encoder layer 24 times
stacked_transformer_encoder = TransformerEncoder(transformer_encoder_layer, 24)

'''
Instead of using our own implementation we can use the PyTorch implementation:
stacked_transformer_encoder = nn.TransformerEncoder(
    transformer_encoder_layer,
    num_layers=24
)
'''
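For a quick end-to-end check, PyTorch's built-in encoder layer can stand in for the from-scratch one (a minimal sketch; d_model, nhead and the sequence length are illustrative, not values from the post):

import torch
import torch.nn as nn

transformer_encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12)
stacked_transformer_encoder = nn.TransformerEncoder(transformer_encoder_layer,
                                                    num_layers=12)

x = torch.randn(50, 1, 768)               # (sequence length, batch, d_model)
encoded = stacked_transformer_encoder(x)  # same shape, contextualised representations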
Efficient Transformers
khetansarvesh.medium.com
Transformer Decoder Model – Unidirectional
Single Layer Architecture
We saw the bidirectional transformer architecture here; now, to make it unidirectional, we just replace the self attention layer in that architecture with a masked self attention layer.
In self attention, we look at both forward and backward sequential information, i.e. if we are at x4, it will look at x1, x2, x3, x5, x6, …, xm to calculate the matured representation of x4.
But in masked self attention we look only at the backward information, to make it unidirectional, i.e. if we are at x4, it will look only at x1, x2, x3, i.e. the vectors to its left.
To convert self attention into masked self attention, we just have to make a minor change in step 2 of the vector implementation of self attention that we saw here. The change goes as follows …
Hence the matrix implementation equivalent of this would look something like this:
Now, we saw multi-headed self attention in the case of the Transformer Encoder Model; similarly, here too we can have multi-headed masked self attention.
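In matrix form, that minor change is usually realised as a causal (look-ahead) mask applied to the attention scores before the softmax; here is a minimal sketch of the idea (illustrative sizes, not the post's own implementation):

import torch

m = 6                       # sequence length
scores = torch.randn(m, m)  # raw attention scores, i.e. Q @ K^T / sqrt(d)

# boolean mask that is True strictly above the diagonal (the "future" positions)
causal_mask = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))

weights = torch.softmax(scores, dim=-1)  # row i now attends only to positions <= i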
Stacked Architecture
import torch.nn as nn

# Create a single Transformer decoder layer
transformer_decoder = nn.TransformerDecoderLayer(d_model=768, nhead=12)

# Stack 6 Transformer decoder layers on top of each other
transformer_decoder_stack = nn.TransformerDecoder(decoder_layer=transformer_decoder,
                                                  num_layers=6)
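A quick usage sketch (illustrative sizes and dummy tensors, not code from the post): nn.TransformerDecoder expects the target sequence, an encoder "memory", and the causal mask discussed above.

import torch

seq_len, batch, d_model = 20, 1, 768
tgt = torch.randn(seq_len, batch, d_model)     # the sequence being decoded
memory = torch.randn(seq_len, batch, d_model)  # encoder output (dummy here)

# additive causal mask: 0 on and below the diagonal, -inf strictly above it
tgt_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

out = transformer_decoder_stack(tgt, memory, tgt_mask=tgt_mask)  # (seq_len, batch, d_model)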
Published via Towards AI