# Time Series Regression Using Transformer Models: A Plain English Introduction

Last Updated on July 12, 2023 by Editorial Team

#### Author(s): Ludovico Buizza

Originally published on Towards AI.

## A brief, plain English introduction to time series regression and classification with transformers, plus an implementation in PyTorch

I am working on a project that uses transformer models to diagnose neurodegenerative diseases. The idea is that if you can gather movement data from a patient, you should be able to analyse it and determine whether the patient is ill and, if so, how ill. This is an example of time series classification and regression, and transformer models, which have been used widely in NLP, are well suited to the task [1].

In this article I will give a plain English introduction to time series data and transformer models, explain how to adapt transformers to the task at hand, and provide a very brief case study.

As usual, I will be providing an implementation, and my code is available on my GitHub.

## What is time series data?

The obvious question, with a simple answer. Time series data is simply a series of data points indexed in time order. An easy example to imagine is measuring the temperature outside your window once every minute, for a day. You would end up with 24×60 = 1440 measurements of temperature, in time order. Thinking about it, a lot of the data we collect in the world is time series data — think about financial data (stock prices), weather data, or any sensor data in wearable tech that one might wear.

The thing to note with time series data is that we can analyse it in two ways — we can either ignore the fact there is time information and treat it as a collection of measurements (e.g., if you want to determine whether the average temperature is higher in your car or your kitchen in the same time period), or we can use the time information (e.g., if you want to determine what the temperature trend in your kitchen is over the course of a few days).
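
To make the two views concrete, here is a minimal sketch using NumPy. The temperature readings are synthetic (a made-up linear warming), and the variable names are purely illustrative.

```python
import numpy as np

# One reading per minute for a day: 24 * 60 = 1440 time-ordered points.
minutes = np.arange(24 * 60)

# Synthetic data (an assumption for illustration): a slow linear warming
# of 0.002 degrees per minute on top of a 20 degree baseline.
temperature = 20.0 + 0.002 * minutes

# View 1: ignore time and treat the readings as an unordered collection.
mean_temperature = temperature.mean()

# View 2: use the time index, here by fitting a straight line to get a trend.
slope, intercept = np.polyfit(minutes, temperature, deg=1)
```

Shuffling `temperature` would leave `mean_temperature` unchanged but would destroy the fitted trend, which is exactly the difference between the two ways of looking at the data.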

## What can we do with time series data?

Assuming now that we want to use this time information, what can we actually do? Well, all tasks will fall into one of three categories. These are shown in the figure below, and they are forecasting, classification and regression.

Forecasting is the most common of the three: given a set of time-indexed data points, the task is to predict what this data might do in the future. An example could be — given the temperature measurements from the last week, can you forecast what the temperature might be tomorrow?

Classification is the next most common: given a set of time-indexed data points, the task is to say in which (pre-defined) category the data falls. Imagine we’ve collected walking data from humans, dogs, and cats (for example, with the accelerometer on an Apple Watch). We want to be able to determine which of these created the data when we do classification.

Finally, regression tries to assign a numerical value to a set of time-indexed data. For example, using accelerometer data collected from a human, we want to be able to say at what speed the person is walking.
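
One way to see the difference between the three tasks is to look at what the target looks like for the same input window. The sketch below uses random numbers in place of real accelerometer data, and every value and name in it is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical window of accelerometer data: T time steps, 3 axes.
T, n_axes = 100, 3
window = rng.normal(size=(T, n_axes))

# Forecasting: the target is the next few time steps of the same signal.
horizon = 10
forecast_target = rng.normal(size=(horizon, n_axes))

# Classification: the target is one label from a pre-defined set.
classes = ["human", "dog", "cat"]
class_target = classes.index("dog")

# Regression: the target is a single continuous value for the whole window.
walking_speed_target = 1.4  # metres per second, made up for this example
```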

In the past, using machine learning algorithms for these tasks proved quite challenging. Traditional ML techniques struggle to model long-range dependencies (that is, a model tends to rely on data points that are close together when making a prediction and to neglect ones that are far apart), for various reasons outlined in this great article. This is where the transformer, introduced in 2017 [2], comes in.

## What is a transformer?

Please note in the following discussion, I might refer to the use of sentences as input to models. Transformers were originally created for language processing tasks, and sometimes are easier to explain using examples from language rather than time series data. A sentence can be thought of as a piece of time series data — the order that the words are in matters.

I will try to explain this very simply, but for those wanting a thorough introduction, there are some brilliant articles on Medium, such as this one.

Transformers originally came about for language modeling tasks. Previous models, such as the RNN, would take an input sequence (such as a sentence or time series data) and process it sequentially, one element at a time in the order the data came in. This makes them prone to vanishing gradients during training and poorly suited to modeling long-range dependencies. Transformer models instead attempt to capture relationships between all elements of the input sequence simultaneously. They do this using self-attention.

For a mathematical introduction to self-attention, please see here. Below is a simple intuition behind what it does.

Imagine we are trying to translate the following sentence from one language to another:

The quick brown fox jumps over the lazy dog.

If I were translating this into French, I would want to make sure I knew the French words for fox and dog. I would also want to get their order right (i.e., the fox jumps over the dog, not the other way around). I would probably pay less attention to the word the, as it is neither the subject nor the object of the sentence. Attention tries to do just this: it looks at an input and works out which parts of it matter most for the task at hand.

Imagine we had the same sentence, but the task at hand was to extract colours from the sentence. Attention would weigh heavily on the word brown and weigh less on the other words in the sentence.

In the example in the gif above, we are translating from French to English. The more strongly a hidden state is highlighted, the more heavily attention weights it. You’ll see that when generating the English I, only the French Je is attended to. However, for the English a, both suis and étudiant are attended to. This is because in English we need to know what comes after the indefinite article to choose between a and an. If the final word had been pomme, the translation would have been I am an apple. This is why attention is useful: it lets the model work out what is important for the task at hand.
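
For readers who prefer code to pictures, here is a minimal sketch of single-head scaled dot-product self-attention, the formulation used in [2]. The embeddings and projection matrices are random stand-ins, so the resulting weights are not meaningful. The point is only to show that every position gets a weight over every other position.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    Returns the attended values and the (seq_len, seq_len) weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])  # similarity of every pair
    weights = torch.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_model, d_k = 9, 16, 8   # e.g. the 9 words of the fox sentence
x = torch.randn(seq_len, d_model)  # random stand-in for word embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Row `i` of `weights` says how much position `i` attends to each position in the sequence. In a trained model these rows are what the highlighting in attention visualisations represents.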

## Transformers for time series data

Now we have some kind of intuition behind the attention mechanism. Let’s dive into what a transformer actually is.

Below is an image of a very simple transformer. At this level, it is just an encoder network and a decoder network. We can think of an encoder as something that takes an input and puts it into a format that is useful to the model for the task at hand (e.g., imagine extracting just the background color from a series of images because you want to know what color the sky was). A decoder does the opposite: it takes the machine-readable format (the encoding) and turns it back into something a human can understand (e.g., continuing the example, taking the encoding and outputting the word blue).

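In the transformer, the encoder performs self-attention, as described above (to weigh the various parts of the input more or less heavily), and then passes the result through a feed-forward layer (see here for a good introduction). The decoder is built from the same ingredients, with an extra attention step over the encoder’s output, and uses them to turn the encoding back into an output sequence.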

In the original transformer proposed in [2], the authors stacked 6 identical encoder layers (i.e., the output of one layer, with its self-attention and feed-forward sub-layers, is passed into the next, 6 times in total), and did the same for the decoder.
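
In PyTorch, this kind of stacking can be sketched with the built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` modules. The sizes below are deliberately small and are my own choice for illustration (the paper [2] used a model width of 512, 8 heads, and a feed-forward width of 2048). This is not necessarily how the implementation in my repository is structured.

```python
import torch
from torch import nn

# One encoder layer = self-attention followed by a feed-forward network.
layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128, batch_first=True
)
# Stack 6 identical layers, as in the original paper [2].
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 50, 64)   # (batch, time steps, features)
encoded = encoder(x)         # same shape: one vector per time step
```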

The decoder is needed in the case of machine translation because we expect the output to be a sequence of words, and not just a single number. In the time series data case, this would be similar to forecasting, where the output is a series of data points. See here and here for examples of this.

However, in our case, we only care about outputting a single number (for regression) or a single vector of class scores (for classification). So we can get rid of the decoder part of the transformer entirely and focus only on the encoder. The encoder works in exactly the same way, providing a useful representation of the data for the model. Once we have the encodings, we simply add a linear layer on top that flattens the output of the encoder and projects it to one number (or, for classification, to one score per class).
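
A minimal sketch of such an encoder-only model is shown below. The class name, the layer sizes, and the learned positional embedding are assumptions made for this example rather than a description of the code in my repository. The only parts taken from the text are the encoder and the flatten-then-project linear head.

```python
import torch
from torch import nn

class EncoderOnlyModel(nn.Module):
    """Transformer encoder followed by a flatten-and-project linear head."""

    def __init__(self, n_features, seq_len, d_model=64, n_heads=4,
                 n_layers=2, n_outputs=1):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        # Learned positional embedding so the encoder can see time order
        # (an assumption of this sketch; other encodings are possible).
        self.pos_embedding = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # n_outputs=1 for regression, or the number of classes for logits.
        self.head = nn.Linear(seq_len * d_model, n_outputs)

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        h = self.input_proj(x) + self.pos_embedding
        h = self.encoder(h)                    # (batch, seq_len, d_model)
        return self.head(h.flatten(start_dim=1))

model = EncoderOnlyModel(n_features=1, seq_len=50)
y = model(torch.randn(8, 50, 1))               # one number per sequence
```

For classification, the same class can be used with `n_outputs` set to the number of classes, with the outputs treated as logits for a cross-entropy loss.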

## Adapting the transformer for multivariate data

Great, we now have a transformer-based model that can take a series of time-indexed data points and output a classification or regression. Up until now, we have considered univariate time series. This means that for each point in time, we have just one measurement (for example, temperature). Sadly, most data is not so simple: even data from a single accelerometer has three values for each time step (one for each axis: x, y, and z). In my case, using transformers for disease diagnostics from motion sensor data, I have 17 sensors around the body (totalling 17×3 = 51 values per time step).

Will our transformer work for this data? Well… kind of. It won’t break, and it will still give some intelligible outputs, but we can make it better. With this formulation, attention is computed between time steps, so all of the data points at one time step are treated as a single unit and given the same weight. Whilst this is OK, it means that our model will only really model variations in time, and not in space (i.e., across variables).

In my use case (and I’m drastically oversimplifying here), imagine that a telltale sign of the disease is that when you move one arm in a certain way, there is a twitch in your foot at exactly the same time. The model would not be able to pick up on that, losing out on important information.

What we actually want is depicted on the right of the above image. Whilst on the left, each variable at a given time-step has the same amount of attention given to it, on the right we have attention that can vary both in the space and time dimensions.

The simplest way to achieve this is described in [3]: we linearise our input data. Rather than having one vector per time step (with one value for each variable), we give every value its own position in the sequence, repeating each time index once per variable. For example, if we have T time steps and N variables, our input data would originally have a shape of T×N. In this modified version, we would instead have an input of shape NT×1.
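
The reshaping itself is a one-liner. The sketch below shows it on a tiny example and also keeps track of which time step and which variable each new token came from, since the model needs some way to tell them apart. The index bookkeeping is my own illustration of the idea, not code from [3].

```python
import torch

T, N = 4, 3                      # time steps, variables
x = torch.arange(T * N, dtype=torch.float32).reshape(T, N)

# Flatten row by row: each (time step, variable) pair becomes its own token.
tokens = x.reshape(T * N, 1)

# Keep track of where each token came from, so that separate time and
# variable embeddings can be added later.
time_index = torch.arange(T).repeat_interleave(N)   # 0,0,0,1,1,1,...
variable_index = torch.arange(N).repeat(T)          # 0,1,2,0,1,2,...
```

Attention over `tokens` can now weigh, say, variable 0 at time step 1 against variable 2 at time step 1, which was not possible when each time step was a single token.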

A note on complexity: the computational cost of self-attention scales quadratically with the length of the input sequence, so it is O(T²) for the original sequence of length T. Going from a sequence of length T to one of length NT gives a cost of O(N²T²). In my case, with 51 variables, this means a cost about 2,600 times greater (51² = 2,601) than in the original case. This is far from ideal, and the paper [3] discusses various ways to mitigate this (one for another post!).
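
The arithmetic can be checked in a few lines. The sketch counts pairwise attention scores only, which is the term that grows quadratically, and the value of T is arbitrary.

```python
def attention_cost(seq_len):
    """Number of pairwise attention scores for a sequence of this length."""
    return seq_len ** 2

T = 100   # time steps (arbitrary, chosen for illustration)
N = 51    # variables per time step, as in the article

cost_per_time_step = attention_cost(T)       # one token per time step
cost_linearised = attention_cost(N * T)      # one token per (time, variable)
ratio = cost_linearised // cost_per_time_step
print(ratio)  # 2601, i.e. N ** 2, whatever the value of T
```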

## Test case — human activity recognition (HAR)

To test whether my model works, I am going to benchmark it on a multivariate time series classification problem: human activity recognition. I have taken the dataset from this resource, which is a great database of time series classification problems. The original dataset is presented in [4].

In the study, a total of 24 participants of varying gender, age, weight, and height performed 6 activities in 15 trials in the same environment and conditions: walking downstairs, walking upstairs, walking, jogging, sitting, and standing. The task is: given a 2.5 s snippet of data taken from the motion sensors (accelerometer and gyroscope) of an iPhone 6s, can we detect which of these activities is being performed?

The dataset is therefore a time series with 12 dimensions (3 each for attitude, gravity, user acceleration, and rotation rate). In the original paper, using a variety of (non-transformer) techniques, the best-performing models obtained an accuracy of ~92%. When I tried this with the vanilla transformer above, training for less than 10 minutes and doing no hyperparameter tuning, I obtained an accuracy of 90%.

This suggests that, with more training time and better hyperparameters, my transformer should be able to beat this baseline, and the improved transformer should do even better, although I have not tested either here. Given that the purpose of this test was just to check that the model works as expected, I decided not to push the performance any further.
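
As an illustration of the input format for this task, here is a sketch of cutting a longer 12-channel recording into fixed-length 2.5 s windows. The 50 Hz sample rate and the random "recording" are assumptions made for the example, and the dataset you download may already be split into windows, in which case this step is unnecessary.

```python
import numpy as np

def make_windows(recording, window_len):
    """Split a (time, channels) array into non-overlapping windows."""
    n_windows = recording.shape[0] // window_len
    trimmed = recording[: n_windows * window_len]   # drop the ragged tail
    return trimmed.reshape(n_windows, window_len, recording.shape[1])

sample_rate_hz = 50                        # assumed, not stated in the article
window_len = int(2.5 * sample_rate_hz)     # 125 samples per 2.5 s snippet
recording = np.random.default_rng(0).normal(size=(1000, 12))  # fake trial
windows = make_windows(recording, window_len)   # (8, 125, 12)
```

Each window, together with the activity label of the trial it came from, is then one training example for the classifier.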

## Appendix: implementation in PyTorch

As usual, my implementation is available on my GitHub. Please ensure you download the data from here. My implementation has a Transformer class, as well as a Trainer and DataHandler for ease of training in PyTorch. The transformer has been taken and adapted from this great repo (and paper!).

## References

[1] Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J. and Sun, L., 2022. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125.

[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.

[3] Grigsby, J., Wang, Z. and Qi, Y., 2021. Long-range transformers for dynamic spatiotemporal forecasting. arXiv preprint arXiv:2109.12218.

[4] Malekzadeh, M., Clegg, R.G., Cavallaro, A. and Haddadi, H., 2019, April. Mobile sensor data anonymization. In Proceedings of the international conference on internet of things design and implementation (pp. 49–58).
