
How to Design a Pre-training Model (TSFormer) For Time Series?

Last Updated on January 6, 2023 by Editorial Team


Author(s): Reza Yazdanfar


There have been numerous recent advances in NLP (Natural Language Processing) tasks, and most of them take advantage of pre-trained models. NLP tasks are mostly fed with data created by human beings, rich and informative enough to almost be considered a natural unit of data. In time-series forecasting, we feel the lack of such pre-trained models. Why can’t we use this advantage in time series as we do in NLP? This article is a detailed illustration of a model proposed to do exactly that. The model is developed from two viewpoints and has four sections from input to output. Python code is also included for a better understanding.

TSFormer

  • TSFormer is an unsupervised pre-training model for time series, built from Transformer blocks and trained with the Masked AutoEncoder (MAE) strategy.
  • The model is able to capture very long dependencies in our data.

NLP and Time Series:

To some extent, NLP data and time series data are alike: both are sequential and locally sensitive, meaning each data point is related to its previous/next data points. That said, there are some differences, which I will cover in the following:

There are two facts (in fact, these are the differences) we should consider when proposing a pre-trained model for time series, as has been done for NLP tasks:

  1. The density of information is much lower in time series data than in natural language.
  2. We need a longer sequence length for time series data than for NLP data.

TSFormer From ZERO

Its process, like that of most other models, is just like a journey (nothing new, but a good perspective). Following the MAE strategy mentioned above, the input goes through an encoder and is then processed by a decoder, whose final task is simply to reconstruct the target.

You can see its architecture in Figure 1:

Figure 1 | [source]

That’s it, mate!! There is nothing more than this figure. 😆 However, if you want to know how it operates, I’ll illustrate it in the following sections:

The process: 1. Masking 2. Encoding 3. Decoding 4. Reconstructing Target

1. Masking

Figure 2 | source

This is the first step, producing the feed for the next step (the Encoder). The input sequence (Sᵢ) is split into P patches of length L. Consequently, the length of the sliding window used to forecast the next time step is P x L.

The masking ratio (r) is 75% (relatively high, isn’t it?); the point is simply to create a self-supervised task and to make the Encoder work more efficiently.

The main reasons for patching the input sequence are the following (a minimal code sketch of this step comes after the list):

  1. Segments (i.e., patches) are better than separate points.
  2. It makes it simpler to use downstream models (STGNN takes a unit segment as input).
  3. It decreases the input size for the Encoder.
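
To make this step concrete, here is a minimal PyTorch sketch (not the authors’ code) that splits a batch of univariate series into patches and randomly hides 75% of them; the function name `patch_and_mask` and all shapes are my own assumptions for illustration.

```python
import torch

def patch_and_mask(series, patch_len=12, mask_ratio=0.75):
    """Split a (batch, seq_len) series into P non-overlapping patches of
    length L and randomly mask a fraction r of them (MAE-style)."""
    B, T = series.shape
    P = T // patch_len
    patches = series[:, :P * patch_len].reshape(B, P, patch_len)   # (B, P, L)

    num_masked = int(P * mask_ratio)
    shuffle = torch.rand(B, P).argsort(dim=1)       # random patch order per sample
    masked_idx = shuffle[:, :num_masked]            # patches the encoder never sees
    visible_idx = shuffle[:, num_masked:]           # patches fed to the encoder

    batch = torch.arange(B).unsqueeze(1)
    visible_patches = patches[batch, visible_idx]   # (B, P - num_masked, L)
    return patches, visible_patches, visible_idx, masked_idx

# usage: one week of hourly data, 12-step patches -> 14 patches, 4 visible
x = torch.randn(8, 168)
patches, visible, vis_idx, mask_idx = patch_and_mask(x)
```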

2. Encoding

Figure 3 | [source]

As you can see, the Encoder is a sequence of Input Embedding, Positional Encoding, and Transformer Blocks, and it operates only on the unmasked patches. Wait!! What was that??

Input Embedding

Q) The previous step was about masking, and now I’m saying we need the unmasked ones??!!

A) Input Embedding.

Q) How?

A) It is actually a linear projection that maps the unmasked patches into latent space. Its formula can be seen below:

Eq. 1: U = W · S + b

W and b are learnable parameters, and U is the resulting model input in the d-dimensional hidden space.
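
As a minimal sketch, Eq. 1 is just a single linear layer over each patch; the dimension values below are assumptions for illustration, not the paper’s hyperparameters.

```python
import torch
import torch.nn as nn

patch_len, d_model = 12, 96                       # L and hidden dimension d (assumed)
input_embedding = nn.Linear(patch_len, d_model)   # learnable W and b of Eq. 1

visible = torch.randn(8, 4, patch_len)            # (batch, visible patches, L)
U = input_embedding(visible)                      # (batch, visible patches, d)
```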

Positional Encoding

A simple Positional Encoding layer is used to append sequential information. Here the positional embeddings are learnable, which gives better performance than sinusoidal encodings; learnable positional embeddings have also shown good results for time series.
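
Here is a minimal sketch of a learnable positional embedding, assuming one embedding vector per patch position; the class name and initialization are illustrative choices, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Learned (not sinusoidal) positional embeddings, one vector per patch."""
    def __init__(self, max_patches, d_model):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_patches, d_model))
        nn.init.uniform_(self.pos, -0.02, 0.02)

    def forward(self, U, index=None):
        # U: (batch, kept patches, d_model); index gives the original
        # positions of the kept patches when part of the sequence is masked.
        if index is None:
            return U + self.pos[:, :U.size(1)]
        return U + self.pos[0][index]

# usage
pos_enc = LearnablePositionalEncoding(max_patches=14, d_model=96)
U = torch.randn(8, 4, 96)
vis_idx = torch.randint(0, 14, (8, 4))
U = pos_enc(U, index=vis_idx)
```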

Transformer Blocks

The researchers used 4 Transformer layers, fewer than the common amount in computer vision and NLP tasks. The Transformer used here is the standard, most widely used architecture; you can read about it thoroughly in “Attention Is All You Need,” published in 2017. Still, I’ll give you a summary of it (this short illustration comes from one of my previous articles, ‘Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting’):

Transformers are a class of deep learning models that are being presented at a rising rate. They adopt the self-attention mechanism and have shown significant increases in model performance on challenging tasks in NLP and computer vision. The Transformer architecture can be divided into two parts, known as the encoder and the decoder, as illustrated in Figure 4 below:

Figure 4. The Transformer Architecture | [source]

The main point of Transformers is their independence from locality. In contrast to other popular models like CNNs, Transformers are not limited to local receptive fields. They also use no convolutional architecture; instead, they rely on attention-based structures, which allows them to achieve better results.
The attention architecture is summarized in Figure 5:

Figure 5. (Left) Scaled Dot-Product Attention. (Right) Multi-Head Attention, consisting of several attention layers running in parallel. | [source]

The function of Scaled Dot-Product Attention is given in Eq. 2:

Eq. 2 [source]: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Q (Query), K (Key), and V (Value) are the inputs of our attention function.
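
Eq. 2 in code, as a minimal sketch rather than an optimized implementation:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Eq. 2: softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ V

# usage with toy shapes
Q = K = V = torch.randn(8, 4, 96)
out = scaled_dot_product_attention(Q, K, V)              # (8, 4, 96)
```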

For a complete, fundamental understanding of Transformers, look at “Attention Is All You Need.” It gives you a great understanding of attention and Transformers; in fact, it was through this paper that I first fully understood this important model.

I think this much of a summary is enough for Transformers.

3. Decoding

Figure 6 | [source]

The decoder consists of a series of Transformer blocks and operates on the full set of patches (including the masked tokens). In contrast to MAE (Masked AutoEncoders), no positional embeddings are needed here, because the patches already carry positional information. Only a single layer is used. After that, simple MLPs (Multi-Layer Perceptrons) are applied (I’m sure there is no need to explain MLPs 😉), making the output length equal to that of each patch.
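
Here is a minimal sketch of that idea: scatter the encoder outputs back among learned mask tokens, run one Transformer layer over the full token set, and project each token back to the patch length. The class name, layer sizes, and use of nn.TransformerEncoderLayer are my assumptions, not the authors’ code.

```python
import torch
import torch.nn as nn

class TSDecoder(nn.Module):
    def __init__(self, d_model=96, n_heads=4, patch_len=12):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=1)   # single layer
        self.head = nn.Linear(d_model, patch_len)                 # token -> patch

    def forward(self, enc_out, num_patches, visible_idx):
        B = enc_out.size(0)
        # start from mask tokens everywhere, then place the encoder outputs
        # at the positions of the visible patches
        tokens = self.mask_token.expand(B, num_patches, -1).clone()
        tokens[torch.arange(B).unsqueeze(1), visible_idx] = enc_out
        return self.head(self.block(tokens))      # (B, num_patches, patch_len)
```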

4. Reconstructing Target

Figure 7 | [source]

The reconstruction is computed in parallel over the masked patches for every data point i. MAE (Mean Absolute Error) between the original sequence and the reconstructed sequence is chosen as the loss function.

Eq. 3: L = (1/N) Σᵢ |Sᵢ − Ŝᵢ| (mean absolute error, taken over the masked patches)
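
A minimal sketch of this loss, assuming reconstructed and original patches plus the masked indices produced in the masking step; the helper name is hypothetical.

```python
import torch

def masked_mae_loss(pred_patches, true_patches, masked_idx):
    """Eq. 3 in spirit: mean absolute error over the masked patches only."""
    batch = torch.arange(pred_patches.size(0)).unsqueeze(1)
    pred = pred_patches[batch, masked_idx]        # (B, num_masked, L)
    target = true_patches[batch, masked_idx]
    return (pred - target).abs().mean()
```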

That’s it!! You did it! Not too hard, is it??! 😉😉

Now, let’s have a look at the architecture:

Figure 8 | [source]
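
To tie the four steps together, here is a hypothetical end-to-end pass built from the sketches above (patch_and_mask, input_embedding, pos_enc, TSDecoder, and masked_mae_loss are the illustrative helpers defined earlier; the 4 encoder layers match the count mentioned above, everything else is assumed):

```python
import torch
import torch.nn as nn

# encoder: 4 standard Transformer layers, matching the layer count mentioned above
enc_layer = nn.TransformerEncoderLayer(d_model=96, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
decoder = TSDecoder(d_model=96, n_heads=4, patch_len=12)

x = torch.randn(8, 168)                                    # raw series
patches, visible, vis_idx, mask_idx = patch_and_mask(x)    # 1. masking
U = pos_enc(input_embedding(visible), index=vis_idx)       # 2. embedding + positions
H = encoder(U)                                             # 2. Transformer blocks
recon = decoder(H, patches.size(1), vis_idx)               # 3. decoding
loss = masked_mae_loss(recon, patches, mask_idx)           # 4. reconstruction loss
loss.backward()
```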

The End

The source of the code and architecture is this and this, respectively.

You can contact me on Twitter here or on LinkedIn here. Finally, if you have found this article interesting and useful, you can follow me on Medium to read more of my articles.

