Are Language Models Actually Useful for Time Series Forecasting?

Author(s): Reza Yazdanfar

Originally published on Towards AI.

Time Series

Time series is one of the most challenging areas of machine learning, and this has made researchers more reluctant to work on it. However, solving time series tasks such as anomaly detection and time series forecasting is vital in a wide variety of industries and could save tons of money.

What happened in Language processing?

The scaling laws, initiated by OpenAI, showed that models generalize better when trained on more raw data, and the result was ChatGPT. Revolutionary!! Since then, LLMs have captured everyone's attention, from politicians to researchers.

What is going on now?

Since then, researchers have been trying to use LLMs for time series! It makes sense to an extent: both language data and time series are sequential, so researchers reasoned that if LLMs generalize well on language data, they can probably do the same for time series.

A bunch of cool work has been done on this; you can read more here, here, here, here, and here.

The question is, “How useful are LLMs really for time series tasks?”

I’d argue some works show a promising future for time series, such as time series reasoning and social understanding (agents), which used LLMs to achieve what they intended.

Time series reasoning:

Large Language Models (LLMs) can enhance time series reasoning by integrating three key forms of analytical tasks: Etiological Reasoning, Question Answering, and Context-Aided Forecasting.

  1. Etiological Reasoning involves hypothesizing potential causes behind observed time series patterns, enabling models to identify scenarios that most likely generated given time series data.
  2. Question Answering enables models to interpret and respond to factual queries about time series, such as identifying trends or making counterfactual inferences about changes in the data.
  3. Context-Aided Forecasting allows models to leverage additional textual information to enhance their predictions about future data points, integrating relevant context to improve forecast accuracy.

However, current LLMs demonstrate limited proficiency in these tasks, performing marginally above random on etiological and question-answering tasks and showing modest improvements in context-aided forecasting.

Social understanding:

Using Large Language Models (LLMs) for time series analysis can significantly enhance social understanding by enabling agents to systematically analyze and predict societal trends and behaviors. LLM-based agents utilize real-world time series data from various domains such as finance, economics, polls, and search trends to approximate the hidden world state of society. This approximation aids in the formulation and validation of hypotheses about societal behaviors by correlating time-series data with other information sources like news and social media.

By including these diverse data streams, LLMs can provide deep insights into multi-faceted and dynamic societal issues, facilitating complex, hybrid reasoning that combines logical and numerical analysis.

Moreover, the hyper portfolio task within SocioDojo allows these agents to make investment decisions based on their understanding of societal dynamics, which serves as a proxy to measure their social comprehension and decision-making capabilities.

This method ensures that the agents are not merely performing historical data fitting but are actively engaging with and adapting to the continuous flow of real-world data, making their analyses and predictions relevant and applicable in real-life scenarios.

Pretty mind blowing, ain’t it?

However, these new models do not use the natural reasoning abilities of pretrained LMs when it comes to time series.

Do LLMs really help time series tasks?

A new study just showed that if we replace the language model with attention layers, the performance does not change dramatically. Even when it is removed completely, the performance gets better in most cases. Dropping the LLM also improves both training and inference speed by up to three orders of magnitude.

The researchers ran three ablations that either delete the LLM component or replace it. The three modifications are as follows:

  1. W/O LLM (Figure 1 (b)). We remove the language model entirely, instead passing the input tokens directly to the reference method’s final layer.
  2. LLM2Attn (Figure 1 (c)). We replace the language model with a single randomly-initialized multi-head attention layer.
  3. LLM2Trsf (Figure 1 (d)). We replace the language model with a single randomly-initialized transformer block.

Figure 1: Overview of all LLM ablation methods. Figure (a) represents time series forecasting using an LLM as the base model. In some works, the LLM components are frozen, while in others, they undergo fine-tuning. Figure (b) shows the model with the LLM components removed, retaining only the remaining structure. Figure (c) replaces the LLM components with a single-layer self-attention mechanism. Figure (d) replaces the LLM components with a simple Transformer. [source]

The leftmost panel, Figure 1 (a), is the original model with the LLM, which serves as the baseline here.
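To make the ablations concrete, here is a minimal pure-Python sketch, not the paper's implementation. It assumes the surrounding model is an encoder that produces tokens, a swappable backbone, and a final linear layer. The names `identity_backbone`, `make_attention_backbone`, and `forecast`, the toy sizes, and the use of a single attention head instead of multiple heads are all my own simplifications. Nothing is trained here; the point is only where the LLM would sit and what takes its place. LLM2Trsf would add a feed-forward layer, residual connections, and normalization on top of the attention layer, and is left out for brevity.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Randomly initialized weights (small Gaussian values)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def identity_backbone(tokens):
    """'w/o LLM': the tokens go straight to the final layer."""
    return tokens

def make_attention_backbone(dim, rng):
    """'LLM2Attn' stand-in: one randomly initialized self-attention layer.

    Assumption: a single head is used here to keep the sketch short.
    """
    wq, wk, wv = (random_matrix(dim, dim, rng) for _ in range(3))

    def backbone(tokens):
        q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
        k_t = [list(col) for col in zip(*k)]          # transpose of k
        scores = matmul(q, k_t)                       # token-to-token scores
        weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
        return matmul(weights, v)                     # weighted mix of values

    return backbone

def forecast(tokens, backbone, head):
    """Shared pipeline: backbone output is flattened, then projected by the final layer."""
    hidden = backbone(tokens)
    flat = [[v for row in hidden for v in row]]
    return matmul(flat, head)[0]

rng = random.Random(0)
num_tokens, dim, horizon = 4, 3, 2                    # hypothetical toy sizes
tokens = random_matrix(num_tokens, dim, rng)          # stands in for embedded patches
head = random_matrix(num_tokens * dim, horizon, rng)  # final linear layer

print(forecast(tokens, identity_backbone, head))                  # w/o LLM
print(forecast(tokens, make_attention_backbone(dim, rng), head))  # LLM2Attn
```

Swapping the backbone while keeping the encoder and the head fixed is what makes the comparison fair: any difference in the forecast comes from the component that replaced the LLM.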

Datasets

The datasets are mostly the standard benchmarks used across time series research: ETT, Illness, Weather, Traffic, Electricity, Exchange Rate, Covid Deaths, Taxi (30 min), NN5 (Daily) and FRED-MD.

Results

Forecasting performance of all models — Time-LLM, LLaTA, and OneFitsAll and results from our ablations. All results are averaged across different prediction lengths. Results in Red denote the best-performing model. # Wins refers to the number of times the method performed best, and # Params is the number of model parameters. “-” means the dataset is not included in the original paper [source]

As you can see, the ablations are superior to Time-LLM in all cases, to LLaTA in 22 out of 26 cases, and to OneFitsAll in 19 out of 26 cases. The metrics used here are MAE and MSE: mean absolute error and mean squared error, respectively.
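For reference, the two metrics can be written in a few lines. This is a plain-Python sketch of the standard definitions; the function names and the toy numbers are mine and are not taken from the paper.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - forecast|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average of (actual - forecast) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values for illustration only, not results from the paper.
actual = [1.0, 2.0, 3.0, 4.0]
forecast_values = [1.5, 2.0, 2.0, 5.0]

print(mae(actual, forecast_values))  # 0.625
print(mse(actual, forecast_values))  # 0.5625
```

Lower is better for both. MSE punishes large misses more heavily than MAE because the errors are squared before averaging.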

It can be concluded that LLMs don’t improve the performance on time series forecasting tasks in a meaningful way.

Now let’s take a look at the computation:

In time series tasks, LLM (LLaMA and GPT-2) significantly increases training time. The table shows the number of model parameters (in millions) and total training time (in minutes) for three methods predicting over a length of 96 on ETTh1 and Weather data. Compared with original method “w/ LLM” are “w/o LLM”, “LLM2Attn” and “LLM2Trsf”. [source]

Time-LLM, OneFitsAll, and LLaTA take, on average, 28.2, 2.3, and 1.2 times longer than the modified models. In other words, the extra computation that LLMs bring to time series is not worth the trade-off.

Now, let’s look at whether pretraining on language datasets results in better time series forecasting.

The study compared four combinations: Pretrain + Finetune, Random Initialization + Finetune, Pretrain + No Finetuning, and Random Initialization + No Finetuning.

Randomly initializing LLM parameters and training from scratch (woPre) achieved better results than using a pretrained (Pre) model. “woFT” and “FT” refer to whether the LLM parameters are frozen or trainable. [source]
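The four settings form a 2×2 grid over two switches: whether the LLM weights start from a pretrained checkpoint, and whether they are updated during training. A small sketch of that grid, using the Pre/woPre and FT/woFT labels from the caption above (the dictionary keys are my own naming):

```python
from itertools import product

settings = []
for pretrained, finetuned in product([True, False], repeat=2):
    label = ("Pre" if pretrained else "woPre") + "+" + ("FT" if finetuned else "woFT")
    settings.append({
        "label": label,
        "pretrained": pretrained,  # False means random initialization
        "trainable": finetuned,    # False means the LLM parameters stay frozen
    })

for s in settings:
    print(s["label"], s)
```

Note that "woPre+woFT" is a randomly initialized and frozen LLM, which is why it acts as the baseline in this comparison.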

As you can see, language knowledge offers very limited improvement for forecasting. However, “Pretrain + No Finetuning” and the baseline “Random Initialization + No Finetuning” performed best 5 times and 0 times, respectively, which suggests that language knowledge does not help during the finetuning process.

For the input shuffling/masking experiments on ETTh1 (predict length is 96) and Illness (predict length is 24), the impact of shuffling the input on the degradation of time series forecasting performance does not change significantly before and after model modifications. [source]

In this experiment, three types of shuffling are used: shuffling the entire sequence randomly (“sf-all”), shuffling only the first half of the sequence (“sf-half”), and swapping the first and second halves of the sequence (“ex-half”).
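The three perturbations are easy to sketch on a plain list. This assumes the shuffle is applied along the time axis of a single 1-D series; whether the paper shuffles raw time steps or patches is not stated here, and the fixed seed is only there to make the example reproducible.

```python
import random

def sf_all(seq, rng):
    """Shuffle the entire sequence randomly."""
    out = list(seq)
    rng.shuffle(out)
    return out

def sf_half(seq, rng):
    """Shuffle only the first half; the second half keeps its order."""
    mid = len(seq) // 2
    first = list(seq[:mid])
    rng.shuffle(first)
    return first + list(seq[mid:])

def ex_half(seq):
    """Swap the first and second halves of the sequence."""
    mid = len(seq) // 2
    return list(seq[mid:]) + list(seq[:mid])

rng = random.Random(0)   # arbitrary seed
series = list(range(8))  # toy input: 0..7 stands in for a time series

print(sf_all(series, rng))
print(sf_half(series, rng))
print(ex_half(series))   # [4, 5, 6, 7, 0, 1, 2, 3]
```

If a model really exploited the order of its inputs, all three perturbations should hurt it, and "sf-all" should hurt the most.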

The results show that LLM-based models are no more vulnerable to input shuffling than their ablations.

Conclusion

This research showed that it’s better to leave traditional time series forecasting to the methods already built for it, instead of trying to use large language models for the job.

That doesn’t mean there is nothing left to do; there are new directions worth pursuing at the intersection of time series and large language models.

I’m working on a new search engine, nouswise, and would love it if you checked it out and let me know your thoughts. You can contact me through any social media, X or LinkedIn.
