TimesFM — Google’s Foundational Model for Time Series Forecasting
Author(s): Satyajit Chaudhuri
Originally published on Towards AI.
Introduction
Imagine if you could forecast future trends with the same ease that language models understand text. Whether you’re predicting stock prices, healthcare demands, or optimizing logistics, accurate time-series forecasting is crucial. Traditional methods like ARIMA struggle with modern data complexities, but deep learning has shown promise.
Now, imagine a large, pretrained model tailored for time-series data: one that delivers accurate predictions without extensive retraining. This is the groundbreaking work of Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. Their decoder-only model, inspired by the large language models that power modern NLP, uses a patch-based approach to handle data efficiently. Trained on diverse datasets, it offers near-supervised performance in zero-shot scenarios.
In this article, we will discuss the model architecture and its training, and carry out a hands-on forecasting case study in which we test the prediction capabilities of TimesFM and compare it with statistical and machine-learning models, as well as with another foundational model that has generated some buzz in recent times, TimeGPT.
Contents
- Guiding Principles for the TimesFM Architecture
- Components of the Model Architecture
- Dataset for Training
- Hands-on Implementation Guide
- Conclusion
1. Guiding Principles for the TimesFM Architecture
A foundational model for time-series forecasting should be capable of adapting to varying context and horizon lengths, while possessing sufficient capacity to encode all patterns from extensive pretraining datasets. The major guiding principles for this architecture are as follows:
1.1 Patching — Breaking Down the Data
First, think about how you might break down a long book into manageable chapters. This model does something similar with time-series data through a process called “patching.” Instead of handling an entire sequence at once, it divides the data into smaller, more manageable segments called patches. This approach not only speeds up the model’s processing but also helps it focus on smaller, detailed trends within the data. On the other hand, the patch length cannot be extended all the way to the context length, because that would move the model away from decoder-only training and the efficiencies that come with it.
1.2 Decoder-Only Architecture
This model is trained in decoder-only mode. In other words, given a sequence of input patches, the model is optimized to predict the next patch as a function of all past patches. As with LLMs, this can be done in parallel over the entire context window, so the model automatically learns to make predictions after having observed varying numbers of input patches.
1.3 Generating Longer Forecast Output Patches
In Large Language Models (LLMs), output is generally produced in an auto-regressive manner, generating one token at a time. However, research suggests that for long-horizon forecasting, predicting the entire horizon at once can lead to better accuracy compared to multi-step auto-regressive decoding. This direct prediction approach is challenging when the horizon length is unknown beforehand, such as in zero-shot forecasting, which is the main focus of the discussed model.
To tackle this issue, the authors suggest a compromise by using longer output patches for prediction compared to the input patches. For instance, if the input patch length is 32 and the output patch length is 128, the model is trained as follows: it uses the first 32 time-points to forecast the next 128 time-steps, the first 64 time-points to forecast time-steps 65 to 192, the first 96 time-points to forecast time-steps 97 to 224, and so forth.
During inference, if the model receives a new time-series of length 256 and is tasked with forecasting the next 256 time-steps, it will first predict time-steps 257 to 384. It will then use the initial 256-length input along with the generated output to forecast time-steps 385 to 512. In contrast, a model with an output patch length equal to the input patch length of 32 would need 8 auto-regressive steps to complete the same task, as opposed to just 2 steps in the proposed method.
However, there is a trade-off. If the output patch length is too long, it becomes challenging to handle time-series shorter than the output patch length, such as monthly or yearly time-series in the pretraining data.
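To make the arithmetic above concrete, here is a small Python sketch (our own illustration, not code from the paper) that counts the number of auto-regressive decoding steps needed for a given horizon and output patch length:
import math

def decoding_steps(horizon_len: int, output_patch_len: int) -> int:
    # Each decoding step emits one output patch, so we need enough
    # steps to cover the requested forecast horizon.
    return math.ceil(horizon_len / output_patch_len)

print(decoding_steps(256, 128))  # 2 steps with an output patch length of 128
print(decoding_steps(256, 32))   # 8 steps if the output patch length equals the input patch length of 32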
2. Components of the Model Architecture
2.1 Input
- The time series is preprocessed to break the input into contiguous, non-overlapping patches.
- The patches are then processed by a residual block into a vector of size model_dim.
- A binary mask is also supplied with the inputs to the transformer. The binary mask is used to denote whether the corresponding data point should be considered (0) or ignored (1).
- The residual block is essentially a multi-layer perceptron (MLP) block with one hidden layer and a skip connection.
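To make the residual block concrete, here is a minimal NumPy sketch of an MLP with one hidden layer and a skip connection; the ReLU activation and hidden size are assumptions for illustration, not the exact TimesFM implementation:
import numpy as np

def residual_block(patch, w_hidden, b_hidden, w_out, b_out, w_skip, b_skip):
    # One hidden layer with an (assumed) ReLU activation ...
    hidden = np.maximum(0.0, patch @ w_hidden + b_hidden)
    # ... plus a linear skip connection from the input patch to the output.
    return hidden @ w_out + b_out + (patch @ w_skip + b_skip)

# Toy example: map a patch of 32 time points to a model_dim of 1280.
rng = np.random.default_rng(0)
patch_len, hidden_dim, model_dim = 32, 1280, 1280
out = residual_block(
    rng.normal(size=patch_len),
    rng.normal(size=(patch_len, hidden_dim)), np.zeros(hidden_dim),
    rng.normal(size=(hidden_dim, model_dim)), np.zeros(model_dim),
    rng.normal(size=(patch_len, model_dim)), np.zeros(model_dim),
)
print(out.shape)  # (1280,)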
2.2 The Transformer Architecture
This foundational model uses a Stacked Transformer approach which involves stacking multiple transformer layers, where each layer is composed of two main components: multi-head self-attention mechanisms and feed-forward neural networks.
- Multi-Head Self-Attention: Each transformer layer uses multi-head self-attention to allow the model to focus on different parts of the input sequence simultaneously. This means that for a given output token, the model can consider multiple aspects of the preceding tokens, enhancing its ability to capture complex patterns and dependencies in the data.
- Feed-Forward Networks: Following the self-attention mechanism, each layer has a feed-forward network applied to each position in the sequence independently. This further processes the attended information and enables the model to learn higher-level representations.
- Causal Attention: In the context of time-series forecasting, the authors implement causal attention. This ensures that each output token can only attend to tokens that precede it in the sequence. By doing this, the model adheres to the chronological order of the data, preventing information from future tokens (which should not be available at the time of prediction) from influencing the current prediction. A short sketch of this masking follows the list below.
- Stacking Layers: By stacking multiple transformer layers, the model can progressively build more abstract representations of the input data. Each layer refines the representations learned by the previous layers, enabling the model to capture intricate patterns over varying temporal spans.
- Hyperparameters: The architecture is governed by two crucial hyperparameters:
- Model Dimension: This determines the size of the representation space in each transformer layer.
- Number of Attention Heads: This specifies how many different aspects of the input the model can focus on simultaneously.
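The causal-attention constraint is easy to visualize: each position may only attend to itself and earlier positions. Below is a small NumPy sketch (purely illustrative, not TimesFM code) that masks the upper triangle of an attention score matrix before the softmax:
import numpy as np

def causal_attention_weights(scores):
    # scores: (seq_len, seq_len) matrix of raw attention scores.
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal = future positions
    scores = np.where(future, -np.inf, scores)                      # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)            # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)            # row-wise softmax

scores = np.random.default_rng(0).normal(size=(4, 4))
print(np.round(causal_attention_weights(scores), 2))  # the upper triangle is all zeros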
As shown in the figure above, the TimesFM architecture takes an input time-series of a given length and breaks it down into input patches. Each patch is processed by a residual block (defined in the model configuration) into a vector matching the model dimension of the transformer layers, positional encodings are added, and the resulting vectors are fed into the stacked transformer layers.
SA refers to self-attention, specifically multi-head causal attention, and FFN refers to the fully connected layer in the transformer. The output tokens are mapped through a residual block to an output of size output_patch_len, which constitutes the forecast for the time window following the last input patch seen by the model so far.
2.3 Output Layers
The output layers are tasked with mapping the output tokens to predictions. The model is trained in decoder-only mode, meaning each output token should predict the part of the time-series that follows the last input patch it has seen. The point to note is that, unlike many other time-series forecasting models, the input patch length does not have to equal the output patch length, so the model can predict a larger chunk of the time-series from the information in the input patches.
2.4 Loss Function
The loss function used for this work is Mean Squared Error (MSE). Since the work focuses on point forecasting, MSE is a natural choice for the training loss.
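In generic notation (our own shorthand, not the paper's exact formulation), the loss for a forecast of h time points is
\mathrm{MSE}(\hat{y}, y) = \frac{1}{h} \sum_{t=1}^{h} \left( \hat{y}_t - y_t \right)^2
where \hat{y}_t is the predicted value and y_t the actual value at step t; during training this is averaged over the output patches in the batch.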
2.5 Training
The model is trained using standard mini-batch gradient descent in a decoder-only fashion, where each batch contains time windows drawn from one or more time series.
The masking strategy used here is a unique feature. For each time series in the batch, a random number r between 0 and p-1 is sampled, where p is the patch length. A mask vector is created whose first r entries are set to 1 and the rest to zero, which masks out a fraction of the first input patch. This strategy ensures the model learns to handle every input context length from 1 up to the maximum context length. The example below makes this concrete (a short code sketch follows the list):
- Let's assume the maximum context length is 512, the patch length p is 32, and r = 4.
- The output prediction after seeing the first patch is then optimized as if only 32 - 4 = 28 time points had been observed.
- The next patch is then optimized to predict after seeing 28+32 time points, and so on.
- Repeating this for all such r values ensures the model can handle all context lengths up to 512.
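Here is the promised sketch of the masking scheme (our own illustration with hypothetical names, not the authors' training code):
import numpy as np

def sample_context_mask(context_len, patch_len, rng=None):
    # Returns a 0/1 mask over the context: 1 = ignore the time point, 0 = use it.
    rng = rng or np.random.default_rng()
    r = int(rng.integers(0, patch_len))   # r is drawn from 0 .. p-1
    mask = np.zeros(context_len, dtype=int)
    mask[:r] = 1                          # mask out the first r points of the first patch
    return mask, r

mask, r = sample_context_mask(context_len=512, patch_len=32)
print(r, mask[:40])  # e.g. r = 4 means the first patch contributes 32 - 4 = 28 usable points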
The trained model can then produce forecasts for any horizon using autoregressive decoding.
3. Dataset for Training
The authors use a diverse set of datasets for pretraining the TimesFM model to ensure it captures a wide range of temporal patterns. They source data from Google Trends, which provides search interest data for 22,000 queries over 15 years (2007–2022) in hourly, daily, and weekly granularities, amounting to approximately 0.5 billion time points. Another significant source is Wiki Pageviews, encompassing hourly views of Wikipedia pages from 2012 to 2023. This dataset is aggregated into daily, weekly, and monthly levels, contributing around 300 billion time points.
In addition to real-world data, the authors generate synthetic data using models like ARMA, seasonal patterns, and trends, producing 3 million synthetic time-series, each with 2048 time points. Other real-world data sources include the M4 dataset, hourly and 15-minute electricity data, and hourly traffic data, enhancing the model’s robustness with around 100,000 time-series from the M4 dataset and extensive time-series from traffic and electricity data.
For the training strategy, the authors create a balanced mix of real and synthetic datasets, ensuring equal representation of different granularities (hourly, daily, weekly, monthly). Training batches sample evenly from these granularities, with a minimum time-series length of 256 time points for consistency. Time-series are scaled by the context mean and standard deviation to standardize inputs, and each batch includes 15 primary time-series. This comprehensive approach ensures the TimesFM model is well-prepared to handle various forecasting scenarios across different granularities.
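The scaling step mentioned above is straightforward to mimic: each series is standardized by the mean and standard deviation of its context window. A rough, simplified sketch of that idea (our own, not the actual training pipeline):
import numpy as np

def scale_by_context(context, eps=1e-8):
    # Standardize a context window by its own mean and standard deviation,
    # returning the statistics so a forecast can be rescaled afterwards.
    context = np.asarray(context, dtype=float)
    mu, sigma = context.mean(), context.std()
    return (context - mu) / (sigma + eps), mu, sigma

scaled, mu, sigma = scale_by_context([42374.2, 44945.2, 48685.3, 46498.6])
print(scaled.round(3), round(mu, 1), round(sigma, 1))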
Now let's start playing with the model.
4. Hands-on Implementation Guide
In this section, we will see how to set up the TimesFM model for forecasting. We will also compare its performance with statistical models (AutoARIMA, AutoETS), ML models (Random Forest, XGBoost, LightGBM) and another foundational model (TimeGPT).
The dataset used in this study is taken from Kaggle — Monthly Gold Prices (1979–2021) — Historic Gold Prices of 18 different countries
The refined version of the data that has been used in the below experiment can be downloaded here.
4.1 Reading the Data
import pandas as pd
df = pd.read_csv("GoldPrices.csv")
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('MS').mean()
df = df.reset_index() # Reset index to have 'Date' as a column again
print(df.head())
#Let's Visualise the Dataset
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
plt.figure(figsize=(10, 6))
sns.lineplot(x="Date", y='India(INR)', data=df, color='green')
plt.title('Monthly Gold Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Gold Price in INR')
plt.show()
We will run a seasonal decomposition of the data to check for Trend and Seasonality patterns here.
df.set_index("Date", inplace=True)
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df['India(INR)'])
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(10, 12))
result.observed.plot(ax=ax1, color='green')
ax1.set_ylabel('Observed')
result.trend.plot(ax=ax2, color='green')
ax2.set_ylabel('Trend')
result.seasonal.plot(ax=ax3, color='green')
ax3.set_ylabel('Seasonal')
result.resid.plot(ax=ax4, color='green')
ax4.set_ylabel('Residual')
plt.tight_layout()
plt.show()
df.reset_index(inplace=True)
4.2 Arranging the Data in Format as Required by the Models
The Nixtla models as well as TimesFM expect your univariate time-series data to contain three distinct columns:
- unique_id: identifies each series in your dataset. It can be a string, integer, or category type, and is particularly useful when you are working with multiple time series in the same dataset.
- ds (datestamp): the time component of your data. It should be in a format that Pandas can interpret as a date or timestamp, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp.
- y (target variable): the numeric values you want to forecast, i.e., the measurement or quantity you are trying to predict.
df = pd.DataFrame({'unique_id':[1]*len(df),'ds': df["Date"], "y":df['India(INR)']})
Now coming to the train-test split, we will use 128 data points for training and 24 for testing.
train_df = df[df['ds'] <= '31-07-2019']
test_df = df[df['ds'] > '31-07-2019']
4.3 Statistical Forecasting
#install statsforecast
!pip install statsforecast
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS
# Define the AutoARIMA model
autoarima = AutoARIMA(season_length=12) # Annual seasonality for monthly data
# Define the AutoETS model
autoets = AutoETS(season_length=12) # Annual seasonality for monthly data
# Create the StatsForecast object with both models
statforecast = StatsForecast(
    df=train_df,
    models=[autoarima, autoets],
    freq='MS',
    n_jobs=-1)
# Fit the model
statforecast.fit()
# Generate forecasts
sf_forecast = statforecast.forecast(h=24) # Forecasting for 24 periods
The results of these are stored in the sf_forecast dataframe.
4.4 ML Forecasting
#install mlforecast
!pip install mlforecast
from mlforecast import MLForecast
from mlforecast.target_transforms import AutoDifferences
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from mlforecast.lag_transforms import (
RollingMean, RollingStd, RollingMin, RollingMax, RollingQuantile,
SeasonalRollingMean, SeasonalRollingStd, SeasonalRollingMin,
SeasonalRollingMax, SeasonalRollingQuantile,
ExpandingMean
)
models = [lgb.LGBMRegressor(verbosity=-1), # LightGBM regressor with verbosity turned off
xgb.XGBRegressor(), # XGBoost regressor with default parameters
RandomForestRegressor(random_state=0), # Random Forest regressor with fixed random state for reproducibility
]
fcst = MLForecast(
models=models, # List of models to be used for forecasting
freq='MS', # Monthly frequency, starting at the beginning of each month
lags=[1,3,5,7,12], # Lag features: values from 1, 3, 5, 7, and 12 time steps ago
lag_transforms={
1: [ # Transformations applied to lag 1
RollingMean(window_size=3), # Rolling mean with a window of 3 time steps
RollingStd(window_size=3), # Rolling standard deviation with a window of 3 time steps
RollingMin(window_size=3), # Rolling minimum with a window of 3 time steps
RollingMax(window_size=3), # Rolling maximum with a window of 3 time steps
RollingQuantile(p=0.5, window_size=3), # Rolling median (50th percentile) with a window of 3 time steps
ExpandingMean() # Expanding mean (mean of all previous values)
],
6:[ # Transformations applied to lag 6
RollingMean(window_size=6), # Rolling mean with a window of 6 time steps
RollingStd(window_size=6), # Rolling standard deviation with a window of 6 time steps
RollingMin(window_size=6), # Rolling minimum with a window of 6 time steps
RollingMax(window_size=6), # Rolling maximum with a window of 6 time steps
RollingQuantile(p=0.5, window_size=6), # Rolling median (50th percentile) with a window of 6 time steps
],
12: [ # Transformations applied to lag 12 (likely for yearly seasonality)
SeasonalRollingMean(season_length=12, window_size=3), # Seasonal rolling mean with 12-month seasonality and 3-month window
SeasonalRollingStd(season_length=12, window_size=3), # Seasonal rolling standard deviation with 12-month seasonality and 3-month window
SeasonalRollingMin(season_length=12, window_size=3), # Seasonal rolling minimum with 12-month seasonality and 3-month window
SeasonalRollingMax(season_length=12, window_size=3), # Seasonal rolling maximum with 12-month seasonality and 3-month window
SeasonalRollingQuantile(p=0.5, season_length=12, window_size=3) # Seasonal rolling median with 12-month seasonality and 3-month window
]
},
date_features=['year', 'month', 'quarter'], # Extract year, month, and quarter from the date as features
target_transforms=[AutoDifferences(max_diffs=3)])
fcst.fit(train_df)
ml_forecast = fcst.predict(len(test_df))
The results of these are stored in the ml_forecast dataframe.
4.5 TimeGPT Zero-shot Forecasting
!pip install nixtla
from nixtla import NixtlaClient
# Get your API Key at dashboard.nixtla.io
#Instantiate the NixtlaClient
nixtla_client = NixtlaClient(api_key = 'Your_API_Key')
#Get the forecast
timegpt_forecast = nixtla_client.forecast(df = train_df, h=24, freq="M")
4.6 TimesFM Forecasting
Now, after running the forecasts for the competitor models, we move on to TimesFM, the model of main interest in this study.
!pip install timesfm # You might need to restart the kernel after installing this package
import timesfm
# Initialize the TimesFM model with specified parameters
tfm = timesfm.TimesFm(
context_len=128, # Length of the context window for the model
horizon_len=24, # Forecasting horizon length
input_patch_len=32, # Length of input patches
output_patch_len=128, # Length of output patches
num_layers=20,
model_dims=1280,
)
# Load the pretrained model checkpoint
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")
# Generate forecasts using the TimesFM model on the given DataFrame
timesfm_forecast = tfm.forecast_on_df(
inputs=train_df, # Input DataFrame containing the time-series data for training
freq="MS", # Frequency of the time-series data (e.g., monthly start)
value_name="y", # Name of the column containing the values to be forecasted
num_jobs=-1, # Number of parallel jobs to use for forecasting (-1 uses all available cores)
)
timesfm_forecast = timesfm_forecast[["ds","timesfm"]]
This completes the code for generating the forecasts from the models which are part of this study.
Now we bring all the dates into the same format to resolve any inconsistencies and then merge the forecast dataframes.
# Assuming the DataFrames have a common column 'ds' for the dates
# Convert 'ds' to datetime in all DataFrames if necessary
sf_forecast['ds'] = pd.to_datetime(sf_forecast['ds'])
ml_forecast['ds'] = pd.to_datetime(ml_forecast['ds'])
timegpt_forecast['ds'] = pd.to_datetime(timegpt_forecast['ds'])
timesfm_forecast['ds'] = pd.to_datetime(timesfm_forecast['ds'])
# Now perform the merges
merged_fcst = pd.merge(sf_forecast, ml_forecast, on='ds')
merged_fcst = pd.merge(merged_fcst, timegpt_forecast, on='ds')
merged_fcst = pd.merge(merged_fcst, timesfm_forecast, on='ds')
#Adding the actuals to the dataframe from test_df
merged_fcst = pd.merge(merged_fcst, test_df, on='ds')
#Keep only relevant columns (including the actuals column 'y' needed for evaluation)
merged_fcst = merged_fcst[["unique_id", "ds", "AutoARIMA", "AutoETS", "LGBMRegressor", "XGBRegressor", "RandomForestRegressor", "TimeGPT", "timesfm", "y"]]
The head of the merged forecast dataframe now contains one column of predictions per model alongside the actuals.
The full forecast can be found in the below link.
4.7 Evaluation of the Models
import numpy as np
def calculate_error_metrics(actual_values, predicted_values):
actual_values = np.array(actual_values)
predicted_values = np.array(predicted_values)
metrics_dict = {
'MAE': np.mean(np.abs(actual_values - predicted_values)),
'RMSE': np.sqrt(np.mean((actual_values - predicted_values)**2)),
'MAPE': np.mean(np.abs((actual_values - predicted_values) / actual_values)) * 100
}
result_df = pd.DataFrame(list(metrics_dict.items()), columns=['Metric', 'Value'])
return result_df
# Extract the actuals column 'y'
actuals = merged_fcst['y']
error_metrics_dict = {}
for col in merged_fcst.columns[2:-1]: # Model columns only: skip 'unique_id', 'ds' and the actuals column 'y'
predicted_values = merged_fcst[col]
error_metrics_dict[col] = calculate_error_metrics(actuals, predicted_values)['Value'].values # Extracting 'Value' column
error_metrics_df = pd.DataFrame(error_metrics_dict)
error_metrics_df.insert(0, 'Metric', calculate_error_metrics(actuals, actuals)['Metric'].values) # Adding 'Metric' column
print(error_metrics_df)
As this evaluation shows, among the compared models and for the dataset under study, TimesFM is the best model after AutoETS on the basis of MAE, RMSE and MAPE.
The complete notebook can be found here.
5. Conclusion
TimesFM provides a reliable time-series foundational model that deserves a place in the forecaster's toolbox. It employs a decoder-only transformer architecture, which contrasts with the encoder-decoder frameworks used in many existing time-series models. This design choice simplifies the model while maintaining high performance on forecasting tasks. As seen in this study, when compared with another successful time-series foundational model, TimeGPT, it comes out ahead for this experimental use case.
This study is among the first to compare the performance of two distinct foundational time-series models and experimentally demonstrate their applications. It also contrasts TimesFM with established statistical and machine learning models, providing a comprehensive guide for readers to evaluate the suitability of this model for their forecasting tasks.
References
- Das, A., Kong, W., Sen, R., & Zhou, Y. (2023). A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688.
- Garza, A., & Mergenthaler-Canseco, M. (2023). TimeGPT-1. arXiv preprint arXiv:2310.03589.
- TimesFM HuggingFace Page— https://huggingface.co/google/timesfm-1.0-200m
- MLForecast — https://nixtlaverse.nixtla.io/mlforecast/forecast.html
- Statforecast — https://nixtlaverse.nixtla.io/statsforecast/index.html
Published via Towards AI