LLM Driven AutoForecasting with Sktime’s `craft()`

Last Updated on May 27, 2026 by Editorial Team

Author(s): Benedikt Heidrich

Originally published on Towards AI.

AutoML often relies on some form of grid search, which is expensive. Humans who select hyperparameters, by contrast, usually have a feeling for good choices based on previous experience. So I wondered whether an LLM can act like a human expert and determine the parameters by considering the problem and using the knowledge stored in its weights. Reading about the autoresearch package from Andrej Karpathy inspired me to try something similar for time series forecasting. I wrote an sktime forecaster that uses an LLM to propose and refine forecasting pipelines. It uses the `craft()` function from sktime to turn LLM-generated blueprints into sktime forecasters. Thus, I call it `LLMBlueprintForecaster`.

In this post, I briefly explain `craft()` and how it is used in the `LLMBlueprintForecaster` to create an agentic AutoML loop. I also discuss current limitations and potential improvements.

`craft()`ING Pipelines in sktime

The `craft()` function in sktime behaves like a specialized version of `eval()`. It takes a string that is formatted like a Python constructor call of an sktime estimator. This string can reference any class that is in the scope of sktime's `all_estimators()`. When you call `craft()` on that string, it returns a fully constructed sktime object. Internally, `craft()` performs all the imports required to create the estimator. Besides single estimators, `craft()` can also create compositors, such as pipelines: `"TransformedTargetForecaster([Deseasonalizer(sp=12), Detrender(), AutoARIMA()])"`

from sktime.registry import craft

# Bare estimator
forecaster = craft("NaiveForecaster(strategy='last')")

# Full pipeline - no imports needed
pipeline = craft(
    "TransformedTargetForecaster([Deseasonalizer(sp=12), Detrender(), AutoARIMA()])"
)

Why is `craft()` so useful for agentic systems?

As you can see, `craft()` accepts a text string that looks like Python code. This is a perfect fit for LLMs, since they generate text and are trained on a lot of code, including Python.

That is, when using `craft()`, no complex parsing logic or tool calls have to be defined. The LLM can directly generate the spec string that `craft()` expects, and then we can directly evaluate the resulting object.

Note that since `craft()` evaluates the string it is given and can perform imports, it can also execute malicious code.
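One way to reduce this risk is to check a spec before it ever reaches `craft()`. The following is a minimal sketch using only the standard library, not something the `LLMBlueprintForecaster` is known to do: the helper `is_safe_spec` and the allow-list are hypothetical, and the check assumes that a valid spec consists only of constructor calls on known names with literal arguments.

```python
import ast

# Node types a plain constructor-call spec needs; anything else is rejected.
_ALLOWED_NODES = (
    ast.Expression, ast.Call, ast.Name, ast.Load, ast.keyword,
    ast.Constant, ast.List, ast.Tuple, ast.Dict, ast.UnaryOp, ast.USub,
)


def is_safe_spec(spec, allowed_names):
    """Return True if spec only calls allow-listed names with literal arguments."""
    try:
        tree = ast.parse(spec, mode="eval")
    except (SyntaxError, ValueError):
        return False
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            return False
        if isinstance(node, ast.Name) and node.id not in allowed_names:
            return False
    return True


# Hypothetical allow-list; in practice it could be built from all_estimators().
allowed = {"TransformedTargetForecaster", "Deseasonalizer", "Detrender", "AutoARIMA"}
good = "TransformedTargetForecaster([Deseasonalizer(sp=12), Detrender(), AutoARIMA()])"
bad = "__import__('os').system('echo pwned')"
print(is_safe_spec(good, allowed))
print(is_safe_spec(bad, allowed))
```

Attribute access, lambdas, comprehensions, and unknown names are all rejected because their node types or identifiers are not on the lists. This is a filter in front of `craft()`, not a sandbox.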

How `craft()` powers the `LLMBlueprintForecaster`

The `LLMBlueprintForecaster` is inspired by `autoresearch`. However, instead of iteratively improving the training of a neural network, the goal here is to iteratively find the best sktime forecaster for a given time series. To do this, the `LLMBlueprintForecaster` gets information about the time series and a list of all available forecasters in sktime. To provide information about the time series, I implemented three methods:

simple statistics about the time series,
an LLM-provided description of the time series, for which the time series is plotted and the plot is passed to the LLM,
and the plot of the time series passed directly to the LLM.

Then, the LLM performs the following loop:
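To make the first of these methods concrete, here is a minimal sketch of how a few simple statistics could be turned into prompt text. Which statistics the `LLMBlueprintForecaster` actually computes is not stated here, so the function `describe_series`, its fields, and the seasonal lag parameter are assumptions for illustration.

```python
import statistics


def describe_series(values, seasonal_lag=12):
    """Build a short plain-text summary of a series for use in a prompt."""
    n = len(values)
    mean = statistics.fmean(values)
    # Autocorrelation at the seasonal lag as a rough seasonality hint.
    num = sum((values[i] - mean) * (values[i - seasonal_lag] - mean)
              for i in range(seasonal_lag, n))
    den = sum((v - mean) ** 2 for v in values)
    acf = num / den if den else 0.0
    lines = [
        f"length: {n}",
        f"mean: {mean:.2f}",
        f"std: {statistics.pstdev(values):.2f}",
        f"min: {min(values):.2f}, max: {max(values):.2f}",
        f"autocorrelation at lag {seasonal_lag}: {acf:.2f}",
    ]
    return "\n".join(lines)


# Toy series with a period of 4, for illustration only.
toy = [1.0, 2.0, 3.0, 2.0] * 6
summary = describe_series(toy, seasonal_lag=4)
print(summary)
```

A high autocorrelation at the seasonal lag is the kind of hint that could steer an LLM towards a deseasonalizing step, provided the model actually uses the information.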

Based on the results of previous iterations, the LLM is prompted to propose `n_blueprints` new forecasters. Each blueprint has a name and a spec string that can be passed to `craft()`.

[
{
"name": "Deseason + AutoARIMA",
"spec": "TransformedTargetForecaster([Deseasonalizer(sp=12), AutoARIMA()])"
},
{
"name": "BoxCox + ETS",
"spec": "TransformedTargetForecaster([BoxCoxTransformer(), ExponentialSmoothing(trend='add')])"
}
]
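Before such a response can be passed on, it has to be parsed. The sketch below shows one possible way to do this with the standard library; `parse_blueprints` is a hypothetical helper, and the assumption that the model may wrap the JSON array in extra prose is mine, not something the forecaster is documented to handle this way.

```python
import json


def parse_blueprints(llm_text):
    """Extract a list of {'name', 'spec'} dicts from raw LLM output."""
    # Models often wrap JSON in prose, so cut out the outermost array.
    start, end = llm_text.find("["), llm_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in LLM output")
    items = json.loads(llm_text[start:end + 1])
    blueprints = []
    for item in items:
        # Keep only well-formed entries; skipping the rest is a design choice.
        if (isinstance(item, dict)
                and isinstance(item.get("name"), str)
                and isinstance(item.get("spec"), str)):
            blueprints.append({"name": item["name"], "spec": item["spec"]})
    return blueprints


raw = """Here are my proposals:
[
  {"name": "Deseason + AutoARIMA",
   "spec": "TransformedTargetForecaster([Deseasonalizer(sp=12), AutoARIMA()])"},
  {"name": "broken entry"}
]
Let me know how they score."""
parsed = parse_blueprints(raw)
print(parsed)
```

The slicing heuristic breaks if the surrounding prose itself contains square brackets, so a structured-output mode of the LLM API would be the more robust choice where available.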

`craft()` is used to instantiate each spec, and the resulting forecasters are evaluated on the data using the provided cross-validation splitter. The evaluation results are collected, and if there is a new best blueprint, it is saved.
Note that in case of an error, the LLM has the option to fix the failing forecaster.
All scores and potential errors are fed back to the LLM as part of the next prompt. Based on this information, the LLM can learn and propose an improved set of new forecasters.

This loop is repeated `n_iterations` times, and at the end the best blueprint is refit on the full data and used as the final model.
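The control flow of this loop can be sketched in a few lines. This is a simplified illustration, not the actual implementation: `search`, `fake_propose`, and `fake_evaluate` are hypothetical names, the LLM call is replaced by a deterministic stub, and `craft()` plus cross-validation is replaced by a stub that returns a made-up score or raises an error.

```python
def search(propose, evaluate, n_iterations=3, n_blueprints=2):
    """Propose-evaluate-refine loop; lower scores are better."""
    history = []   # everything that is fed back to the proposer
    best = None    # (score, blueprint) of the best blueprint so far
    for _ in range(n_iterations):
        for blueprint in propose(history, n_blueprints):
            entry = {"name": blueprint["name"], "spec": blueprint["spec"],
                     "score": None, "error": None}
            try:
                entry["score"] = evaluate(blueprint["spec"])
            except Exception as exc:  # a failed spec is reported, not fatal
                entry["error"] = repr(exc)
            history.append(entry)
            score = entry["score"]
            if score is not None and (best is None or score < best[0]):
                best = (score, blueprint)
    return best, history


# Stand-ins for the LLM call and for craft() plus cross-validation.
def fake_propose(history, n):
    k = len(history)
    return [{"name": f"candidate {k + i}", "spec": f"Spec{k + i}()"}
            for i in range(n)]


def fake_evaluate(spec):
    if spec == "Spec1()":
        raise ValueError("could not build estimator")
    return abs(int(spec[4:-2]) - 3) + 0.5


best, history = search(fake_propose, fake_evaluate)
print(best)
```

The important design point is that errors end up in the same history as scores, so the next proposal round can react to both.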

Example with Ollama

This example shows how you can use this forecaster for forecasting. As an example, I have chosen the Airline Passengers dataset. It consists of a univariate time series with an increasing trend and a yearly seasonality. As the training dataset, I used the first 132 months, and I held out the last 12 for testing.

I instantiate the `LLMBlueprintForecaster` with a cross-validation splitter, which gives the forecaster an evaluation scheme for the blueprints. Besides that, I also pass the parameters `n_iterations`, `n_blueprints`, `api_params`, and the model that should be used. Since the forecaster uses `litellm`, it is sufficient to pass a string specifying the model provider and model name. I chose the Gemma 4 model, which I host locally using Ollama. On a MacBook Air, the runtime for 5 iterations with 3 blueprints each is around 4 minutes.

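import numpy as np
from sktime.datasets import load_airline
from sktime.split import ExpandingWindowSplitter, temporal_train_test_split
# LLMBlueprintForecaster comes from the sktime pull request listed in the
# references; its import path is omitted because the PR is still under review.

# Airline Passengers: first 132 months for training, last 12 held out
y = load_airline()
y_train, y_test = temporal_train_test_split(y, test_size=12)
# Forecasting horizon of 12 months (assumed to match the held-out period)
fh = np.arange(1, 13)

cv = ExpandingWindowSplitter(initial_window=12 * 5, step_length=12, fh=fh)
# Create the LLM Blueprint Forecaster
forecaster = LLMBlueprintForecaster(
    cv=cv,
    # LLM to use for blueprint generation
    model="ollama/gemma4:e4b",
    # number of generate-evaluate-refine cycles
    n_iterations=5,
    # blueprints per iteration
    n_blueprints=3,
    # LLM sampling temperature
    api_params={"temperature": 0.7},
)
# Fit - this triggers the LLM blueprint search
forecaster.fit(y_train, fh=fh)
y_pred = forecaster.predict()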
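To see which folds the cross-validation scheme above produces on 132 training months, the splitting logic can be imitated in plain Python. This sketch reflects my understanding of how an expanding window with `initial_window=60`, `step_length=12`, and a 12-step horizon behaves; `expanding_window_splits` is a hypothetical helper, not sktime code, and edge cases may differ from the real `ExpandingWindowSplitter`.

```python
def expanding_window_splits(n_obs, initial_window, step_length, fh):
    """Yield (train_indices, test_indices) for an expanding training window."""
    max_fh = max(fh)
    train_end = initial_window  # exclusive end of the training window
    while train_end + max_fh <= n_obs:
        train = list(range(train_end))
        # Horizon h=1 is the first point after the training window.
        test = [train_end - 1 + h for h in fh]
        yield train, test
        train_end += step_length


fh = list(range(1, 13))
folds = list(expanding_window_splits(132, 12 * 5, 12, fh))
for train, test in folds:
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

Under these assumptions each blueprint is fitted once per fold, which is one reason why slow estimators quickly dominate the total runtime.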

Figure 1 below shows the resulting prediction of the forecaster, together with the ground truth and the training data.

Figure 1: Input time series, test data, and the forecast of the LLMBlueprintForecaster.

In this example the best blueprint was a pipeline consisting of a log transformation and Prophet.

Insights

Besides just applying the forecaster, I would also like to report some interesting insights. First, I focus on how the three different ways of providing the time series to the LLM impact the performance. Second, I dig into the trace of the LLM, i.e., how often the forecaster improves when the model iterates over multiple rounds. Third, how does the LLM know which estimators exist?

Note that for this analysis, I decided not to use a model hosted via Ollama. Instead, I used Mistral-Small, Mistral-Medium, and Mistral-Large. As the dataset, I still used the airline dataset, since with longer time series the proposed estimators took far more time to fit.

Which Description Method Works Best?

The first question was whether it matters how I pass the information about the time series to the LLM. As mentioned above, the `LLMBlueprintForecaster` has the following three methods:

Provide only simple statistics about the time series (basic)
Provide a plot of the time series (image)
Provide a description of the plotted time series. This description is created by a VLM (described_plot)

In the results below (Figure 2), you can see that the overall best result was achieved by Mistral-Small with the basic statistical description. However, the results fluctuate strongly, and no clear pattern is visible.

Figure 2: The average Mean Squared Errors across the different folds.

To explain why there is no clear pattern, I performed the following small analysis. I analysed how often each estimator is picked by the LLM across five different runs (Figure 3). The estimator picked most often was the TransformedTargetForecaster, which is unsurprising, as it serves as the pipeline wrapper rather than a standalone model.

Figure 3: Average occurrences of each estimator for the three different description methods. The numbers of occurrences are averaged across five runs.
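A count like this can be obtained directly from the spec strings. The following sketch is one possible way to do it, under the assumption that every capitalised identifier followed by an opening parenthesis is an estimator class; `count_estimators` is a hypothetical helper, and the two specs are the example blueprints from earlier in the post rather than real run data.

```python
import re
from collections import Counter


def count_estimators(specs):
    """Count class-like names that are called in each spec string."""
    counts = Counter()
    for spec in specs:
        # A capitalised identifier directly followed by "(" is treated as a class.
        counts.update(re.findall(r"\b([A-Z]\w*)\s*\(", spec))
    return counts


specs = [
    "TransformedTargetForecaster([Deseasonalizer(sp=12), AutoARIMA()])",
    "TransformedTargetForecaster([BoxCoxTransformer(), ExponentialSmoothing(trend='add')])",
]
counts = count_estimators(specs)
print(counts.most_common(3))
```

Averaging such counters over several runs per description method gives the kind of comparison shown in Figure 3.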

When comparing how often the other estimators are selected when using different description methods for the time series, we see that there are minor fluctuations but no systematic differences across methods. This might indicate that the LLM is not really able to get a proper understanding of the time series.

Furthermore, we also observe that the LLM selects only a very small subset of the available transformers and estimators. The cause of this might be a bias either in the system prompt or in the training data, since methods like differencing, ARIMA, or Prophet are widely discussed in the forecasting literature.

Wrapping up, a possible reason why the description method does not seem to impact the performance might be that the LLM is not really understanding time series. However, I think there are also further valid hypotheses, such as:

The airline dataset is so simple that no differences between the description methods occur.
The LLM is not really reasoning about the provided time series. Thus, the provided information is not really used. Instead, the LLM is more or less selecting forecasters in a random order.

Thus, to conclusively assess why the description method does not seem to impact the performance, a more extensive analysis must be performed. This might also include some benchmarking tasks that assess the time series understanding capabilities of LLMs.

Evaluation of best forecasters

I also tracked how the best forecaster found by the LLM so far evolves over time, and what the best model per round is (see Figure 4).

Figure 4: The best forecaster found so far (left), and the best forecaster found in each round (right).

Interestingly, for this dataset, the best score drops very quickly in the beginning but hardly changes in later steps. When checking the best model per round, we see that the results fluctuate strongly without any trend. This suggests that the LLM is not learning from previous evaluation results in a meaningful way. It might be that the LLM picks forecasters rather arbitrarily instead of reasoning about the time series. If so, this would explain the fluctuations of the best forecaster per round. The lack of time series understanding might also explain the fast convergence: a model that understood the time series better might create better models in later rounds.
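The relation between the two curves is simple: the best-so-far curve is the running minimum of the per-round curve. A short sketch with made-up scores (not the values behind Figure 4) makes this explicit:

```python
from itertools import accumulate

# Made-up per-round scores (lower is better), for illustration only.
best_per_round = [410.0, 250.0, 380.0, 255.0, 300.0]

# Running minimum: the best score seen up to and including each round.
best_so_far = list(accumulate(best_per_round, min))
print(best_so_far)

# Rounds in which the incumbent was actually replaced.
improved_in = [
    i for i, (cur, run) in enumerate(zip(best_per_round, best_so_far))
    if cur == run and (i == 0 or run < best_so_far[i - 1])
]
print(improved_in)
```

If the LLM were learning from the feedback, one would expect the per-round scores themselves to trend downwards, not just the running minimum to flatten early.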

Discussion

In this blog post, I evaluated on a single time series, so the insights are limited. However, I think we can still learn some things here:

While it seems natural that the documentation helps the LLM select an estimator, the LLM cannot handle the complete documentation at once. Thus, it might be useful to use other approaches, such as tool calling, to give the LLM information about specific estimators.
I had the feeling that the LLMs I used cannot really understand time series. Thus, testing frontier models might lead to different results. Activating reasoning might also improve the results.
Finally, depending on the selected forecaster and the length of the time series, fitting the forecaster can take quite long. Thus, introducing a time budget per estimator could help to avoid long-running forecasters.

Wrapping up, I see potential in using LLMs for automated time series forecasting. However, there is a lot of room for improvement on the side of the LLMs as well as on the side of the harness, and we might see more here in the future.
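The time budget from the last point could look roughly like the sketch below. It is a hypothetical illustration using only the standard library: `fit_with_budget` and `slow_fit` are invented names, and the thread-based approach only stops waiting for a slow fit, it does not terminate it.

```python
import concurrent.futures
import time


def fit_with_budget(fit_fn, budget_seconds):
    """Run fit_fn in a worker thread; return (result, timed_out)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fit_fn)
    try:
        return future.result(timeout=budget_seconds), False
    except concurrent.futures.TimeoutError:
        # The worker thread is not killed; it just no longer blocks the search.
        return None, True
    finally:
        executor.shutdown(wait=False)


def slow_fit():
    time.sleep(0.5)  # stand-in for a long-running forecaster.fit(...)
    return "fitted"


fast_result = fit_with_budget(lambda: "fitted", budget_seconds=1.0)
slow_result = fit_with_budget(slow_fit, budget_seconds=0.05)
print(fast_result)
print(slow_result)
```

A hard limit that actually frees the CPU would need a separate process that can be terminated, at the cost of pickling the forecaster and the data.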

Conclusion

`craft()` is a small function with outsized impact for agentic ML systems. It solves the problem of bridging LLM text output to library objects cleanly, without requiring custom parsers.

The `LLMBlueprintForecaster` demonstrates that a complete AutoML search loop — propose, evaluate, refine, select — can be built on this single primitive. As LLMs improve at structured and constrained output, the combination becomes even more capable.

References:

The LLMBlueprintForecaster is currently under review in the sktime repo. https://github.com/sktime/sktime/pull/9855
https://github.com/karpathy/autoresearch
