Why Your Sales Forecast Is Always 20% Wrong (And How To Make It 12% Wrong)

Author(s): Kamrun Nahar

Originally published on Towards AI.

Real World Sales Forecasting Playbook

The Single Most Useful Picture I Have Ever Seen

The thing that changed how I worked was not a model. It was a 2×2 grid. Two professors put it on paper in 2005. The grid has saved me from picking the wrong model approximately 600 times.

The Forecastability Map. Every product on your shelf belongs to one of these four boxes. Pick the wrong box, pick the wrong model.

This is the Syntetos-Boylan classification. Two axes. Two numbers per SKU.

The x-axis is the Average Demand Interval (ADI). Take all the time periods in your history. Count how many had at least one sale. Now divide the total number of periods by that. If you sell something every day, ADI is 1. If you sell it every other day on average, ADI is 2. The bigger ADI gets, the rarer the SKU.

The y-axis is the squared coefficient of variation (CV²) of the non-zero demand sizes. Take only the periods where you sold something. Compute the standard deviation of the quantities. Divide by the mean. Square the result. This tells you how variable the size of demand is when it does happen.

The thresholds are 1.32 for ADI and 0.49 for CV². Why those exact numbers? They came from empirical analysis of real industrial data. The math is in a 2005 paper. Don’t go chasing the original PDF. The boundaries are good enough as rules of thumb.

Four boxes pop out of this.

              ADI < 1.32          |  ADI >= 1.32
                                  |
CV² < 0.49    SMOOTH              |  INTERMITTENT
              Frequent.           |  Rare.
              Steady.             |  Steady when it happens.
                                  |
              --------------------|---------------------------
                                  |
CV² >= 0.49   ERRATIC             |  LUMPY
              Frequent.           |  Rare AND variable.
              Variable.           |  The nightmare.

Every product on your shelf needs a different forecasting approach. Treating them all the same is why your one big model fails on the long tail.
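To make the two numbers concrete, here is a minimal sketch of the classification in plain Python. The function name, the sample histories, and the choice of population standard deviation are my assumptions, not part of the original paper. The cutoffs are the ones quoted above.

```python
from statistics import mean, pstdev

ADI_CUTOFF = 1.32   # thresholds quoted in the text
CV2_CUTOFF = 0.49

def classify_demand(history):
    """Return (adi, cv2, box) for one SKU's per-period demand history."""
    nonzero = [q for q in history if q > 0]
    if not nonzero:
        raise ValueError("no sales in history; ADI is undefined")
    adi = len(history) / len(nonzero)              # total periods / periods with a sale
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2   # squared CV of the non-zero sizes
    if adi < ADI_CUTOFF:
        box = "smooth" if cv2 < CV2_CUTOFF else "erratic"
    else:
        box = "intermittent" if cv2 < CV2_CUTOFF else "lumpy"
    return adi, cv2, box

# Hypothetical histories, one per box
milk = [10, 11, 9, 10, 12, 10, 11, 9]
slate = [0, 0, 2, 0, 0, 0, 3, 0, 0, 2, 0, 0]
book = [2, 14, 3, 110, 5, 1, 40]
topper = [0, 0, 0, 1, 0, 0, 0, 0, 12, 0, 0, 0]

for name, hist in [("milk", milk), ("slate", slate), ("book", book), ("topper", topper)]:
    adi, cv2, box = classify_demand(hist)
    print(f"{name:7s} ADI={adi:.2f} CV2={cv2:.2f} -> {box}")
```

Run it once per SKU and store the box as a column. Everything downstream keys off that column.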

Let me walk you through each box. Pour yourself something. We’ll be a while.

Box One. Smooth.

This is the dream. Sells every day. Roughly the same quantity. The textbook stuff.

A small grocery store sells milk like this. So does a hospital pharmacy with basic painkillers. So does a power company billing residential customers. Frequent. Steady. The data scientist’s friend.

For smooth demand, almost any classical method works. AutoARIMA. Exponential Smoothing (ETS). A simple regression with calendar features and a couple of lags. The fancy methods barely beat the simple methods. You will see WAPE around 8 to 15 percent and you will look like a magician. The model will not save you. The data is already easy.

The mistake people make on smooth data is over-engineering. They reach for an LSTM. They tune hyperparameters for two weeks. They get a 0.4 percent improvement and call a meeting.

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive

df = pd.read_csv("smooth_skus.csv", parse_dates=["ds"])  # cols: unique_id, ds, y
sf = StatsForecast(
    models=[AutoETS(season_length=7), AutoARIMA(season_length=7),
            SeasonalNaive(season_length=7)],
    freq="D", n_jobs=-1)   # daily data, 7-day weekly cycle, all cores
sf.fit(df)                 # fits one of each model per unique_id
fcst = sf.predict(h=28)    # 4 weeks out
print(fcst.head())         # check it. should look boring. boring is good.

Line by line, because the Reddit poster asked for thoroughness.

import pandas as pd. Pandas is the spreadsheet library every Python data person uses. The as pd is a community nickname so we never have to type the long name.

from statsforecast import StatsForecast. The orchestrator class from Nixtla's library. It is fast because it parallelizes across SKUs without you having to write a loop.

from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive. Three classical models. AutoETS picks the best exponential smoothing variant for you. AutoARIMA does the (p,d,q) search you used to do by hand at 2 AM. SeasonalNaive is the dumb baseline that says "last week's same weekday." Always include the dumb baseline.

df = pd.read_csv(...). Reads the CSV. The library expects three columns. unique_id (which SKU), ds (which date), y (how many sold).

StatsForecast(models=[...], freq="D", n_jobs=-1). We instantiate. freq="D" is daily. n_jobs=-1 means "use every CPU." The library is genuinely fast.

sf.fit(df). Behind the scenes, it groups by unique_id and fits each model to each SKU's history. No loop. You're welcome.

fcst = sf.predict(h=28). Predict 28 days ahead. Each model produces its own column.

print(fcst.head()). Eyeball test. For a smooth SKU, the three forecasts should agree closely. If they wildly disagree, your data isn't actually smooth, and you're in the wrong box.

Why this matters. If your SKU is genuinely smooth, this snippet is your whole pipeline. You don’t need LightGBM. You don’t need Prophet. You don’t need a paper from NeurIPS.

The smooth demand dream. Live it while you can.

Box Two. Intermittent.

Now it gets interesting.

A roofing supply store sells a particular size of slate tile maybe four times a month. When it sells, it sells two or three boxes at a time. Always two or three. Never twenty. Never zero point seven. Frequency is low. Quantity is consistent.

For demand that is rare but steady-quantity, the classical models go quiet. ARIMA wants regular data. ETS wants a trend or a season. There isn’t one. There are just zeros, interrupted by a normal number, then more zeros.

Enter John Croston. 1972. Yorkshire. He probably had a slide rule. He had a brilliant idea.

Forecast two things separately. How big an order is when it happens. How long the gap between orders is. Divide the first by the second, and you get a demand rate per period.

That’s the whole method. From the Watergate era. And it still beats every neural network on sparse data most of the time. Most of the time. We will get to the exceptions.

Croston’s method has a known problem. It is positively biased. It tends to over-forecast. The larger the smoothing parameter alpha, the worse the bias gets. In 2005, the same Syntetos and Boylan who made the quadrant figured out a fix. Multiply the forecast by (1 - alpha/2). That's it. That fix is now called the Syntetos-Boylan Approximation (SBA) and it is the default people should be using.
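If you want to see the mechanics, here is a hand-rolled sketch of Croston and the SBA correction. It assumes a single smoothing constant for both size and interval, and it initialises on the first sale. Those are my simplifications for illustration, and library implementations may initialise differently.

```python
def croston(history, alpha=0.1, sba=False):
    """Per-period demand rate from Croston's method (SBA-corrected if sba=True)."""
    size = None       # smoothed non-zero demand size
    interval = None   # smoothed gap, in periods, between non-zero demands
    gap = 1           # periods elapsed since the last non-zero demand
    for qty in history:
        if qty > 0:
            if size is None:                     # initialise on the first sale
                size, interval = float(qty), float(gap)
            else:                                # update only when a sale happens
                size += alpha * (qty - size)
                interval += alpha * (gap - interval)
            gap = 1
        else:
            gap += 1
    if size is None:
        return 0.0                               # never sold, nothing to forecast
    rate = size / interval                       # "divide the first by the second"
    return rate * (1 - alpha / 2) if sba else rate

# Hypothetical SKU: two units every third period
tile = [0, 0, 2, 0, 0, 2, 0, 0, 2]
classic_rate = croston(tile)
sba_rate = croston(tile, sba=True)
```

On this toy series the classic rate is two units per three periods, and SBA shaves off alpha/2 of it. Notice that nothing inside the loop runs on zero periods except the gap counter. That detail is the whole reason the next variant exists.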

There is a further variant called TSB (Teunter-Syntetos-Babai). This one solves a different problem. Croston only updates its forecast when a non-zero demand happens. If a product goes obsolete (nobody buys it anymore), Croston keeps forecasting like it’s still selling, because it never updates downward. TSB updates the probability of a sale every period, including the zero periods. So if your SKUs go obsolete (and in retail, they all eventually do), use TSB.
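A minimal sketch of the TSB idea, under my own assumption that the state starts at the first sale with a sale probability of 1. The parameter names mirror the library's alpha_d and alpha_p. The histories are made up.

```python
def tsb(history, alpha_d=0.1, alpha_p=0.1):
    """Per-period forecast from TSB: probability of a sale times expected size."""
    size = None   # smoothed non-zero demand size
    prob = None   # smoothed probability that a period has a sale
    for qty in history:
        sold = 1.0 if qty > 0 else 0.0
        if size is None:
            if sold:                         # assumption: start at the first sale
                size, prob = float(qty), 1.0
            continue
        prob += alpha_p * (sold - prob)      # updated EVERY period, zeros included
        if sold:
            size += alpha_d * (qty - size)   # size only moves when a sale happens
    return 0.0 if size is None else prob * size

# Hypothetical SKU that sells for a while, then goes obsolete for 30 periods
live = [2, 0, 2, 0, 2, 0]
dead = live + [0] * 30

live_forecast = tsb(live)
dead_forecast = tsb(dead)
```

After thirty straight zeros the probability has decayed to a few percent of where it was, and the forecast follows it down. Croston, fed the same dead history, would still return the rate it had at the last sale.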

Here’s the working code.

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import CrostonClassic, CrostonSBA, TSB, IMAPA

df = pd.read_csv("intermittent_skus.csv", parse_dates=["ds"])
models = [
    CrostonClassic(),                # the original 1972 method
    CrostonSBA(),                    # bias-corrected version
    TSB(alpha_d=0.1, alpha_p=0.1),   # handles obsolescence
    IMAPA(),                         # ensembles across time scales
]
sf = StatsForecast(models=models, freq="D", n_jobs=-1)
sf.fit(df)
fcst = sf.predict(h=28)   # demand rate over 28-day horizon
print(fcst.head())

CrostonClassic(). The original method. Known bias. Good baseline.

CrostonSBA(). Syntetos-Boylan corrected. Less bias. Almost always slightly better.

TSB(alpha_d=0.1, alpha_p=0.1). Smoothing constants. alpha_d is for demand size, alpha_p is for the probability of a sale. Lower values mean smoother (slower to react). 0.1 is a sane default.

IMAPA(). Intermittent Multiple Aggregation Prediction Algorithm. Long name. Simple idea. It forecasts at multiple time aggregation levels (daily, weekly, monthly) and combines them. Often a slight winner. Always worth including.

Why this matters. Intermittent SKUs are the long tail of every catalog. They are usually 60 to 80 percent of your product list. If you use ARIMA or Prophet on them, you will quietly hemorrhage forecast accuracy across the whole company. Knowing this one trick puts you ahead of most teams.

Croston’s method. Older than my mom’s first car. Still beats your neural net on niche stock.

Box Three. Erratic.

Frequent sales. Wildly different sizes.

A bookstore sells two copies of a book on Monday. Fourteen on Tuesday. Three on Wednesday. Then a hundred and ten on Thursday because a famous person quoted it on a podcast. The book sells every day. The quantity swings like a pendulum.

This is where most retail SKUs that “are doing fine” actually live. Most data scientists call this “noisy.” It is. The job is to handle the noise without freaking out.

Three things help.

Use a robust loss function. Squared error punishes one massive miss far more than eight medium misses that add up to the same total error. That’s bad when the one weird day is a TikTok spike. Use quantile loss or Tweedie loss instead. Quantile loss at the 50th percentile forecasts the median, which is less sensitive to outliers than the mean.

Predict ranges, not points. Forecast the 10th, 50th, and 90th percentiles. Now you have a forecast distribution, not a guess. The planner gets to decide how cautious to be. You stop being responsible for the magic number.

Engineer features that explain the spikes. Most “random” volatility isn’t random. It’s a payday. A holiday. A weather event. A scheduled promotion. The lift you get from a calendar with regional holidays will dwarf the lift from any clever model. I have watched a team add a single is_payday_week column and see WAPE drop by six points.
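The first two ideas above, median loss and quantile ranges, come down to one function. Here is a sketch of the pinball (quantile) loss, plus a brute-force search for the constant forecast that minimises it. The helper names and the sample week are hypothetical. A real model minimises the same loss with gradient boosting instead of a search.

```python
def pinball_loss(actuals, forecast, q):
    """Average pinball (quantile) loss of one constant forecast at quantile q."""
    total = 0.0
    for y in actuals:
        diff = y - forecast
        # under-forecasts cost q per unit, over-forecasts cost (1 - q) per unit
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(actuals)

def best_constant(actuals, q):
    """Observed value with the lowest pinball loss at quantile q."""
    return min(sorted(set(actuals)), key=lambda c: pinball_loss(actuals, c, q))

week = [2, 14, 3, 110, 5, 1, 8]   # hypothetical daily sales with one viral spike
mean_forecast = sum(week) / len(week)
p10, p50, p90 = (best_constant(week, q) for q in (0.1, 0.5, 0.9))
print(f"mean={mean_forecast:.1f}  P10={p10}  P50={p50}  P90={p90}")
```

The mean gets dragged to about 20 by one spike. The median stays at 5. The P90 tells the planner the spike is possible. Three numbers, one honest picture.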

A small story. There is a small chain in the Midwest that sells outdoor furniture. For three years their April forecasts were terrible. Always too low. Their fancy LSTM did not know. Their AutoARIMA did not know. A summer intern noticed that the local university had a “spring move-in” weekend the first Saturday of April, and every parent in a ninety-mile radius needed a patio set. They added one column. is_university_weekend. April error dropped from 28 percent to 11 percent. The intern got a graduation gift.

The takeaway. For erratic SKUs, do not chase a better model. Find the business event that nobody coded. Then code it.

Erratic demand. The signal is in the calendar. The model isn’t doing the work.

Box Four. Lumpy.

Now I’m going to tell you something most courses won’t.

Some products are unforecastable.

Wedding-cake toppers shaped like horses. Replacement parts for a discontinued 1998 refrigerator. The 4XL pink hoodie. These sell rarely. When they sell, they sell in wildly different quantities. ADI high. CV² high. The math is screaming “you do not have enough data here.”

The professional response is not to build a fancier model. The professional response is to stop forecasting and switch to inventory policy.

Inventory policy is a different beast. You set a reorder point and a reorder quantity. When stock falls below the reorder point, you buy more. You do not pretend you know when the wedding-cake topper is going to sell. You make sure that when it does, you have one or two in the back.

This is genuinely the answer for the lumpy quadrant. A small buffer of stock plus a reorder rule is better than a sophisticated forecast that’s just guessing. And it has been the answer in operations research for decades. The fact that it sounds anti-climactic is not a defect.
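A reorder rule is easy to simulate. Below is a minimal sketch of a reorder point plus fixed order quantity policy. The function name, the demand series, and every parameter value are hypothetical, and it reorders when stock on hand plus stock on order is at or below the reorder point.

```python
def simulate_reorder_policy(demand, reorder_point, order_qty, lead_time, start_stock):
    """Simulate a (reorder point, order quantity) policy.

    Returns (lost_sales, final_stock).
    """
    stock = start_stock
    pipeline = {}   # arrival period -> units on order
    lost = 0
    for t, qty in enumerate(demand):
        stock += pipeline.pop(t, 0)               # receive anything due this period
        sold = min(stock, qty)
        lost += qty - sold                        # unmet demand walks out the door
        stock -= sold
        on_order = sum(pipeline.values())
        if stock + on_order <= reorder_point:     # check the inventory position
            arrival = t + lead_time
            pipeline[arrival] = pipeline.get(arrival, 0) + order_qty
    return lost, stock

# Hypothetical lumpy SKU: three sales in sixteen periods, sizes 1, 2, 1
topper_demand = [0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0]

with_policy = simulate_reorder_policy(topper_demand, reorder_point=1, order_qty=2,
                                      lead_time=3, start_stock=2)
no_reorder = simulate_reorder_policy(topper_demand, reorder_point=1, order_qty=0,
                                     lead_time=3, start_stock=2)
```

No forecast anywhere in that code. The policy never asks when the next sale will come. It only asks whether the shelf can survive one.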

Look at it this way. A seismologist does not “predict” earthquakes. They warn that a region has higher risk. Lumpy SKUs are earthquakes. Don’t predict the day. Predict the risk.

Hot take. A serious chunk of “AI demand forecasting” software is selling forecasting for products that should be on inventory policy. The buyer feels good. The product is at peace. The model is hallucinating.

The lumpy quadrant. Where data scientists go to learn humility.

The Hidden Step. Sales Is Not Demand.

Before we go further, I need to tell you the single most uncomfortable truth in retail forecasting.

The y column in your data is not demand. It is sales. Those are different.

If a shelf was empty for two days, the sales for those days are zero. The demand was not zero. Customers came in, wanted to buy the thing, and walked out because the thing was missing. That is censored demand. Your model sees the zero and thinks “ah, demand fell.” Your model is wrong.

Same applies if a SKU is “out of distribution” because of a system migration, a typhoon, a delivery truck that didn’t show, or a SKU that simply doesn’t exist in that region anymore.

The fix is operational, not statistical.

1. Get a daily availability table. For each (date, store, SKU), was it available?
2. Mask out days where availability was zero.
3. EITHER drop those rows entirely, OR fill them with a proxy (e.g., the average of comparable available days), and flag them.
4. Add a column `was_stocked_out` so the model knows.

Most teams forget step one and the entire forecasting effort is biased downward. Forever.
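The four steps above can be sketched in a few lines. This version works on plain dictionaries, and it uses the mean of all available days as the proxy, which is a cruder stand-in for "comparable available days". The function name and the sample rows are hypothetical.

```python
from statistics import mean

def mask_stockouts(rows, impute=True):
    """Flag stocked-out days, then either drop them or fill them with a proxy.

    rows: list of dicts with keys "ds", "y" and "available" (bool).
    """
    available_sales = [r["y"] for r in rows if r["available"]]
    proxy = mean(available_sales) if available_sales else None
    cleaned = []
    for r in rows:
        out = dict(r, was_stocked_out=not r["available"])
        if not r["available"]:
            if not impute or proxy is None:
                continue        # option A: drop the censored day entirely
            out["y"] = proxy    # option B: replace the fake zero with a proxy
        cleaned.append(out)
    return cleaned

# Hypothetical week for one (store, SKU): shelf empty on days 3 and 4
raw = [
    {"ds": "2024-03-01", "y": 4, "available": True},
    {"ds": "2024-03-02", "y": 5, "available": True},
    {"ds": "2024-03-03", "y": 0, "available": False},
    {"ds": "2024-03-04", "y": 0, "available": False},
    {"ds": "2024-03-05", "y": 6, "available": True},
]
imputed = mask_stockouts(raw, impute=True)
dropped = mask_stockouts(raw, impute=False)
```

Either option is defensible. Leaving the zeros in is not.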

The empty shelf is not zero demand. It’s invisible demand.

The Other Thing That Wins. Features.

I want to land a hard truth right here. Feature engineering beats model selection by a factor of about ten to one. Anyone who has shipped a forecast in anger will agree.

What features actually move WAPE downward.

LAGS             y at t-1, t-7, t-14, t-28, t-365
ROLLING STATS    mean, std, max, min over the last 7, 14, 28 days
CALENDAR         dayofweek, month, weekofyear, is_weekend, is_holiday
HOLIDAYS         regional. country-level is not enough.
PROMOTIONS       flag and discount depth. as a feature, not magic.
PRICING          current price, price change vs last week
WEATHER          7-day forecast, NOT actual. you don't know actuals.
EXTERNAL EVENTS  payday, school start, university move-in, sports games
PRODUCT META     category, brand, supplier, lifecycle stage
CROSS-SERIES     sales of related SKUs, last-week category total

A note on weather, since the Reddit poster asked about exogenous variables. You can only use features whose values you know at the moment you make the forecast. Tomorrow’s weather forecast is OK. Tomorrow’s actual weather is not. This is the most common bug in real forecasting code. People train using actual weather and then are confused when production accuracy crashes. The model leaked the future during training.
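The same rule applies to lags and rolling stats. Every feature on a row must be built from values strictly before that row. Here is a small sketch of what that looks like by hand. The function name and the toy series are hypothetical, and in practice a library does this for you.

```python
def lag_features(series, lags=(1, 7), window=7):
    """Build per-row features that only use values strictly before each row."""
    rows = []
    for t in range(len(series)):
        feats = {"t": t, "y": series[t]}
        for lag in lags:
            # value from `lag` periods ago, or None if history is too short
            feats[f"lag_{lag}"] = series[t - lag] if t - lag >= 0 else None
        past = series[max(0, t - window):t]   # slice stops BEFORE t, so y[t] is excluded
        full = len(past) == window
        feats[f"roll_mean_{window}"] = sum(past) / window if full else None
        rows.append(feats)
    return rows

daily = list(range(10, 20))    # hypothetical ten days of sales: 10, 11, ..., 19
features = lag_features(daily)

# Leak check: changing today's sales must not change today's features
tampered = daily[:]
tampered[8] = 999
features_tampered = lag_features(tampered)
```

The leak check at the bottom is worth stealing. Overwrite the target on one row, rebuild, and confirm that row's features did not move. If they moved, the feature saw the answer.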


Here is the LightGBM forecasting recipe most of the M5 competition winners used. The M5 was a Kaggle competition where Walmart released roughly five years of daily sales for 3,049 products across 10 stores, about 30,000 product-store series in total. Almost every top solution was a LightGBM ensemble. Not a neural network. Not Prophet. Trees, with the right features, trained on the right slices.

import pandas as pd
import lightgbm as lgb
from mlforecast import MLForecast
from mlforecast.lag_transforms import RollingMean, RollingStd

df = pd.read_csv("sales.csv", parse_dates=["ds"])    # unique_id ds y + static cols
exog = pd.read_csv("exog.csv", parse_dates=["ds"])   # promos, prices, etc
# assumption: exog.csv also holds the known-in-advance values for the next 28 days
exog_future = exog[exog["ds"] > df["ds"].max()]
df = df.merge(exog, on=["unique_id", "ds"], how="left")
for col in ["category", "store", "region"]:          # assumed to be columns of sales.csv
    df[col] = df[col].astype("category")             # LightGBM rejects raw strings
mlf = MLForecast(
    models=[lgb.LGBMRegressor(objective="quantile", alpha=0.5,
                              n_estimators=600, learning_rate=0.05,
                              num_leaves=64, min_child_samples=20)],
    freq="D",
    lags=[1, 7, 14, 28, 365],
    lag_transforms={1: [RollingMean(7), RollingStd(28)],
                    7: [RollingMean(28)]},
    date_features=["dayofweek", "month", "weekofyear", "quarter"],
)
mlf.fit(df, static_features=["category", "store", "region"], dropna=True)
fcst = mlf.predict(h=28, X_df=exog_future)   # exog_future covers horizon
print(fcst.head())

Walking through it.

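from mlforecast import MLForecast. The library that wraps any sklearn-style regressor and adds the time-respecting feature creation. Its lag and rolling features are built only from past values, so they cannot leak the future into training. That is the whole point. The exogenous columns you merge in yourself are still your responsibility.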

from mlforecast.lag_transforms import RollingMean, RollingStd. Helpers to compute rolling statistics on lagged values. Critical for capturing recent trend without leaking.

df.merge(exog, ...). Brings the extra features in. Promotions, prices, anything else you have at the (unique_id, date) grain.

lgb.LGBMRegressor(objective="quantile", alpha=0.5, ...). Gradient boosted decision trees. objective="quantile", alpha=0.5 says "predict the median, not the mean." This makes the model robust to outliers, which is exactly what you want for erratic demand.

n_estimators=600, learning_rate=0.05. Six hundred trees, slow learning. Conservative. Resistant to overfitting on noisy SKUs.

num_leaves=64, min_child_samples=20. Tree complexity controls. 64 leaves per tree, minimum 20 samples per leaf. Prevents the model from carving up the data into useless tiny corners.

lags=[1, 7, 14, 28, 365]. Tells the library to create five lagged features per SKU. Yesterday's sales. Last week same day. Two weeks ago. Roughly a month ago. Same day last year.

lag_transforms={1: [RollingMean(7), RollingStd(28)], 7: [RollingMean(28)]}. On top of the raw lags, compute a 7-day rolling mean and a 28-day rolling std on the lag-1 feature, and a 28-day rolling mean on the lag-7 feature. This gives the model a smoothed signal of recent behavior.

date_features=[...]. Calendar features. Free signal. Day of week is usually the strongest one in retail data.

static_features=["category", "store", "region"]. Things that don't change per SKU over time. The model treats them as plain categorical features.

dropna=True. Drops the early rows where lag features can't be computed yet.

mlf.predict(h=28, X_df=exog_future). Predicts 28 days ahead. The X_df argument carries the future values of any exogenous columns you need at prediction time. The library auto-fills the calendar columns.

Why this matters. This recipe replaces about 80 percent of the “should I use Prophet or N-BEATS or PatchTST” debates in industry. LightGBM with the right features is the workhorse. Almost always. Get good at this and you save yourself reading another 100 papers.

LightGBM in the gym. Prophet on the couch. Feature engineering is the protein shake.

The Hierarchy Problem. When Numbers Refuse To Add Up.

You have built your model. You have produced forecasts for every (store, SKU) pair. You proudly send them to finance.

Two hours later, the CFO emails. “Why does your store-level total not match the regional VP’s forecast?” Three hours after that, the regional VP emails. “Why does my forecast not match the company total?” By 5 PM, everyone is in a Slack channel called #forecast-disagreement and you are quietly opening LinkedIn.

This is hierarchical forecasting.

There are four main approaches. Three are classical. One is modern.

Bottom-up. Forecast the leaves. Sum upward. Simple. The total is the sum of the parts. But each leaf forecast is noisy because each leaf has thin data.

Top-down. Forecast the root (company total). Split it down by historical share. Smooth at the top. Loses local detail. Bad for product-level decisions.

Middle-out. Forecast at a middle level. Disaggregate down. Aggregate up. A compromise.

MinT (minimum trace reconciliation). Forecast every level independently. Then mathematically nudge the forecasts so they’re coherent (they add up) and so the total error variance is minimized. This was published in 2019 by Wickramasuriya, Athanasopoulos, and Hyndman. It is the modern winner. It works because it uses information from every level when reconciling, not just the leaves or the root.
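To see what "coherent" means in numbers, here is a toy hierarchy with one total and two stores. The sketch shows bottom-up, top-down, and an ordinary least-squares reconciliation, which is the special case of MinT where every series gets the same weight. The numbers, the shares, and the function names are made up, and the closed form only holds for this two-leaf hierarchy.

```python
# Hierarchy: total = store_A + store_B. Row order everywhere: [total, A, B].
S = [[1, 1],   # total = A + B
     [1, 0],   # A
     [0, 1]]   # B

def aggregate(bottom):
    """Apply the summing matrix: bottom-level values -> every level."""
    return [sum(s * b for s, b in zip(row, bottom)) for row in S]

def bottom_up(base):
    """Trust the leaves, rebuild the total from them."""
    return aggregate(base[1:])

def top_down(base, shares):
    """Trust the total, split it by historical share."""
    return aggregate([base[0] * w for w in shares])

def ols_reconcile(base):
    """Least-squares projection onto coherent forecasts (equal-weight MinT)."""
    total, a, b = base
    # closed form of (S'S)^-1 S' applied to base, for this two-leaf hierarchy
    new_a = (total + 2 * a - b) / 3
    new_b = (total - a + 2 * b) / 3
    return aggregate([new_a, new_b])

base = [100.0, 70.0, 40.0]        # incoherent: 70 + 40 is not 100
bu = bottom_up(base)              # total moves up to match the stores
td = top_down(base, (0.6, 0.4))   # stores move to match the total
ols = ols_reconcile(base)         # everybody moves a little
```

Bottom-up lets the stores win. Top-down lets the total win. The least-squares version spreads the 10-unit disagreement across all three numbers. MinT proper does the same thing, but weights each series by how noisy its forecast errors have been.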

You don’t have to implement this from scratch. Nixtla’s hierarchicalforecast library does it.

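import pandas as pd
from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import BottomUp, TopDown, MinTrace

Y_df = pd.read_csv("hierarchical_sales.csv", parse_dates=["ds"])  # history + in-sample fitted values
S_df = pd.read_csv("summing_matrix.csv")                          # who-rolls-up-into-who
base = pd.read_csv("base_forecasts.csv", parse_dates=["ds"])      # output of your base model
tags = {}   # fill in before running: level name -> array of unique_ids at that level
reconcilers = [BottomUp(), TopDown(method="forecast_proportions"),
               MinTrace(method="mint_shrink")]
hrec = HierarchicalReconciliation(reconcilers=reconcilers)
# mint_shrink estimates a covariance from in-sample residuals, so Y_df must be passed too.
# The summing-matrix keyword is S here; some releases of the library name it S_df instead.
Y_rec = hrec.reconcile(Y_hat_df=base, Y_df=Y_df, S=S_df, tags=tags)
print(Y_rec.head())   # everything adds up. CFO calms down.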

HierarchicalReconciliation. The orchestrator. Takes your base forecasts and produces coherent versions.

BottomUp, TopDown, MinTrace. Three reconciliation methods. Run them all and compare on a backtest.

S_df. The summing matrix. A 0/1 matrix that encodes "this row rolls up into that row." Most teams write a helper function that walks their hierarchy and emits this matrix.

MinTrace(method="mint_shrink"). The shrinkage variant of MinT. The covariance matrix estimation is hard with limited data so we shrink it toward a diagonal. This is more stable on real-world data than the full-covariance version.

hrec.reconcile(...). Returns reconciled forecasts. Now your SKU-level numbers add up to your store totals which add up to the regional totals which add up to the company number. The Slack channel quiets down.

Why this matters. Coherence is a political problem disguised as a statistical one. A regional VP whose number doesn’t match the CFO’s number will not trust your forecast next quarter. Reconciliation is the technical fix for an organizational reality.

The numbers finally agree with each other. The CFO sleeps. So do you.

The Metric Problem. Or, How To Stop Lying To Yourself.

Here is a fact people learn the hard way. The metric you optimize is the metric you become. Pick the wrong metric and your model will be excellent at the wrong thing.

The default metric most data scientists reach for is MAPE (Mean Absolute Percentage Error). Average of |actual - forecast| / actual. Bounded between zero and infinity. Sounds great.

MAPE has one problem. It divides by the actual value. If your actual value is zero, MAPE is infinite or undefined. And in retail data, zeros are everywhere. Daily SKU forecasts have zero days constantly. Your MAPE explodes. People then “handle” it by removing zero days. Now you’ve thrown out half your data. Now your metric is lying to you in a different direction.

Stop using MAPE on demand data. I’ll say it twice. Stop using MAPE on demand data.

Here is the cheat sheet for real metrics.

+---------+------------------------------+------------------------+
| Metric  | When to use                  | Worst trap             |
+---------+------------------------------+------------------------+
| MAE     | Single series, similar scale | Can't compare scales   |
| RMSE    | Punish large errors more     | Outliers dominate      |
| MAPE    | Demos and marketing          | Breaks on zeros        |
| sMAPE   | Symmetric percentage         | Still ugly on zeros    |
| WAPE    | Total business volume        | Hides per-SKU pain     |
| MASE    | Compare across series        | Needs naive baseline   |
| RMSSE   | M5-style multi-series        | Same logic as MASE     |
| Pinball | Quantile forecasts           | Different per quantile |
+---------+------------------------------+------------------------+

The two metrics you should default to are WAPE and MASE.

WAPE is sum(|actual - forecast|) / sum(actual). It is volume-weighted (dollar-weighted if y is revenue). Big SKUs matter more than small ones. Finance speaks this language.

MASE is your model’s MAE divided by a naive model’s MAE, usually measured on the training history. If MASE is less than 1, your model beats the dumb baseline. If MASE is greater than 1, you would have been better off with the dumb baseline.
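Here are all three metrics as plain functions, a sketch rather than a library. The MASE version scales by the in-sample error of a naive "same as last period" forecast, which is the usual convention. The toy series is hypothetical and deliberately full of zeros.

```python
def mape(actual, forecast):
    """Mean absolute percentage error. Divides by each actual value."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def wape(actual, forecast):
    """Total absolute error divided by total actual volume."""
    abs_err = sum(abs(a - f) for a, f in zip(actual, forecast))
    return abs_err / sum(abs(a) for a in actual)

def mase(actual, forecast, train, season=1):
    """Forecast MAE divided by the in-sample MAE of a (seasonal) naive forecast."""
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    return mae / (sum(naive_errors) / len(naive_errors))

# Hypothetical SKU that sells every other day
train = [3, 0, 4, 0, 5, 0, 4, 0]
actual = [4, 0, 5, 0]
forecast = [3, 1, 4, 1]

try:
    mape_value = mape(actual, forecast)
except ZeroDivisionError:
    mape_value = None            # MAPE is undefined as soon as one actual is zero

wape_value = wape(actual, forecast)
mase_value = mase(actual, forecast, train)
```

MAPE dies on the first zero. WAPE and MASE shrug and return a number. That is the whole argument in ten lines.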

There is also a workflow concept called Forecast Value Added (FVA). Mike Gilliland from SAS popularized it. The idea is dead simple. At every step of your forecasting process, ask “did this step make the forecast better than the previous step?” If your statistical model has worse MASE than the naive baseline, kill the model. If the demand planner’s manual adjustment makes WAPE worse than the model, kill the manual adjustment. Most teams have never asked this question. Most teams discover that one of their steps is making the forecast actively worse.

Why this matters. Picking the right metric makes the right model win automatically. Picking the wrong metric makes you defend a worse model for a year because the metric says you’re doing fine.

MAPE walking into the production environment for the last time.

Backtesting. The Skill That Saved My Career.

You cannot train-test split a time series randomly. I am begging you. Do not do this.

If you randomly split a time series into train and test sets, you’re letting the model see Tuesday so it can predict Monday. That’s cheating. Your test accuracy will look brilliant. Your production accuracy will be a graveyard.

The correct approach is walk-forward cross-validation with an expanding (or sliding) window.

The only honest way to test a sales forecast. Train on the past. Predict the next slice. Slide forward. Repeat. Mimic production.

import pandas as pd
import lightgbm as lgb
from mlforecast import MLForecast

df = pd.read_csv("sales.csv", parse_dates=["ds"])
mlf = MLForecast(models=[lgb.LGBMRegressor(n_estimators=400, num_leaves=64)],
                 freq="D", lags=[1, 7, 14, 28],
                 date_features=["dayofweek", "month"])
cv_df = mlf.cross_validation(df=df, h=28, n_windows=5, step_size=28)

def wape(g):   # the metric, volume-weighted absolute error
    return (g["y"] - g["LGBMRegressor"]).abs().sum() / g["y"].abs().sum()

results = cv_df.groupby("cutoff").apply(wape)
print(results)   # five WAPEs. one per fold. variance tells the story.

cv_df = mlf.cross_validation(df=df, h=28, n_windows=5, step_size=28). This runs five separate train-and-predict iterations. Each one trains on data up to a cutoff, predicts the next 28 days, advances the cutoff, repeats. Five honest measurements.

results = cv_df.groupby("cutoff").apply(wape). Computes WAPE per fold. You get five numbers.

print(results). If your folds look like 0.18, 0.21, 0.19, 0.62, 0.20, the fourth one is screaming. Investigate. Was there a promotion in that window? A holiday? A stockout? Don't average it away. The variance across folds is the actual uncertainty in your model's performance.

Why this matters. A single 80/20 split gives you one measurement. One measurement is a story, not a statistic. Five rolling folds give you a distribution. Distributions tell you when to trust the model and when to be terrified.
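If you want to see where the folds actually land, here is a sketch of the index arithmetic behind an expanding-window backtest. The function name is hypothetical and it works on row positions rather than dates. The arguments mirror the h, n_windows, and step_size used above.

```python
def walk_forward_folds(n_obs, horizon, n_windows, step_size=None):
    """Yield (cutoff, test_end) row positions, oldest fold first.

    Each fold trains on rows [0, cutoff) and tests on rows [cutoff, test_end).
    """
    step = step_size or horizon
    for k in range(n_windows - 1, -1, -1):
        test_end = n_obs - k * step     # the last fold ends at the end of the data
        cutoff = test_end - horizon
        if cutoff <= 0:
            raise ValueError("not enough history for this many folds")
        yield cutoff, test_end

# One year of daily data, five folds of 28 days each
folds = list(walk_forward_folds(n_obs=365, horizon=28, n_windows=5))
for cutoff, test_end in folds:
    print(f"train on rows 0..{cutoff - 1}, test on rows {cutoff}..{test_end - 1}")
```

Every training window ends before its test window starts, and every later fold sees more history than the one before. No Tuesday ever gets to predict Monday.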

Walking forward in time, on rails, in chunks. The only backtest that doesn’t lie.

The Production Truths Nobody Mentions.

You shipped a model. It works. Now what.

Here is the rest of the iceberg.

Stability matters as much as accuracy. A model that has 16 percent WAPE every week is better than a model that ping-pongs between 6 and 24 percent. Operations teams plan based on your forecast. If your forecast swings, their planning swings. Their planning swinging costs money. Stability is the silent metric. Track week-over-week change of your forecast and reject models that are too erratic across runs.

The data drifts. Always. New SKUs launch. Old SKUs die. Sales teams restructure. Suppliers change. A model trained six months ago is forecasting yesterday’s business. Build a monitoring job. Compare last month’s WAPE to the rolling 6-month average. If it crosses a threshold, retrain. Most production systems retrain weekly or monthly.
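The monitoring job can start as one function. This is a sketch that assumes you already log one WAPE per month. The 1.25 tolerance is a placeholder I picked for illustration, not a recommended value.

```python
def needs_retrain(monthly_wape, tolerance=1.25):
    """True if the latest month's WAPE exceeds tolerance x the prior six-month mean."""
    if len(monthly_wape) < 7:
        return False                       # not enough history to judge drift
    *history, latest = monthly_wape[-7:]   # six prior months, then the latest
    baseline = sum(history) / len(history)
    return latest > tolerance * baseline

# Hypothetical WAPE logs, oldest month first
steady = [0.18, 0.19, 0.17, 0.18, 0.20, 0.19, 0.19]
drifted = [0.18, 0.19, 0.17, 0.18, 0.20, 0.19, 0.27]

print(needs_retrain(steady), needs_retrain(drifted))
```

Wire the result to an alert before you wire it to an automatic retrain. The first few triggers are usually a data problem, not a model problem.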

Promotions need their own model. I cannot stress this enough. If you train one model on a mix of promo and non-promo days, the model will learn an averaged blob and forecast wrong in both modes. Train a baseline model on non-promo days. Train a lift model that predicts the promo bump. Multiply. This is the structure most retail demand planning systems actually use behind the scenes.
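The structure is easier to see with the models stripped out. In this sketch a plain average stands in for the baseline model and a single ratio stands in for the lift model. Both are placeholders for whatever you actually train, and the history is made up.

```python
from statistics import mean

def fit_baseline_and_lift(history):
    """history: list of (units, is_promo). Returns (baseline, lift multiplier)."""
    normal = [units for units, is_promo in history if not is_promo]
    promo = [units for units, is_promo in history if is_promo]
    baseline = mean(normal)                          # stand-in for the baseline model
    lift = mean(promo) / baseline if promo else 1.0  # stand-in for the lift model
    return baseline, lift

def forecast(baseline, lift, future_promo_flags):
    """Baseline on normal days, baseline times lift on promo days."""
    return [baseline * (lift if flag else 1.0) for flag in future_promo_flags]

# Hypothetical history: about 10 units on normal days, 30 on promo days
history = [(10, False), (12, False), (8, False), (30, True), (10, False), (30, True)]
baseline, lift = fit_baseline_and_lift(history)
blended = mean(units for units, _ in history)   # what one mixed model would learn
plan = forecast(baseline, lift, [False, True, False])
```

The blended average lands near 17. That is too high for a normal day and far too low for a promo day. Wrong in both modes, exactly as described.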

Don’t trust the historical average for forecasts during a transition. New store openings, system migrations, COVID-style shocks. Your history is no longer your future. Use the smallest amount of recent data you can defend, and lean heavily on cross-SKU features (sales of related products in stores that aren’t transitioning).

Talk to operations. This is the unfair advantage senior data scientists have. The sales team knows that the third Wednesday of every month is slow because a competing store does a flash sale. The warehouse manager knows that 30 percent of returns happen in the first week of January. None of this is in your data. You have to ask.

The whole pipeline. The model is one box. The rest is the work.

Three War Stories. Names Removed. Lessons Earned.

The promo that swallowed the model.

A regional chain ran a 70-percent-off promo on slow-moving SKUs for four days. The model trained on the data afterward thought those SKUs had a permanent new high baseline. It over-ordered for the next six months. The fix was a binary is_promo_period flag and training a separate "lift" model that predicted only the promo bump. Lesson. Promos are a different distribution. Don't mix them.

The pipeline feature that was the future.

A B2B sales team gave us a pipeline_value feature. It looked perfect. The model got 97 percent accuracy in backtests. In production, the model was 50 percent off in the first week. Turns out the pipeline_value field was updated daily and back-filled to include closed-won deals up to 30 days retroactively. So the "pipeline" feature contained the answer. The model was memorizing the future at training time. Lesson. Audit every external feature for its true timestamp. Shift everything by at least one period. If anyone says "this feature is just current state," ask them three times.

The hierarchy that wouldn’t add up.

A manufacturing company had three forecasting teams. SKU level. Family level. Category level. Three different models. The sum of the SKU forecasts was 14 percent off the category number. The CFO noticed in a board meeting. The CFO was not pleased. We added MinT reconciliation in a week. The numbers added up. The CFO calmed down. The fix was not a better model. It was a coherence step nobody had thought to install.

Pattern across all three. The “model” was never the problem. The problem was always something around the model. Features. Pipelines. Math.

The three forecasting ghosts. They visit every team eventually. Pre-pay your therapist.

A Decision Tree. Your Cheat Sheet.

Print this. Tape it to your monitor.

1. Classify the SKU. Compute ADI and CV². Which box does it live in?
   - Smooth → AutoETS or AutoARIMA. Don't overthink.
   - Intermittent → CrostonSBA or TSB. Add a global LightGBM if you have many.
   - Erratic → LightGBM with quantile loss. Add calendar and event features.
   - Lumpy → Stop forecasting. Switch to inventory policy.
2. Check for stockouts. Mask or impute the censored days. Flag them.
3. Engineer features.
   - Lags (1, 7, 14, 28, 365)
   - Rolling stats
   - Calendar + holidays (regional!)
   - Promotions (as a flag AND a depth)
   - External events you can confirm in advance
4. Train globally if you have lots of SKUs. One model, all SKUs, SKU ID as a categorical feature.
5. Reconcile across the hierarchy with MinT.
6. Use WAPE and MASE. Never MAPE.
7. Backtest with five expanding-window folds. Look at the variance.
8. Predict quantiles (P10, P50, P90). Ship the distribution.
9. Monitor drift weekly. Retrain monthly.
10. Talk to operations. Add a feature column for the boring fact they tell you that nobody else thought to ask about.

This is the playbook. It’s not novel. It’s not exciting. It will outperform 80 percent of the forecasting work I’ve seen, including some I’ve personally produced.


Published via Towards AI



Note: Article content contains the views of the contributing authors and not Towards AI.