The ML Evaluation Math You Can Actually Trust
Last Updated on January 2, 2026 by Editorial Team
Author(s): Akshat Shah
Originally published on Towards AI.
Train/Val/Test, Cross-Validation, and Data Leakage
Machine learning isn’t just “a model that predicts things.” In the real world it’s a measurement process. You build a pipeline, and you try to measure how well it will perform on future data you haven’t seen yet.
Train/validation/test splits and cross-validation are the core tools we use to make that measurement statistically sound. Data leakage is what happens when we accidentally corrupt that measurement, usually in subtle ways that still produce impressive numbers… which unfortunately aren’t the numbers we want. This article explains the evaluation workflow through a mathematical lens while staying as ML-101 friendly as possible: every term gets defined, and every equation gets unpacked until its meaning becomes obvious.
The Goal: Estimating Generalization
Let’s set up the supervised learning problem. You have a dataset of n examples, written as ordered pairs: (xᵢ, yᵢ) for i = 1…n
Here:
- xᵢ is the feature vector for an example data point i. A feature is a measurable input variable you use to make a prediction (square footage, number of bedrooms, neighborhood, etc.). Formally, we often treat xᵢ as a point in ℝᵖ. All this means is xᵢ has p numerical components: xᵢ ∈ ℝᵖ
- yᵢ is the label (also called the target) for example i. A label is the ground-truth output you want the model to learn to predict (like the sale price of the house). For regression tasks, yᵢ is typically real-valued: yᵢ ∈ ℝ
A model is a function f_θ(x) that maps features to a prediction: ŷ = f_θ(x)
The symbol θ denotes the parameters of the model; these are the internal values the training algorithm is allowed to adjust. In a model like linear regression, θ would be the vector of weights and a bias term; in a neural network, θ is a large collection of weights across layers. To judge whether predictions made by our model are “good,” we need a loss function ℓ(⋅ , ⋅). A loss function takes a true label y and a predicted label ŷ, and returns a nonnegative number measuring error. For regression tasks, common losses include:
squared error: ℓ(y, ŷ) = (y − ŷ)², or absolute error: ℓ(y, ŷ) = ∣y − ŷ∣
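In code, these losses are one-liners. A minimal sketch (the function names are mine, and the house prices are made up):

```python
def squared_error(y, y_hat):
    """Squared error: penalizes large mistakes quadratically."""
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """Absolute error: penalizes mistakes linearly."""
    return abs(y - y_hat)

# A house sold for 300k; the model predicted 280k.
print(squared_error(300_000, 280_000))   # 400000000
print(absolute_error(300_000, 280_000))  # 20000
```

Notice how harshly the squared error punishes the same 20k mistake; that asymmetry is exactly why the choice of loss is a modelling decision.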
At this point, we can write down the key quantity we care about: the model’s expected error on new data. The phrase “new data” actually has a technical meaning: it refers to new examples drawn from the same underlying population as your dataset. We represent that population as a probability distribution 𝒟 over pairs (x, y)
The true (population) risk of parameters θ is:

ℛ(θ) = E_{(x, y) ∼ 𝒟}[ ℓ(y, f_θ(x)) ]
Every term matters:
- (x, y) ∼ 𝒟 means “(x, y) is drawn from the real-world data-generating distribution.” Referring to our running house price example, this is a mathematical way of saying: a future house you haven’t seen yet.
- E[⋅] is an expectation, i.e., an average over many hypothetical future draws.
- ℓ(y, f_θ(x)) is the loss incurred on one future example.
This is the quantity you wish you could compute directly. The key word here is WISH… you cannot actually compute it since 𝒟 is unknown. You only have a finite sample (your dataset), and that is why we need evaluation protocols.
Key Insight: The “real” objective is the population risk ℛ(θ): expected error on future data. Unfortunately, we can’t directly compute it. Thus, everything about train/val/test is an attempt to estimate it without cheating (leakage).
Training: replacing the unknown world with the data you have
Since ℛ(θ) is unknown, training substitutes it with a computable approximation using the dataset. This is called empirical risk minimization.
The empirical risk (training error) is:

ℛ̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(yᵢ, f_θ(xᵢ))
Here’s what each piece means:
- The summation adds up the loss across all training examples.
- The factor 1/n makes it an average loss per example.
- Importantly, the pairs (xᵢ, yᵢ) are the data you already observed.
Training then chooses parameters that minimize this empirical risk:

θ̂ = argmin_θ ℛ̂(θ)
Note: The symbol argmin_θ means “the value of θ that makes the expression as small as possible.”
Intuitively, this equation says: find the model parameters that make the model’s predictions match the training labels as closely as possible under the chosen loss function.
Equivalently: find the model parameters that minimize the loss averaged across all of your training labels.
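The empirical risk above is just an average loss, which makes it a few lines of NumPy. This is a sketch with a made-up toy dataset and a linear model f_θ(x) = x·θ (no bias term, for brevity):

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average squared error of a linear model f_theta(x) = x . theta."""
    predictions = X @ theta
    return np.mean((y - predictions) ** 2)

# A made-up toy dataset: 4 examples, 2 features each.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = np.array([5.0, 4.0, 9.0, 14.0])

# theta = [1, 2] reproduces every label exactly, so the risk is 0.
print(empirical_risk(np.array([1.0, 2.0]), X, y))  # 0.0
# A worse theta gives a larger average loss.
print(empirical_risk(np.array([0.0, 0.0]), X, y))  # 79.5
```

Training is the search for the θ that drives this number down; everything that follows is about whether that number means anything for new data.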
However, minimizing training loss alone does not guarantee good future performance. A model can do extremely well on the training set while doing poorly on new data. This phenomenon is called overfitting. The model learned patterns that are specific to the training sample (including random noise), rather than properly generalizing.
Key Insight: Training is an optimization problem on observed data. Generalization is a statistical claim about unseen data. Those are not the same thing. A well trained model might be overfitted and struggle with generalization.
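Overfitting is easy to reproduce. In this pure-NumPy sketch (the data, seed, and polynomial degrees are all illustrative), a degree-7 polynomial interpolates all 8 noisy training points, driving training error to essentially zero, while a plain line is typically the better model on fresh points from the same underlying relationship:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying relationship y = 2x + noise.
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2 * x_train + rng.normal(scale=0.3, size=8)
x_test = np.linspace(0.05, 0.95, 8)           # fresh draws from the same world
y_test = 2 * x_test + rng.normal(scale=0.3, size=8)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

line = np.polyfit(x_train, y_train, deg=1)    # low capacity
wiggly = np.polyfit(x_train, y_train, deg=7)  # interpolates all 8 points

# The degree-7 fit "wins" on the training data (error ~ 0), but that says
# nothing about its error on the held-out points.
print(mse(line, x_train, y_train), mse(wiggly, x_train, y_train))
print(mse(line, x_test, y_test), mse(wiggly, x_test, y_test))
```

The training-error comparison always favors the high-capacity model; only the held-out comparison reveals whether it actually generalizes.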
Why one dataset is not enough: the role of train/validation/test
Once you accept that training loss can be misleading, you need an evaluation design that separates three distinct activities: learning the parameters, making modelling decisions, and producing an unbiased final evaluation of performance. That separation is exactly what the train/validation/test split accomplishes.
Training set (parameter estimation)
The training set is the subset of the data used to fit θ̂. In other words, the sum in the training objective is computed over training examples only. If the training index set is T ⊂ {1,…,n}, you can write:

θ̂ = argmin_θ (1/∣T∣) ∑_{i ∈ T} ℓ(yᵢ, f_θ(xᵢ))
Here |T| means “number of training examples.”
Validation set (model selection)
The validation set is used for model selection. You do NOT train on this data. It is used to make design decisions: comparing candidate pipelines and choosing hyperparameters.
A hyperparameter is a configuration choice that is not learned by minimizing the loss directly but instead is chosen externally. Examples include the regularization strength λ in ridge regression, the maximum depth in a decision tree, the learning rate in gradient descent, or even which features to include. Hyperparameters matter because they control the model’s capacity and therefore affect overfitting.
If the validation index set is V, the validation error for parameters trained on T is:

ℛ̂_V(θ̂) = (1/∣V∣) ∑_{i ∈ V} ℓ(yᵢ, f_θ̂(xᵢ))
Notice what changed: the parameters θ̂ were learned from T, but we compute the loss on V, which contains examples the model did not train on.
That “not trained on” property is crucial: it makes validation performance a rough indicator for future performance.
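Concretely, the split-then-evaluate protocol can be sketched in pure NumPy (the synthetic data, the 80/20 split size, and the variable names are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 3))                  # made-up features
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Shuffle indices, then carve out disjoint train (T) and validation (V) sets.
idx = rng.permutation(n)
T, V = idx[:80], idx[80:]

# Fit theta by least squares on the training rows only.
theta_hat, *_ = np.linalg.lstsq(X[T], y[T], rcond=None)

# Validation error: loss of the T-trained model on rows it never saw.
val_error = np.mean((y[V] - X[V] @ theta_hat) ** 2)
print(val_error)  # close to the 0.1^2 = 0.01 noise floor, not zero
```

The key line is the last one: θ̂ came from T, the loss is computed on V, and the two index sets never overlap.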
Test set (final evaluation)
The test set is reserved for a single purpose: producing a final estimate of generalization performance after all decisions have been fixed. If you keep checking the test set while iterating on the model, then the test set is no longer a neutral judge… it becomes part of the decision process, and your reported metric becomes optimistic.
The core idea is that evaluation must mimic deployment: in deployment, you will see new examples once, and you will not get to redesign your pipeline after seeing the outcomes.
Key Insight: Validation influences decisions; therefore it cannot be the final judge. The test set exists to be untouched by model selection.
Cross-validation: averaging over splits to reduce randomness
Unfortunately, a single train/validation split can be noisy. If your dataset is small, the validation performance may depend heavily on which examples happen to land in the validation set. Cross-validation reduces this dependence by repeating the evaluation across multiple splits.
Defining k-fold cross-validation
In k-fold cross-validation, we partition the dataset indices into k disjoint subsets (folds) F₁, F₂, …, Fₖ such that:
- Fⱼ ∩ Fₘ = ∅ for j ≠ m
- the folds are roughly equal in size.
- together, the folds cover the whole dataset:

F₁ ∪ F₂ ∪ ⋯ ∪ Fₖ = {1, …, n}
For each fold j, we train on all data except Fⱼ, and we validate on Fⱼ. Let the training indices for fold j be T^(j) = {1,…,n} \ Fⱼ. Then we train:

θ̂^(j) = argmin_θ (1/∣T^(j)∣) ∑_{i ∈ T^(j)} ℓ(yᵢ, f_θ(xᵢ))
Now compute the validation loss on the held-out fold:

E^(j) = (1/∣Fⱼ∣) ∑_{i ∈ Fⱼ} ℓ(yᵢ, f_θ̂^(j)(xᵢ))
Finally, the cross-validation estimate is the average across folds:

CV = (1/k) ∑_{j=1}^{k} E^(j)
This equation is often written compactly, but the meaning is straightforward: we simulate k different “unseen validation sets,” and we average the resulting errors. That makes the estimate less sensitive to a lucky or unlucky split.
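The whole procedure fits in one short function. Below is a from-scratch sketch for a linear model (in practice scikit-learn's KFold or cross_val_score do this for you; the helper names, data, and seed here are mine):

```python
import numpy as np

def k_fold_cv_error(X, y, k, fit, loss):
    """Average held-out loss across k disjoint folds F_1 ... F_k."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)    # partition of the indices
    fold_errors = []
    for F_j in folds:
        T_j = np.setdiff1d(np.arange(n), F_j)  # train on everything but F_j
        theta_j = fit(X[T_j], y[T_j])
        fold_errors.append(np.mean(loss(y[F_j], X[F_j] @ theta_j)))
    return np.mean(fold_errors)                # the CV estimate

# Least-squares fit and squared-error loss for a linear model.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
loss = lambda y, y_hat: (y - y_hat) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=60)

cv_estimate = k_fold_cv_error(X, y, k=5, fit=fit, loss=loss)
print(cv_estimate)  # close to the 0.1^2 = 0.01 noise variance
```

Each example is held out exactly once, so every data point contributes to the estimate without ever being evaluated by a model that trained on it.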
How Cross Validation is used in practice
You’re probably wondering, after all that math, how to use cross-validation in practice. It is most often used to choose hyperparameters. Suppose you have a set of candidate hyperparameters Λ (for example, λ ∈ {0.01, 0.1, 1, 10}). For each λ, you compute the cross-validation estimate CV(λ) defined above and then choose:

λ̂ = argmin_{λ ∈ Λ} CV(λ)
Key Insight: Cross-validation does not “train a better model” by itself. It produces a better measurement of performance by averaging over multiple train/validate splits. We can use these results to intelligently choose hyperparameters and other details about our model.
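Here is how that selection might look for ridge regression, using the closed-form ridge solution and a small from-scratch CV helper (the candidate grid, data, and all names are illustrative):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: theta = (X'X + lam*I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5):
    """k-fold CV estimate of squared error for a given lambda."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errs = []
    for F in folds:
        T = np.setdiff1d(np.arange(n), F)
        theta = fit_ridge(X[T], y[T], lam)
        errs.append(np.mean((y[F] - X[F] @ theta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=50)

candidates = [0.01, 0.1, 1.0, 10.0]            # the set Lambda
best_lam = min(candidates, key=lambda lam: cv_error(X, y, lam))
print(best_lam)
```

Note that the test set appears nowhere in this loop; λ̂ is chosen entirely within the training data, which is what keeps the final test evaluation honest.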
Data leakage: a precise definition and why it breaks the math
Finally, we can rigorously define data leakage. Data leakage occurs when information that would not be available at prediction time influences training or evaluation. This causes the measured validation/test error to underestimate the true risk ℛ(θ). In other words, leakage means your evaluation is no longer an honest estimate of future performance in real-world usage. Leakage comes in a few major forms.
(a) Preprocessing leakage (statistics computed using non-training data)
Many preprocessing steps are themselves learned from data. For instance, standardization transforms each feature to have mean zero and variance one:

x̃ = (x − μ) / σ
Here μ (mean) and σ (standard deviation) are not constants. They are estimated from data. If you compute μ and σ using the entire dataset (including validation/test), then your training pipeline has absorbed information about the distribution of the unseen data. Your evaluation is now “easier” than real life, because in real life you would not know those future statistical details at training time.
The correct protocol instead is:
- split dataset into train/val/test,
- compute μ and σ using training data only,
- apply the same transform to validation/test.
The same logic applies to most transformations that “fit” something.
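The correct protocol for standardization, in code (a NumPy sketch with made-up data; scikit-learn's StandardScaler inside a Pipeline automates the same discipline):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))  # one made-up feature
train, test = X[:80], X[80:]

# CORRECT: mu and sigma are estimated from the training rows only ...
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # ... and the SAME mu, sigma are reused here.

# WRONG (preprocessing leakage): statistics computed on the full dataset,
# so the transform has absorbed information about the test rows.
mu_leaky, sigma_leaky = X.mean(axis=0), X.std(axis=0)

print(float(train_scaled.mean()))  # 0 by construction (up to float error)
print(float(test_scaled.mean()))   # near 0, but not exactly 0 -- that's fine
```

The slightly off-center test mean is not a bug; it is exactly the uncertainty a deployed model would face, and hiding it is what makes leaky evaluation too optimistic.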
(b) Target leakage (features that encode the label)
Target leakage is when a feature contains the answer, directly or indirectly. In house pricing, a feature like “final assessed tax value” might be computed using the sale price, making it a disguised version of the target. The model can achieve extremely low error, but only because it is using information that would not exist at prediction time.
Formally, if a feature x depends on y in a way that violates the causal direction available at prediction time, then you are no longer modelling p(y∣x) under the deployment, real-world setting.
(c) Split leakage (same entity appears in train and test)
If near-duplicate examples appear across splits (the same house listed multiple times, the same user appearing in multiple rows, the same patient contributing multiple samples), then the model can partially memorize the entity and appear to generalize. This is why grouped splitting (keeping all samples from an entity in one split) is often necessary.
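A grouped split is straightforward to implement (scikit-learn's GroupKFold offers this directly; below is a minimal from-scratch version with a made-up house_id column):

```python
import numpy as np

# Each row belongs to an entity; here, a made-up house_id per listing.
house_id = np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 5])

rng = np.random.default_rng(3)
shuffled = rng.permutation(np.unique(house_id))
test_entities = set(shuffled[:2].tolist())   # hold out two whole entities

# Grouped split: every row of an entity lands on exactly one side.
test_mask = np.array([g in test_entities for g in house_id])
train_idx = np.where(~test_mask)[0]
test_idx = np.where(test_mask)[0]

# Sanity check: no entity appears on both sides of the split.
overlap = set(house_id[train_idx]) & set(house_id[test_idx])
print(overlap)  # set()
```

The split is done over entities rather than rows, so memorizing one listing of a house can never masquerade as generalizing to that house.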
(d) Time leakage (training uses information from the future)
If the data is time-dependent, random splits can place future examples in training and past examples in validation. That creates evaluation conditions that are impossible in deployment, where you always train on the past and predict the future. Proper evaluation uses time-ordered splits.
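The simplest time-ordered split looks like this (scikit-learn's TimeSeriesSplit generalizes it to rolling windows; the day indices and split sizes here are made up):

```python
import numpy as np

# Day index for each observation, in arrival order. With time-dependent
# data, never shuffle before splitting.
t = np.arange(100)

# Time-ordered split: train strictly precedes validation, which strictly
# precedes test -- exactly the situation at deployment time.
train_idx = np.arange(0, 60)
val_idx = np.arange(60, 80)
test_idx = np.arange(80, 100)

assert t[train_idx].max() < t[val_idx].min()
assert t[val_idx].max() < t[test_idx].min()
print(t[train_idx].max(), t[val_idx].min())  # 59 60
```

The two assertions encode the deployment constraint: nothing the model is evaluated on may come from before anything it trained on.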
Key Insight: Leakage is much more than “a small bug.” It changes the meaning of your validation/test loss, turning it into an estimate of performance in a world where your model has extra information.
The final, correct workflow
Here is a one-liner that defines our machine learning train/val/test workflow: “Every step that learns from data is fit only on the training portion, model selection touches only the validation data, and the test data stays completely untouched until the final evaluation.”
You begin by defining the deployment scenario: what counts as “future,” and what information will be available at prediction (inference) time. You then split the data accordingly. Random splits are fine when examples are independent and identically distributed (i.i.d., meaning each sample is drawn from the same distribution and does not influence the others), but grouped or time-based splits are necessary when the data violates that assumption.
With the split fixed, you build the entire preprocessing-to-model pipeline so that all fitted components (scalers, imputers, encoders, feature selectors) are trained on the training set only. Hyperparameter tuning is done using validation or cross-validation strictly within the training data. Finally, once the pipeline design is determined, you evaluate one time on the test set and treat that number as your estimate of generalization.
This is what makes the estimate of ℛ(θ) trustworthy, rather than it being a number inflated by accidental access to hidden information.
Key Insight: A trustworthy metric is not produced by a “good model.” It is produced by a protocol that preserves the separation between learning, choosing, and judging.
Takeaways
Train/val/test and cross-validation exist for one reason: to make your metric mean “this will work on new data,” not “this fit my dataset.” Train is where you learn parameters, validation (or CV) is where you choose the pipeline and hyperparameters, and test is the final reality check that must stay untouched. Data leakage is anything that breaks that contract: letting any future or hidden information seep into training or evaluation turns your score into a fantasy number from an easier world. The ML litmus test, as I like to call it, is brutal but simple: could you have known this at prediction time? If not, your evaluation is lying.
As always, keep learning
- Training vs Testing vs Validation Sets (geeksforgeeks.org)
- Machine Learning Q&A: All About Model Validation (in.mathworks.com)
- Train Test Validation Split: How To & Best Practices (v7labs.com)
- A Gentle Introduction to k-fold Cross-Validation (machinelearningmastery.com)
- Cross-validation: evaluating estimator performance (scikit-learn.org)
Published via Towards AI
Note: Article content contains the views of the contributing authors and not Towards AI.