Data Leakage Is Hiding in Your Training Pipeline. Synthetic Databases Can Expose It Before You Train.
Last Updated on May 27, 2026 by Editorial Team
Author(s): Jitendra Devabhakthuni
Originally published on Towards AI.

The best model I ever built turned out to be the worst model I ever built.
It was a payment default prediction model for a fintech client. Precision was 94%. AUC was 0.97. The validation curves were clean. No overfitting. No underfitting. The client was thrilled. We deployed it.
It failed completely in production. AUC dropped to 0.61 overnight.
The post-mortem revealed something embarrassing. One of the features in our pipeline was days_to_default — a field computed from the outcome table that recorded, for each loan, the number of days between disbursement and the actual default event. For accounts that never defaulted, we had filled it with a sentinel value of 999.
The feature was leaking the label. The model had not learned to predict default. It had learned to read it directly from a column that only existed because the outcome had already occurred. In a real-time production setting, that column did not exist at inference time. The model had been trained on the future.
The terrifying part was that standard validation — train/test split, cross-validation, learning curves — caught none of it. The leakage was invisible until deployment.
This article is about how to use synthetic databases to systematically detect data leakage before training begins. Not through model metrics, which leakage exploits, but through structural analysis of the feature pipeline itself.
Why Standard Validation Does Not Catch Leakage
Data leakage is the presence of information in training features that would not be available at inference time. It inflates model performance during validation and collapses it in production.
The reason standard validation fails to catch it is simple: validation uses the same data the leakage came from. If a feature encodes the label, it encodes it consistently across the train and test split. Performance looks identical on both. The split offers no protection because the contamination is uniform.
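To make this concrete, here is a minimal sketch using made-up data and a hypothetical one-line rule standing in for a trained model. The names `SENTINEL`, `make_row`, and `rule` are illustrative assumptions, not part of any real pipeline. Because the contaminated column is present in both halves of the split, the score is the same on each.

```python
import random

random.seed(0)

SENTINEL = 999  # assumed placeholder written for loans that never defaulted

def make_row():
    """One synthetic loan: a label plus a feature derived from the outcome."""
    defaulted = int(random.random() < 0.2)
    # The leaking feature holds a real day count only when the loan defaulted.
    days_to_default = random.randint(30, 720) if defaulted else SENTINEL
    return days_to_default, defaulted

rows = [make_row() for _ in range(2000)]
random.shuffle(rows)
train, test = rows[:1400], rows[1400:]

def rule(days_to_default):
    """Stand-in for a trained model: it reads the label off the feature."""
    return int(days_to_default != SENTINEL)

def accuracy(split):
    return sum(rule(x) == y for x, y in split) / len(split)

train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

At inference time every incoming loan would carry the sentinel, so the same rule would predict "no default" for everyone. Nothing in the held-out score hints at that.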
Three types of leakage are most common in enterprise ML pipelines:
Outcome-derived features. A column computed from or correlated with the label is included as a feature. This is the days_to_default mistake. The feature exists in training data because the outcome already happened. At inference time, the outcome has not happened yet, so the feature is either missing or filled with a meaningless default.
Future-dated aggregates. A feature aggregates events that occurred after the prediction date. Examples: a churn model that includes transactions from the month after the prediction date, or a fraud model whose per-customer fraud count includes incidents that had not yet happened. These features require knowledge of the future that a real-time model cannot have.
Post-event join pollution. A feature is computed by joining a table that is only populated after the event of interest occurs. Account closure date, refund timestamp, complaint resolution status — these fields are populated downstream of the event the model is trying to predict.
All three types are structurally detectable before training. They do not require running a model. They require inspecting the temporal and causal logic of the feature pipeline against a synthetic database where the timeline is explicit and controlled.
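One way to picture such a structural check is as an audit over feature metadata rather than over model output. The sketch below assumes you can record, for each feature, the latest event it reads and whether its source table is only populated after the outcome. `FeatureSource`, `audit_features`, and the table and feature names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FeatureSource:
    """Hypothetical metadata describing where a feature's data comes from."""
    feature: str
    source_table: str
    latest_event_used: date        # most recent event the feature reads
    populated_after_outcome: bool  # True if the table only fills in post-event

def audit_features(sources, prediction_date):
    """Return (feature, reason) pairs for features unavailable at prediction time."""
    findings = []
    for s in sources:
        if s.populated_after_outcome:
            findings.append((s.feature, "source table is populated after the outcome"))
        elif s.latest_event_used > prediction_date:
            findings.append((s.feature, "aggregates events after the prediction date"))
    return findings

prediction_date = date(2024, 6, 30)
sources = [
    FeatureSource("credit_score", "applications", date(2024, 6, 1), False),
    FeatureSource("txn_count_90d", "transactions", date(2024, 7, 31), False),
    FeatureSource("days_to_default", "loan_outcomes", date(2024, 6, 30), True),
]

findings = audit_features(sources, prediction_date)
for feature, reason in findings:
    print(f"{feature}: {reason}")
```

No model is trained here. The audit only compares each feature's provenance against the prediction date, which is the kind of question a controlled synthetic timeline makes easy to answer.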
Why Synthetic Databases Are the Right Tool
The reason synthetic databases are uniquely suited to leakage detection is that you control the ground truth.
In a real production dataset, you cannot always tell whether a feature encodes the label because both were created by the same real-world process. The correlation is ambiguous.
In a synthetic database, you generate the labels from a documented set of causal features and generate every other feature independently of the label. If a feature you generated to be causally unrelated to the label turns out to be highly correlated with it after your pipeline runs, the pipeline is leaking. There is no ambiguity. The synthetic data is not lying. The pipeline is.
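"Highly correlated" needs a baseline, because even a feature that is independent of the label by construction shows a small nonzero correlation by chance. A rough sketch for estimating that noise floor follows. The sample size, the 20% base rate, and the 100 repeated draws are arbitrary assumptions, and `pearson` is a small helper written for this example.

```python
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation; with a 0/1 label this equals the point-biserial."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n = 5000
labels = [int(random.random() < 0.2) for _ in range(n)]

# Draw 100 features that are independent of the label by construction
# and record the largest absolute correlation that chance alone produces.
null_corrs = []
for _ in range(100):
    feature = [random.randint(1, 49) for _ in range(n)]
    null_corrs.append(abs(pearson(feature, labels)))

noise_floor = max(null_corrs)
print(f"largest chance correlation over 100 draws: {noise_floor:.4f}")
```

Anything a non-causal feature shows well above this floor after the pipeline has run is something the pipeline introduced.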
Step 1: Generate a Synthetic Database with Clean Causal Structure
The key to leakage detection with synthetic data is generating features and labels through independent causal pathways and documenting those pathways explicitly.
python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

np.random.seed(42)

def generate_loan_database(n_loans=5000):
    """
    Generate a synthetic loan database with explicit causal structure.

    Causal structure:
    - Default is caused by: credit_score, income_volatility, debt_to_income
    - Default is NOT caused by: loan_officer_id, branch_id, product_code

    Any feature pipeline that produces high correlation between
    non-causal features and the label is leaking.
    """
    loan_ids = [f'LOAN{str(i).zfill(7)}' for i in range(1, n_loans + 1)]
    disbursement_dates = [
        datetime(2022, 1, 1) + timedelta(days=int(np.random.randint(0, 1095)))
        for _ in range(n_loans)
    ]

    # Causal features: these legitimately predict default.
    # size=n_loans draws one value per loan instead of a single scalar.
    credit_scores = np.clip(np.random.normal(670, 90, size=n_loans), 300, 850).astype(int)
    income_volatility = np.random.beta(2, 5, size=n_loans)  # 0 to 1, lower is more stable
    debt_to_income = np.clip(np.random.normal(0.35, 0.15, size=n_loans), 0.05, 0.95)

    # Non-causal features: these should NOT predict default
    loan_officer_ids = np.random.randint(1, 50, size=n_loans)
    branch_ids = np.random.randint(1, 20, size=n_loans)
    product_codes = np.random.choice(['PROD_A', 'PROD_B', 'PROD_C'], size=n_loans)

    # Generate default label from causal features only
    default_prob = (
        0.05 +
        (700 - credit_scores) / 700 * 0.25 +
        income_volatility * 0.20 +
        debt_to_income * 0.15
    )
    default_prob = np.clip(default_prob, 0.01, 0.85)
    defaulted = (np.random.random(n_loans) < default_prob).astype(int)

    # Default date: only exists for defaulted loans
    # This is where pipelines commonly leak
    default_dates = []
    for i, d in enumerate(defaulted):
        if d == 1:
            days_to_default = np.random.randint(30, 720)
            default_dates.append(disbursement_dates[i] + timedelta(days=int(days_to_default)))
        else:
            default_dates.append(None)

    loans_df = pd.DataFrame({
        'loan_id': loan_ids,
        'disbursement_date': disbursement_dates,
        'credit_score': credit_scores,
        'income_volatility': income_volatility,
        'debt_to_income': debt_to_income,
        'loan_officer_id': loan_officer_ids,
        'branch_id': branch_ids,
        'product_code': product_codes,
        'defaulted': defaulted,
        'default_date': default_dates  # <- Leakage risk column
    })
    print(f"Generated {len(loans_df):,} loans")
    print(f"Default rate: {loans_df['defaulted'].mean():.2%}")
    return loans_df

loans_df = generate_loan_database(5000)
Output:
text
Generated 5,000 loans
Default rate: 23.14%
Step 2: Build a Feature Pipeline with Intentional Leakage
Now build a realistic feature pipeline that deliberately includes a leaking feature. This mirrors what happens in production when engineers add features without checking causal validity.
python
def build_features_with_leakage(loans_df):
    """
    Feature pipeline that includes a leaking feature.
    days_since_default_date is computed from the outcome column
    and should not exist at inference time.
    """
    features = loans_df.copy()
    ref_date = datetime(2026, 1, 1)

    # Features that look legitimate at first glance
    features['days_since_disbursement'] = (
        ref_date - pd.to_datetime(features['disbursement_date'])
    ).dt.days
    features['loan_officer_avg_default_rate'] = features.groupby(
        'loan_officer_id'
    )['defaulted'].transform('mean')

    # LEAKING FEATURE: computed from the outcome column.
    # At inference time, default_date does not exist for non-defaulted loans,
    # so the 999 sentinel encodes the label directly.
    features['days_since_default_date'] = features['default_date'].apply(
        lambda d: (ref_date - d).days if pd.notna(d) else 999
    )
    return features

leaked_features_df = build_features_with_leakage(loans_df)
print(f"Feature matrix shape: {leaked_features_df.shape}")
print("\nSample of leaking feature:")
print(leaked_features_df[['loan_id', 'defaulted', 'days_since_default_date']].head(8))
Output:
text
Feature matrix shape: (5000, 13)

Sample of leaking feature:
       loan_id  defaulted  days_since_default_date
0  LOAN0000001          0                      999
1  LOAN0000002          1                      412
2  LOAN0000003          0                      999
3  LOAN0000004          1                      687
4  LOAN0000005          0                      999
5  LOAN0000006          0                      999
6  LOAN0000007          1                      523
7  LOAN0000008          0                      999
The pattern is immediately obvious when you look at it this way: every defaulted loan has a real value; every non-defaulted loan has 999. The feature is a near-perfect encoding of the label.
But this is obvious only because we looked. In a real pipeline with 80 features, this would not stand out without systematic detection.
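One systematic way to surface this pattern, sketched here under assumptions, is to look for a dominant value in each column and ask whether rows with that value have a very different label rate from the rest. The function name `sentinel_report`, the `min_share` cutoff, and the 0.9 gap are illustrative choices, and the toy rows mirror the sample above with an unrelated `branch_id` column added.

```python
from collections import Counter

def sentinel_report(rows, feature_names, label_name, min_share=0.2):
    """Flag features whose most common value splits the label almost perfectly."""
    flagged = []
    for name in feature_names:
        values = [r[name] for r in rows]
        top_value, count = Counter(values).most_common(1)[0]
        if count / len(rows) < min_share or count == len(rows):
            continue  # no dominant value, or the column is constant
        with_top = [r[label_name] for r in rows if r[name] == top_value]
        without = [r[label_name] for r in rows if r[name] != top_value]
        rate_with = sum(with_top) / len(with_top)
        rate_without = sum(without) / len(without)
        if abs(rate_with - rate_without) > 0.9:
            flagged.append((name, top_value, rate_with, rate_without))
    return flagged

rows = [
    {"days_since_default_date": 999, "branch_id": 3, "defaulted": 0},
    {"days_since_default_date": 412, "branch_id": 3, "defaulted": 1},
    {"days_since_default_date": 999, "branch_id": 7, "defaulted": 0},
    {"days_since_default_date": 687, "branch_id": 3, "defaulted": 1},
    {"days_since_default_date": 999, "branch_id": 5, "defaulted": 0},
    {"days_since_default_date": 999, "branch_id": 3, "defaulted": 0},
    {"days_since_default_date": 523, "branch_id": 7, "defaulted": 1},
    {"days_since_default_date": 999, "branch_id": 3, "defaulted": 0},
]

flagged = sentinel_report(rows, ["days_since_default_date", "branch_id"], "defaulted")
for name, value, rate_with, rate_without in flagged:
    print(f"{name}: value {value} -> default rate {rate_with:.2f}, otherwise {rate_without:.2f}")
```

`branch_id` also has a dominant value, but the default rate is similar with and without it, so only the sentinel-filled column is reported.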
Step 3: Automated Leakage Detection
python
from scipy import stats

def detect_leakage(features_df, target_col, feature_cols,
                   correlation_threshold=0.5,
                   known_causal_features=None):
    """
    Detect potential data leakage by measuring correlation between
    each feature and the target variable.

    In a synthetic database with known causal structure:
    - Known causal features may have high correlation (expected)
    - Unknown or non-causal features with high correlation signal leakage

    correlation_threshold: features above this are flagged for review
    known_causal_features: list of features known to be legitimately predictive
    """
    if known_causal_features is None:
        known_causal_features = []
    y = features_df[target_col]

    print("=" * 75)
    print("DATA LEAKAGE DETECTION REPORT")
    print("=" * 75)
    print(f"{'Feature':<40} {'Correlation':<15} {'Type':<15} {'Status'}")
    print("-" * 75)

    leakage_suspects = []
    for col in feature_cols:
        if col not in features_df.columns:
            continue
        x = features_df[col]

        # Point-biserial correlation against the binary target
        if x.dtype == 'object':
            # Encode categorical values as integer codes for the correlation check
            encoded = pd.Categorical(x).codes
            corr, _ = stats.pointbiserialr(y, encoded)
        else:
            corr, _ = stats.pointbiserialr(y, x.fillna(x.median()))
        abs_corr = abs(corr)

        if col in known_causal_features:
            feature_type = "causal"
            status = "✓ EXPECTED" if abs_corr > 0.1 else "⚠ WEAK SIGNAL"
        elif abs_corr > correlation_threshold:
            feature_type = "suspect"
            status = "✗ LEAKAGE RISK"
            leakage_suspects.append((col, abs_corr))
        elif abs_corr > 0.3:
            feature_type = "review"
            status = "⚠ REVIEW"
        else:
            feature_type = "clean"
            status = "✓ CLEAN"
        print(f"{col:<40} {corr:<15.4f} {feature_type:<15} {status}")

    print("=" * 75)
    if leakage_suspects:
        print(f"\n✗ LEAKAGE DETECTED in {len(leakage_suspects)} feature(s):")
        for feat, corr in sorted(leakage_suspects, key=lambda x: -x[1]):
            print(f"  {feat}: correlation = {corr:.4f}")
        print("\n  These features may encode the label directly or through")
        print("  post-event data. Remove before training.")
    else:
        print("✓ No leakage suspects detected above threshold.")
    print("=" * 75)
    return leakage_suspects

feature_cols = [
    'credit_score', 'income_volatility', 'debt_to_income',
    'days_since_disbursement', 'loan_officer_id',
    'loan_officer_avg_default_rate', 'days_since_default_date'
]
known_causal = ['credit_score', 'income_volatility', 'debt_to_income']

suspects = detect_leakage(
    leaked_features_df,
    'defaulted',
    feature_cols,
    correlation_threshold=0.5,
    known_causal_features=known_causal
)
Output:
text
===========================================================================
DATA LEAKAGE DETECTION REPORT
===========================================================================
Feature                                  Correlation     Type            Status
---------------------------------------------------------------------------
credit_score                             -0.3821         causal          ✓ EXPECTED
income_volatility                        0.3012          causal          ✓ EXPECTED
debt_to_income                           0.2934          causal          ✓ EXPECTED
days_since_disbursement                  0.0421          clean           ✓ CLEAN
loan_officer_id                          0.0183          clean           ✓ CLEAN
loan_officer_avg_default_rate            0.3812          review          ⚠ REVIEW
days_since_default_date                  -0.8923         suspect         ✗ LEAKAGE RISK
===========================================================================

✗ LEAKAGE DETECTED in 1 feature(s):
  days_since_default_date: correlation = 0.8923

  These features may encode the label directly or through
  post-event data. Remove before training.
===========================================================================
Two findings worth noting.
days_since_default_date is correctly flagged as leakage at 0.89 correlation. But loan_officer_avg_default_rate is also flagged for review at 0.38. This is a subtler leak: the average is computed over the full dataset, so it includes the loan's own outcome as well as the outcomes of loans the officer issued later. At inference time you cannot compute this average, because those outcomes have not been observed yet. It is a form of temporal leakage through aggregation.
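A point-in-time version of the officer aggregate avoids this. The sketch below is one possible repair, not the only one: for each loan it uses only the officer's earlier loans, and counts a default only if it had already happened by the current loan's disbursement date. The function name `officer_rate_as_of` and the four toy loans are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

def officer_rate_as_of(loans):
    """
    For each loan, compute the officer's default rate from earlier loans only,
    counting only defaults already observed on this loan's disbursement date.
    Returns {loan_id: rate}, with None when the officer has no prior loans.
    """
    history = defaultdict(list)  # officer_id -> default dates (or None) of prior loans
    rates = {}
    for loan in sorted(loans, key=lambda l: l["disbursement_date"]):
        as_of = loan["disbursement_date"]
        prior = history[loan["loan_officer_id"]]
        if prior:
            observed = sum(1 for d in prior if d is not None and d < as_of)
            rates[loan["loan_id"]] = observed / len(prior)
        else:
            rates[loan["loan_id"]] = None
        history[loan["loan_officer_id"]].append(loan["default_date"])
    return rates

loans = [
    {"loan_id": "L1", "loan_officer_id": 7,
     "disbursement_date": date(2022, 1, 10), "default_date": date(2022, 6, 1)},
    {"loan_id": "L2", "loan_officer_id": 7,
     "disbursement_date": date(2022, 3, 1), "default_date": None},
    {"loan_id": "L3", "loan_officer_id": 7,
     "disbursement_date": date(2022, 9, 1), "default_date": date(2023, 2, 1)},
    {"loan_id": "L4", "loan_officer_id": 7,
     "disbursement_date": date(2023, 3, 1), "default_date": None},
]

rates = officer_rate_as_of(loans)
for loan_id, rate in rates.items():
    print(loan_id, rate)
```

The full-dataset mean would assign 0.5 to all four loans. The point-in-time version gives L2 a rate of 0.0, because L1's default had not yet occurred when L2 was disbursed, and it never lets a loan see its own outcome.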
Step 4: Validate That Removing Leakage Reduces Model Performance to Realistic Levels
The final proof that leakage existed is in what happens when you remove it.
python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')

def compare_models_with_and_without_leakage(features_df, target_col):
    """
    Compare AUC before and after removing suspected leaking features.
    A sharp AUC drop confirms leakage was artificially inflating performance.
    """
    # With leakage
    leaking_features = [
        'credit_score', 'income_volatility', 'debt_to_income',
        'days_since_disbursement', 'days_since_default_date'
    ]
    # Without leakage
    clean_features = [
        'credit_score', 'income_volatility', 'debt_to_income',
        'days_since_disbursement'
    ]

    results = {}
    for label, feature_set in [('With Leakage', leaking_features),
                               ('Without Leakage', clean_features)]:
        X = features_df[feature_set].fillna(-1)
        y = features_df[target_col]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42, stratify=y
        )
        model = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=42)
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        results[label] = auc
        print(f"{label:<25} AUC: {auc:.4f}")

    drop = results['With Leakage'] - results['Without Leakage']
    print(f"\nAUC drop after removing leakage: {drop:.4f}")
    if drop > 0.10:
        print("✗ Confirmed: leakage was materially inflating model performance.")
        print("  The production AUC would have been significantly lower.")
    else:
        print("✓ AUC stable. Leakage had minimal performance impact.")
    return results

results = compare_models_with_and_without_leakage(leaked_features_df, 'defaulted')
Output:
text
With Leakage AUC: 0.9712
Without Leakage AUC: 0.7834
AUC drop after removing leakage: 0.1878
✗ Confirmed: leakage was materially inflating model performance.
The production AUC would have been significantly lower.
An AUC drop of 0.19 after removing one column. That is the gap between a model that looks production-ready and a model that actually is.
The Leakage Detection Checklist
Run this against a synthetic database with documented causal structure before every model training run:
- All features documented against a known causal DAG before pipeline construction
- Point-biserial correlation computed between every feature and the label
- Features with correlation above 0.5 reviewed for outcome-derived content
- Aggregated features checked for temporal validity (computed from past data only)
- Join tables checked for post-event population (fields that only exist after the outcome)
- AUC comparison run with and without suspect features to confirm leakage magnitude
- Leaking features removed and pipeline re-documented before training begins
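The correlation items on this checklist can be enforced automatically rather than by eye. Below is a minimal sketch of a gate that fails loudly before training starts. `LeakageError`, `leakage_gate`, and the toy columns are hypothetical, `pearson` is a helper written for this example, and the 0.5 threshold simply reuses the value from the checklist.

```python
def pearson(xs, ys):
    """Pearson correlation; with a 0/1 label this equals the point-biserial."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

class LeakageError(Exception):
    """Raised when a feature outside the documented causal set tracks the label."""

def leakage_gate(columns, label, known_causal, threshold=0.5):
    """Raise LeakageError if any non-causal feature exceeds the threshold."""
    offenders = {}
    for name, values in columns.items():
        if name in known_causal:
            continue  # documented causal features are allowed to be predictive
        r = abs(pearson(values, label))
        if r > threshold:
            offenders[name] = round(r, 4)
    if offenders:
        raise LeakageError(f"possible leakage: {offenders}")

label = [0, 1, 0, 1, 0, 0, 1, 0]
columns = {
    "loan_officer_id": [12, 5, 33, 41, 8, 27, 19, 2],
    "days_since_default_date": [999, 412, 999, 687, 999, 999, 523, 999],
}

try:
    leakage_gate(columns, label, known_causal=set())
    gate_failed = False
except LeakageError as err:
    gate_failed = True
    print(err)

# Dropping the offending column lets the gate pass.
clean_columns = {k: v for k, v in columns.items() if k != "days_since_default_date"}
leakage_gate(clean_columns, label, known_causal=set())
clean_passed = True
```

Run as a pre-training step, a gate like this turns the checklist into a hard stop: the training job never starts while a non-causal feature tracks the label.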
Why Synthetic Data Makes This Possible
On real data, leakage detection is hard because you cannot always tell whether high feature-label correlation is legitimate predictive power or encoded leakage. The real world is messy. Correlations are ambiguous.
On synthetic data with a documented causal structure, every correlation has a ground truth explanation. If a feature you designed to be non-causal shows a correlation with the label well beyond chance-level noise, the pipeline is leaking. There is no other explanation. The synthetic ground truth removes the ambiguity.
This is why synthetic databases belong in your ML validation workflow before you ever touch real training data. Not after the model is built. Before the features are selected.
The Bottom Line
Data leakage makes models look brilliant and perform terribly. It is the most flattering bug in ML and the most damaging one in production.
Standard validation metrics are not designed to catch it. They are inflated by it. The only reliable detection method is structural: inspect the causal validity of each feature against a synthetic database where the ground truth is known and the timeline is controlled.
If a feature you did not intend to be predictive is highly correlated with your label in synthetic data, your pipeline has a leakage problem. Fix the pipeline before you train the model.
Because the alternative is a 0.97 AUC in staging and a 0.61 AUC in production, and a very long Monday.