From A/B Testing to DoubleML: A Data Scientist’s Guide to Causal Inference

Last Updated on September 23, 2025 by Editorial Team

Author(s): Rohit Yadav

Originally published on Towards AI.

Image by Author

This article is a comprehensive guide to the most common causal inference techniques, complete with practical examples and code. While the scenarios are inspired by real-world use cases I have worked on, the data has been synthetically generated for clear and reproducible demonstrations.

Introduction: Why Causal Inference Matters

Imagine your team has launched a new “smart reply” feature for customer support, aimed at reducing the average handle time (AHT) for a ticket. A quarter later, the main dashboard is glowing: AHT has dropped by 18%. The team is celebrating, but a sharp question arises in the business review: “How do we know the feature caused this drop? Seasonality and other business changes also happened in that period.”

This is one of the most critical questions you can face. Answering it requires moving beyond correlation to establish causation. This article is a guide to the essential causal inference techniques you can use to answer these questions with confidence.

First, the Core Concepts

Before opening the toolbox, let’s understand the two fundamental challenges we’re up against:

1. The Counterfactual Problem

The core challenge of causal inference is that we can never observe what would have happened in an alternate universe. If a user gets a new feature and buys a product, we can never know whether they would have bought it anyway without the feature. This unobserved outcome is called the counterfactual. Our goal is to use data to create the most reasonable estimate of this counterfactual. The causal effect is the difference between what happened and what would have happened in the counterfactual scenario.

Representation of causal effect (Image by Author)
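
To make this concrete, here is a minimal sketch (a toy simulation of my own, not data from the case study): we generate both potential outcomes for each agent and then hide one, exactly the way reality does.

import numpy as np
import pandas as pd

np.random.seed(0)
n = 5
# Both potential outcomes exist conceptually, but only one is ever observed.
aht_without_tool = np.random.normal(300, 20, n)   # AHT if the agent never gets the tool
aht_with_tool = aht_without_tool - 25             # AHT for the same agent with the tool

treated = np.random.binomial(1, 0.5, n)
observed_aht = np.where(treated == 1, aht_with_tool, aht_without_tool)
counterfactual_aht = np.where(treated == 1, aht_without_tool, aht_with_tool)

df_po = pd.DataFrame({
    'treated': treated,
    'observed_aht': observed_aht.round(1),
    'counterfactual_aht': counterfactual_aht.round(1),  # never available in real data
})
print(df_po)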

2. Confounding Variables

When you see a feature and an outcome moving together, it’s tempting to connect them. But more often than not, there is a hidden third factor, called a confounding variable, influencing both. For example, more experienced agents might be more likely to adopt a new tool and also have lower handle times. Is the tool making them faster, or are faster agents just more likely to use the tool? Experience is the confounder. Our methods are designed to control for these hidden factors.

Correlation vs. Causation: The Trap

It’s very common to equate correlation with causation, but this is a classic pitfall we should all be careful of. An observed correlation between a treatment and an outcome doesn’t prove the treatment caused the outcome.
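
To see how badly the trap can bite, here is a small sketch (synthetic data of my own, with the tool’s true effect hard-coded to zero) where experience drives both adoption and handle time, so a naive comparison still “finds” an effect:

import numpy as np
import pandas as pd

np.random.seed(1)
n = 2000
experience = np.random.uniform(0.5, 5, n)
# Experience drives both adoption and handle time; the tool itself does nothing here.
prob_adopt = 1 / (1 + np.exp(-(experience - 2.5)))
adopted_tool = np.random.binomial(1, prob_adopt)
aht = 350 - 20 * experience + np.random.normal(0, 15, n)

df_conf = pd.DataFrame({'experience': experience, 'adopted_tool': adopted_tool, 'aht': aht})

# The naive adopter-vs-non-adopter gap looks like a real improvement...
adopter_mean = df_conf.loc[df_conf['adopted_tool'] == 1, 'aht'].mean()
non_adopter_mean = df_conf.loc[df_conf['adopted_tool'] == 0, 'aht'].mean()
print(f"Naive gap: {adopter_mean - non_adopter_mean:.1f} seconds (true causal effect: 0)")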

The Core Causal Inference Toolbox

My first step in any analysis like this is to establish the gold standard. In a perfect world, we would have run an A/B Test, or a Randomized Controlled Trial (RCT), from the very beginning.

1. A/B Testing (Randomized Controlled Trials)

  • When would you use this? Whenever you can. If you are launching a new feature, product, or campaign, this is the most reliable way to measure its true impact.
  • The Core Idea (In simple words): Think of it as a clean scientific experiment. You randomly assign users to a “treatment” group (who get the feature) or a “control” group (who don’t). This randomization washes out all the hidden factors, meaning any difference between the groups can be confidently attributed to your feature.
Randomized Control Trial for A/B Testing (Image by Author)
  • Code in Practice: Since we couldn’t turn back time, I used a simulation to show the product team what a clean A/B test would have looked like. This helped set the stage for the methods we could use.
import pandas as pd
import numpy as np
from scipy import stats

#Simulating a simple A/B test
np.random.seed(42)
n_agents = 500
group = np.random.choice(['Control', 'Treatment'], size=n_agents)
base_aht = 300
treatment_effect = -25
noise = np.random.normal(0, 50, n_agents)

aht = np.where(group == 'Control', base_aht + noise, base_aht + treatment_effect + noise)
df_ab = pd.DataFrame({'agent_id': range(n_agents), 'group': group, 'aht': aht})

#Analyzing the results
control_mean = df_ab[df_ab['group'] == 'Control']['aht'].mean()
treatment_mean = df_ab[df_ab['group'] == 'Treatment']['aht'].mean()
ate = treatment_mean - control_mean

print(f"Control Group AHT: {control_mean:.2f} seconds")
print(f"Treatment Group AHT: {treatment_mean:.2f} seconds")
print(f"Estimated Impact (ATE): {ate:.2f} seconds")

t_stat, p_value = stats.ttest_ind(
    df_ab[df_ab['group'] == 'Treatment']['aht'],
    df_ab[df_ab['group'] == 'Control']['aht']
)
print(f"P-value: {p_value:.3f}")

2. Difference-in-Differences (DiD)

With the ideal scenario established, the real investigation began. I started digging through old project logs, looking for a “natural experiment” and found that the tool was enabled for our US agents a full month before our EU team. This was my opening.

  • When would you use this? When a feature or policy was rolled out to one group but not another, and you have data from before and after the rollout.
  • The Core Idea (In simple words): In our “smart reply” case, the staggered rollout between the US and EU teams was a perfect scenario for DiD. We used the EU team as a baseline to understand the seasonal trends affecting everyone. By comparing the US team’s improvement to the EU team’s change over the same period, we could isolate the tool’s true effect.
  • Key Assumption: The critical assumption is “parallel trends,” i.e., both groups were on a similar trajectory before the change happened. Always plot your data to visually check this (a plotting sketch follows the code block below).
Difference in Differences Visual Representation (Image by Author)
  • Code in Practice:
import statsmodels.formula.api as smf

#Simulating DiD data
np.random.seed(42)
df_did = pd.DataFrame({
    'unit': np.repeat(range(200), 10),
    'time': np.tile(range(10), 200),
    'treated': np.repeat(np.arange(200) < 100, 10)
})
df_did['post'] = (df_did['time'] >= 5).astype(int)
df_did['treat_post'] = df_did['treated'] * df_did['post']
df_did['aht'] = 300 - df_did['time']*2 + np.random.normal(0, 10, df_did.shape[0])
true_did_effect = -15
df_did.loc[df_did['treat_post'] == 1, 'aht'] += true_did_effect

#Fit the DiD regression model
model = smf.ols('aht ~ treated + post + treat_post', data=df_did)
results = model.fit()
print(results.summary().tables[1])
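
And the visual check of parallel trends promised above, as a quick matplotlib sketch on the simulated df_did (the pre-rollout lines should move roughly in parallel):

import matplotlib.pyplot as plt

# Average AHT per period for treated vs. control groups.
trends = df_did.groupby(['time', 'treated'])['aht'].mean().unstack()
trends.plot(marker='o')
plt.axvline(x=4.5, linestyle='--', color='grey')
plt.xlabel('Time period')
plt.ylabel('Average AHT (seconds)')
plt.title('Parallel trends check (dashed line = rollout)')
plt.show()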

3. Propensity Score Matching (PSM)

But the story didn’t end there. Within the US team, some agents used the tool constantly while others barely touched it. And, predictably, the heaviest users were often our most senior agents. This raised a new, trickier question: was the tool making the agents better, or were the better agents just more likely to use the tool? This is a classic challenge called selection bias, which stems from hidden confounding variables (like experience) that influence both tool adoption and performance.

  • When would you use this? When users can choose to opt-in to a feature, creating selection bias.
  • The Core Idea (In simple words): Think of it as a matchmaking service for your data. For every user who opted in, you find their “statistical twin” — a user who didn’t opt in but looked identical on key characteristics (like tenure or past performance).
  • Key Assumption: The main assumption is “unconfoundedness,” i.e., you have measured all the key variables that influence a user’s choice to opt in. This is a strong assumption.
Propensity Score Matching (PSM) Visual Representation

Method:

1. Estimate the propensity score for all units.

2. Match treated units to control units with similar propensity scores.

3. Estimate the treatment effect by comparing outcomes within the matched pairs.

  • Code in Practice:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

#Simulate data where experienced agents are more likely to adopt the tool
np.random.seed(42)
n_agents = 500
#Agent experience (confounder) affects both adoption and handle_time
experience = np.random.uniform(0.5, 5, n_agents)
#Probability of adopting the tool increases with experience
prob_adopt = 1 / (1 + np.exp(-(1 * experience - 2)))
adopted_tool = np.random.binomial(1, prob_adopt, n_agents)

#Base handle_time decreases with experience
base_handle_time = 350 - 15 * experience + np.random.normal(0, 20, n_agents)
true_effect = -20
#Final handle time includes the tool's true effect
handle_time = base_handle_time + adopted_tool * true_effect

df_psm = pd.DataFrame({'experience': experience, 'adopted_tool': adopted_tool, 'handle_time': handle_time})

#Estimate propensity scores
ps_model = LogisticRegression().fit(df_psm[['experience']], df_psm['adopted_tool'])
df_psm['propensity_score'] = ps_model.predict_proba(df_psm[['experience']])[:, 1]

#Match agents
treated = df_psm[df_psm['adopted_tool'] == 1]
control = df_psm[df_psm['adopted_tool'] == 0]
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]

#Estimate effect
att_estimate = (treated['handle_time'].mean() - matched_control['handle_time'].mean())
print(f"Estimated ATT via PSM: {att_estimate:.2f} seconds")

4. Double Machine Learning (DML)

This investigation was revealing that impact is rarely a single number; it’s a collection of stories. For our most complex problems, with hundreds of variables about agents and customers, I turn to more modern methods.

  • When would you use this? When you have a complex problem with many potential confounding variables and suspect their relationships are non-linear.
  • The Core Idea (In simple words): DML uses two machine learning models to clean up the noise from confounders. One model predicts the outcome, and another predicts the treatment. It then cleverly estimates the causal effect on the information that’s left over.

Method:

1. Predict the outcome Y from covariates X (using ML).

2. Predict the treatment T from covariates X (using ML — like propensity score).

3. Estimate the treatment effect using a model on the residuals from these predictions. This “orthogonalization” step makes the estimate less sensitive to small errors in the ML models, and it typically involves cross-fitting to prevent overfitting (a manual sketch of this step follows the code below).

  • Code in Practice:
#!pip install econml -- uncomment this line and run if econml package is not installed.

from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor

#Simulate complex data for support tickets
np.random.seed(42)
n_agents = 1000
#Confounders (X)
experience = np.random.uniform(1, 10, n_agents)
ticket_priority = np.random.choice([1, 2, 3], n_agents)
#Treatment (T): using the smart reply tool
used_smart_reply = np.random.binomial(1, 1 / (1 + np.exp(-(0.5 * experience - 1))), n_agents)
#Outcome (Y): handle_time
true_effect = -25
handle_time = 350 - 10 * experience - 5 * ticket_priority + np.random.normal(0, 30, n_agents)
handle_time += used_smart_reply * true_effect

X = pd.DataFrame({'experience': experience, 'ticket_priority': ticket_priority})
Y = handle_time
T = used_smart_reply

#Fit a DML model
#Use ML models to estimate the effect of T on Y, controlling for X
est = LinearDML(model_y=RandomForestRegressor(), model_t=RandomForestRegressor())
est.fit(Y, T, X=X)

#Get the estimated effect
treatment_effect_dml = est.effect(X)
print(f"\nEstimated DML Effect: {np.mean(treatment_effect_dml):.2f} seconds")

The Playbook: When to Choose Your Method

To summarize, here is a practical playbook for choosing the right tool for your problem:

Image by Author

Conclusion

Moving from correlation to causation is a critical step in maturing as a data scientist. While A/B testing remains the gold standard, it’s not always an option. Understanding observational methods like DiD, PSM, and DML equips you with a powerful toolbox to answer the “why” behind your data. By carefully choosing the right tool for your problem and being honest about its assumptions, you can deliver more credible, strategic, and valuable insights.

References

Excellent Python Packages:

Books That Shaped My Thinking:

  • Causal Inference: The Mixtape by Scott Cunningham.
  • Causal Inference: What If by Miguel Hernán and James Robins.

Thank you for reading. I’ve done my best to explain these concepts clearly and consistently, but if you spot any errors, I’d appreciate you letting me know.


Published via Towards AI

