From A/B Testing to DoubleML: A Data Scientist’s Guide to Causal Inference

Last Updated on September 23, 2025 by Editorial Team

Author(s): Rohit Yadav

Originally published on Towards AI.

Image by Author

This article is a comprehensive guide to the most common causal inference techniques, complete with practical examples and code. While the scenarios are inspired by real-world use cases I have worked on, the data has been synthetically generated for clear and reproducible demonstrations.

Introduction: Why Causal Inference Matters

Imagine your team has launched a new “smart reply” feature for customer support, aimed at reducing the average handle time (AHT) for a ticket. A quarter later, the main dashboard is glowing: AHT has dropped by 18%. The team is celebrating, but a sharp question arises in the business review: “How do we know the feature caused this drop? Seasonality and other business changes also happened in that period.”

This is one of the most critical questions you can face. Answering it requires moving beyond correlation to establish causation. This article is a guide to the essential causal inference techniques you can use to answer these questions with confidence.

First, the Core Concepts

Before opening the toolbox, let’s understand the two fundamental challenges we’re up against:

1. The Counterfactual Problem

The core challenge of causal inference is that we can never observe what would have happened in an alternate universe. If a user gets a new feature and buys a product, we can never know whether they would have bought it anyway without the feature. This unobserved outcome is called the counterfactual. Our goal is to use data to create the most reasonable estimate of this counterfactual. The causal effect is the difference between what happened and what would have happened in the counterfactual scenario.

Representation of causal effect (Image by Author)
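
To make this concrete, here is a minimal sketch (a toy simulation of my own, not data from the case study): we generate both potential outcomes for each agent and then hide one, exactly the way reality does.

import numpy as np
import pandas as pd

np.random.seed(0)
n = 5
# Both potential outcomes exist conceptually, but only one is ever observed.
aht_without_tool = np.random.normal(300, 20, n)   # AHT if the agent never gets the tool
aht_with_tool = aht_without_tool - 25             # AHT for the same agent with the tool

treated = np.random.binomial(1, 0.5, n)
observed_aht = np.where(treated == 1, aht_with_tool, aht_without_tool)
counterfactual_aht = np.where(treated == 1, aht_without_tool, aht_with_tool)

df_po = pd.DataFrame({
    'treated': treated,
    'observed_aht': observed_aht.round(1),
    'counterfactual_aht': counterfactual_aht.round(1),  # never available in real data
})
print(df_po)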

2. Confounding Variables

When you see a feature and an outcome moving together, it’s tempting to connect them. But more often than not, there is a hidden third factor, called a confounding variable, influencing both. For example, more experienced agents might be more likely to adopt a new tool and also have lower handle times. Is the tool making them faster, or are faster agents just more likely to use the tool? Experience is the confounder. Our methods are designed to control for these hidden factors.

Correlation vs. Causation: The Trap

It’s very common to equate correlation with causation, but this is a classic pitfall we should all be careful of. An observed correlation between a treatment and an outcome doesn’t prove the treatment caused the outcome.
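
To see how badly the trap can bite, here is a small sketch (synthetic data of my own, with the tool’s true effect hard-coded to zero) where experience drives both adoption and handle time, so a naive comparison still “finds” an effect:

import numpy as np
import pandas as pd

np.random.seed(1)
n = 2000
experience = np.random.uniform(0.5, 5, n)
# Experience drives both adoption and handle time; the tool itself does nothing here.
prob_adopt = 1 / (1 + np.exp(-(experience - 2.5)))
adopted_tool = np.random.binomial(1, prob_adopt)
aht = 350 - 20 * experience + np.random.normal(0, 15, n)

df_conf = pd.DataFrame({'experience': experience, 'adopted_tool': adopted_tool, 'aht': aht})

# The naive adopter-vs-non-adopter gap looks like a real improvement...
adopter_mean = df_conf.loc[df_conf['adopted_tool'] == 1, 'aht'].mean()
non_adopter_mean = df_conf.loc[df_conf['adopted_tool'] == 0, 'aht'].mean()
print(f"Naive gap: {adopter_mean - non_adopter_mean:.1f} seconds (true causal effect: 0)")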

The Core Causal Inference Toolbox

My first step in any analysis like this is to establish the gold standard. In a perfect world, we would have run an A/B Test, or a Randomized Controlled Trial (RCT), from the very beginning.

1. A/B Testing (Randomized Controlled Trials)

  • When would you use this? Whenever you can. If you are launching a new feature, product, or campaign, this is the most reliable way to measure its true impact.
  • The Core Idea (In simple words): Think of it as a clean scientific experiment. You randomly assign users to a “treatment” group (who get the feature) or a “control” group (who don’t). This randomization washes out all the hidden factors, meaning any difference between the groups can be confidently attributed to your feature.
Randomized Control Trial for A/B Testing (Image by Author)
  • Code in Practice: Since we couldn’t turn back time, I used a simulation to show the product team what a clean A/B test would have looked like. This helped set the stage for the methods we could use.
import pandas as pd
import numpy as np
from scipy import stats

#Simulating a simple A/B test
np.random.seed(42)
n_agents = 500
group = np.random.choice(['Control', 'Treatment'], size=n_agents)
base_aht = 300
treatment_effect = -25
noise = np.random.normal(0, 50, n_agents)

aht = np.where(group == 'Control', base_aht + noise, base_aht + treatment_effect + noise)
df_ab = pd.DataFrame({'agent_id': range(n_agents), 'group': group, 'aht': aht})

#Analyzing the results
control_mean = df_ab[df_ab['group'] == 'Control']['aht'].mean()
treatment_mean = df_ab[df_ab['group'] == 'Treatment']['aht'].mean()
ate = treatment_mean - control_mean

print(f"Control Group AHT: {control_mean:.2f} seconds")
print(f"Treatment Group AHT: {treatment_mean:.2f} seconds")
print(f"Estimated Impact (ATE): {ate:.2f} seconds")

t_stat, p_value = stats.ttest_ind(
    df_ab[df_ab['group'] == 'Treatment']['aht'],
    df_ab[df_ab['group'] == 'Control']['aht']
)
print(f"P-value: {p_value:.3f}")

2. Difference-in-Differences (DiD)

With the ideal scenario established, the real investigation began. I started digging through old project logs, looking for a “natural experiment” and found that the tool was enabled for our US agents a full month before our EU team. This was my opening.

  • When would you use this? When a feature or policy was rolled out to one group but not another, and you have data from before and after the rollout.
  • The Core Idea (In simple words): In our “smart reply” case, the staggered rollout between the US and EU teams was a perfect scenario for DiD. We used the EU team as a baseline to understand the seasonal trends affecting everyone. By comparing the US team’s improvement to the EU team’s change over the same period, we could isolate the tool’s true effect.
  • Key Assumption: The critical assumption is “parallel trends,” i.e., both groups were on a similar trajectory before the change happened. Always plot your data to visually check this (a plotting sketch follows the code block below).
Difference in Differences Visual Representation (Image by Author)
  • Code in Practice:
import statsmodels.formula.api as smf

#Simulating DiD data
np.random.seed(42)
df_did = pd.DataFrame({
    'unit': np.repeat(range(200), 10),
    'time': np.tile(range(10), 200),
    'treated': np.repeat(np.arange(200) < 100, 10)
})
df_did['post'] = (df_did['time'] >= 5).astype(int)
df_did['treat_post'] = df_did['treated'] * df_did['post']
df_did['aht'] = 300 - df_did['time']*2 + np.random.normal(0, 10, df_did.shape[0])
true_did_effect = -15
df_did.loc[df_did['treat_post'] == 1, 'aht'] += true_did_effect

#Fit the DiD regression model
model = smf.ols('aht ~ treated + post + treat_post', data=df_did)
results = model.fit()
print(results.summary().tables[1])
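
And the visual check of parallel trends promised above, as a quick matplotlib sketch on the simulated df_did (the pre-rollout lines should move roughly in parallel):

import matplotlib.pyplot as plt

# Average AHT per period for treated vs. control groups.
trends = df_did.groupby(['time', 'treated'])['aht'].mean().unstack()
trends.plot(marker='o')
plt.axvline(x=4.5, linestyle='--', color='grey')
plt.xlabel('Time period')
plt.ylabel('Average AHT (seconds)')
plt.title('Parallel trends check (dashed line = rollout)')
plt.show()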

3. Propensity Score Matching (PSM)

But the story didn’t end there. Within the US team, some agents used the tool constantly while others barely touched it. And, predictably, the heaviest users were often our most senior agents. This raised a new, trickier question: was the tool making the agents better, or were the better agents just more likely to use the tool? This is a classic challenge called selection bias, which stems from hidden confounding variables (like experience) that influence both tool adoption and performance.

  • When would you use this? When users can choose to opt-in to a feature, creating selection bias.
  • The Core Idea (In simple words): Think of it as a matchmaking service for your data. For every user who opted in, you find their “statistical twin” — a user who didn’t opt in but looked identical on key characteristics (like tenure or past performance).
  • Key Assumption: The main assumption is “unconfoundedness,” i.e., you have measured all the key variables that influence a user’s choice to opt in. This is a strong assumption.
Propensity Score Matching (PSM) Visual Representation

Method:

1. Estimate the propensity score for all units.

2. Match treated units to control units with similar propensity scores.

3. Estimate the treatment effect by comparing outcomes within the matched pairs.

  • Code in Practice:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

#Simulate data where experienced agents are more likely to adopt the tool
np.random.seed(42)
n_agents = 500
#Agent experience (confounder) affects both adoption and handle_time
experience = np.random.uniform(0.5, 5, n_agents)
#Probability of adopting the tool increases with experience
prob_adopt = 1 / (1 + np.exp(-(1 * experience - 2)))
adopted_tool = np.random.binomial(1, prob_adopt, n_agents)

#Base handle_time decreases with experience
base_handle_time = 350 - 15 * experience + np.random.normal(0, 20, n_agents)
true_effect = -20
#Final handle time includes the tool's true effect
handle_time = base_handle_time + adopted_tool * true_effect

df_psm = pd.DataFrame({'experience': experience, 'adopted_tool': adopted_tool, 'handle_time': handle_time})

#Estimate propensity scores
ps_model = LogisticRegression().fit(df_psm[['experience']], df_psm['adopted_tool'])
df_psm['propensity_score'] = ps_model.predict_proba(df_psm[['experience']])[:, 1]

#Match agents
treated = df_psm[df_psm['adopted_tool'] == 1]
control = df_psm[df_psm['adopted_tool'] == 0]
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]

#Estimate effect
att_estimate = (treated['handle_time'].mean() - matched_control['handle_time'].mean())
print(f"Estimated ATT via PSM: {att_estimate:.2f} seconds")

4. Double Machine Learning (DML)

This investigation was revealing that impact is rarely a single number; it’s a collection of stories. For our most complex problems, with hundreds of variables about agents and customers, I turn to more modern methods.

  • When would you use this? When you have a complex problem with many potential confounding variables and suspect their relationships are non-linear.
  • The Core Idea (In simple words): DML uses two machine learning models to clean up the noise from confounders. One model predicts the outcome, and another predicts the treatment. It then cleverly estimates the causal effect on the information that’s left over.

Method:

1. Predict the outcome Y from covariates X (using ML).

2. Predict the treatment T from covariates X (using ML — like propensity score).

3. Estimate the treatment effect using a model on the residuals from these predictions. This “orthogonalization” step makes the estimate less sensitive to small errors in the ML models, and it typically involves cross-fitting to prevent overfitting (a manual sketch of this step follows the code below).

  • Code in Practice:
#!pip install econml -- uncomment this line and run if econml package is not installed.

from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor

#Simulate complex data for support tickets
np.random.seed(42)
n_agents = 1000
#Confounders (X)
experience = np.random.uniform(1, 10, n_agents)
ticket_priority = np.random.choice([1, 2, 3], n_agents)
#Treatment (T): using the smart reply tool
used_smart_reply = np.random.binomial(1, 1 / (1 + np.exp(-(0.5 * experience - 1))), n_agents)
#Outcome (Y): handle_time
true_effect = -25
handle_time = 350 - 10 * experience - 5 * ticket_priority + np.random.normal(0, 30, n_agents)
handle_time += used_smart_reply * true_effect

X = pd.DataFrame({'experience': experience, 'ticket_priority': ticket_priority})
Y = handle_time
T = used_smart_reply

#Fit a DML model
#Use ML models to estimate the effect of T on Y, controlling for X
est = LinearDML(model_y=RandomForestRegressor(), model_t=RandomForestRegressor())
est.fit(Y, T, X=X)

#Get the estimated effect
treatment_effect_dml = est.effect(X)
print(f"\nEstimated DML Effect: {np.mean(treatment_effect_dml):.2f} seconds")

The Playbook: When to Choose Your Method

To summarize, here is a practical playbook for choosing the right tool for your problem:

Image by Author

Conclusion

Moving from correlation to causation is a critical step in maturing as a data scientist. While A/B testing remains the gold standard, it’s not always an option. Understanding observational methods like DiD, PSM, and DML equips you with a powerful toolbox to answer the “why” behind your data. By carefully choosing the right tool for your problem and being honest about its assumptions, you can deliver more credible, strategic, and valuable insights.

References

Excellent Python Packages:

Books That Shaped My Thinking:

  • Causal Inference: The Mixtape by Scott Cunningham.
  • Causal Inference: What If by Miguel Hernán and James Robins.

Thank you for reading. I’ve done my best to explain these concepts clearly and consistently, but if you spot any errors, I’d appreciate you letting me know.


Published via Towards AI

