Why A/B Testing Fails in Two-Sided Marketplaces (and How to Fix It with Switchback Testing)

Author(s): Swaroop

Originally published on Towards AI.

By Swaroop Hadke

1. Introduction: The Hidden Trap of Network Effects

Standard A/B testing is dangerous in two-sided marketplaces. If you treat a ride-hailing app or a delivery network like a standard e-commerce store, you aren’t just getting noisy data — you are often actively cannibalizing your own network.

In a typical A/B test (say, testing a “Buy Now” button color), we rely on SUTVA (the Stable Unit Treatment Value Assumption). This is the assumption that one user’s treatment assignment does not affect another user’s outcome: User A’s variant has no impact on User B’s behavior.

In a marketplace, SUTVA effectively does not hold.

Here is the scenario that breaks standard testing: You want to test a new dynamic pricing algorithm designed to lower prices and boost conversion.

  • Group A (Treatment) sees lower prices. They naturally flood the system with orders.
  • Group B (Control) sees normal prices.

Because your supply (drivers/couriers) is finite, Group A consumes all the available drivers. When Group B users try to book, they face longer wait times or “No Drivers Available” errors. Group B’s performance degrades artificially because of Group A’s activity.

If you look at the raw data, your algorithm looks like a massive success. In reality, you simply stole resources from your control group. This phenomenon is called Interference, and to solve it, we have to stop randomizing by User and start randomizing by Time.
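
To make the interference concrete, here is a minimal deterministic sketch. Every number in it (100 drivers, 100 users per arm, request rates of 0.5 and 0.7) and the helper `completed_per_user` are hypothetical, and it assumes drivers are shared between competing requests in proportion to demand. It is an illustration of the bias, not a model of any real marketplace.

```python
def completed_per_user(n_users, conversion, supply, competing_requests=0.0):
    """Expected completed rides per user when a fixed driver pool is shared."""
    requests = n_users * conversion
    total_requests = requests + competing_requests
    completed_total = min(total_requests, supply)  # supply caps completed rides
    share = requests / total_requests              # drivers split proportionally
    return completed_total * share / n_users

SUPPLY = 100          # drivers available per period (hypothetical)
CONTROL_CONV = 0.5    # request rate at normal prices (hypothetical)
TREATMENT_CONV = 0.7  # request rate at lower prices (hypothetical)

# User-level A/B split: 100 users per arm compete for the same drivers.
control_ab = completed_per_user(
    100, CONTROL_CONV, SUPPLY, competing_requests=100 * TREATMENT_CONV
)
treatment_ab = completed_per_user(
    100, TREATMENT_CONV, SUPPLY, competing_requests=100 * CONTROL_CONV
)
naive_lift = treatment_ab / control_ab - 1

# Global rollout: all 200 users see the same variant.
control_global = completed_per_user(200, CONTROL_CONV, SUPPLY)
treatment_global = completed_per_user(200, TREATMENT_CONV, SUPPLY)
true_lift = treatment_global / control_global - 1

print(f"Naive A/B lift:      {naive_lift:.0%}")
print(f"Global rollout lift: {true_lift:.0%}")
```

Under these assumptions the user-level split reports a lift of roughly 40%, while rolling either variant out to everyone completes the same number of rides, because supply is the binding constraint. The measured "lift" is entirely drivers taken from the control arm.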

This is where Switchback Testing becomes a requirement, not a luxury.

2. The Solution: Changing the Unit of Randomization

Switchback testing (or Time-Split testing) solves the interference problem by switching the entire marketplace between “Control” and “Treatment” at specific time intervals.

Instead of User A getting one price and User B getting another simultaneously, everyone in the marketplace gets the Treatment logic for a 30-minute window, and then everyone gets the Control logic for the next 30 minutes.

This clusters the interference within time blocks, allowing us to compare the performance of “Treatment Windows” against “Control Windows.”

Visualizing the Split

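Here is how the randomization logic differs: a user-level split assigns variants to individual users who are all active at the same time, while a switchback assigns the entire marketplace to a single variant for each time window.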

By isolating the variants in time, we ensure that the supply available during a “Control” period is (mostly) reflective of natural market conditions, not distorted by a simultaneous competing algorithm.
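
The time-based assignment described in this section can be sketched as a schedule with one row per window. The function name `build_switchback_schedule`, the start date, and the choice to randomize each window independently are assumptions for illustration; a strict Control/Treatment alternation is an equally valid design.

```python
import random
from datetime import datetime, timedelta

def build_switchback_schedule(start, days=14, window_minutes=30, seed=42):
    """Return one dict per time window with a randomly assigned variant."""
    rng = random.Random(seed)
    n_windows = days * 24 * 60 // window_minutes
    schedule = []
    for i in range(n_windows):
        schedule.append({
            "window_start": start + timedelta(minutes=i * window_minutes),
            # Assumption: each window is randomized independently.
            "variant": rng.choice(["Control", "Treatment"]),
        })
    return schedule

schedule = build_switchback_schedule(datetime(2024, 1, 1))
print(len(schedule))  # number of 30-minute windows in 14 days
print(schedule[0]["window_start"], schedule[1]["window_start"])
```

Every request that arrives inside a window inherits that window's variant, so the whole marketplace sees one pricing logic at a time. Wrapping the list in `pandas.DataFrame(schedule)` would give a frame with `window_start` and `variant` columns.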

3. The Build: Simulating a Marketplace in Python

Testing experimental frameworks on live production traffic is risky (and expensive). To demonstrate the approach without touching real traffic, I built a Python-based Experimentation Engine.

The core of this project is the MarketplaceSimulator class. It models the delicate balance of a two-sided economy:

  1. Drivers (Supply): Prefer higher earnings (Surge pricing).
  2. Users (Demand): Prefer lower costs.

The goal of the simulation was to test a “Surge Pricing” algorithm. The hypothesis? Increasing the price slightly would lower User Conversion, but increase Driver Acceptance enough to result in a higher overall Order Completion Rate (OCR).

Below is the core logic from my simulation engine. It demonstrates how the system “switches” behavior based on the time window variant.

# Simplified Logic from simulation.py
import numpy as np

# `schedule` (one row per time window) and `base_price` are assumed
# to be defined earlier in the simulation.
for _, row in schedule.iterrows():
    variant = row['variant']  # 'Control' or 'Treatment' based on time window

    # ... (Demand generation logic omitted for brevity) ...

    # The Core Trade-off Mechanism
    if variant == 'Treatment':
        # Treatment: Surge Pricing applied
        # Result: Users convert less, but Drivers accept more
        price = base_price * np.random.uniform(1.0, 1.2)
        driver_acceptance_prob = 0.85
        user_conversion_prob = 0.70
    else:
        # Control: Base Pricing
        # Result: Users convert more, but Drivers are pickier
        price = base_price
        driver_acceptance_prob = 0.75
        user_conversion_prob = 0.75

    # Determine the Outcome
    # A ride is only 'Completed' if BOTH sides agree
    driver_found = bool(np.random.random() < driver_acceptance_prob)
    user_accepted = bool(np.random.random() < user_conversion_prob)

    is_completed = bool(driver_found and user_accepted)

In this simulation, the Treatment creates a friction point for the user (price) to solve a friction point for the marketplace (supply availability).
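
A quick check of what the hard-coded probabilities imply: the two random draws in the snippet above are independent, so the expected Order Completion Rate for each variant is simply the product of the two probabilities. The dictionary layout below is just for illustration; the four numbers come from the snippet.

```python
# Probabilities taken from the simulation snippet above.
variants = {
    "Control":   {"driver_acceptance_prob": 0.75, "user_conversion_prob": 0.75},
    "Treatment": {"driver_acceptance_prob": 0.85, "user_conversion_prob": 0.70},
}

# A ride completes only if both independent draws succeed,
# so the expected OCR is the product of the two probabilities.
expected_ocr = {
    name: p["driver_acceptance_prob"] * p["user_conversion_prob"]
    for name, p in variants.items()
}
relative_lift = expected_ocr["Treatment"] / expected_ocr["Control"] - 1

for name, ocr in expected_ocr.items():
    print(f"{name}: expected OCR = {ocr:.4f}")
print(f"Expected relative lift = {relative_lift:.1%}")
```

Control works out to 0.75 × 0.75 = 0.5625 and Treatment to 0.85 × 0.70 = 0.595, a relative lift of a little under 6%. That is the effect the analysis should be able to recover from the simulated data.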

4. The Math: Why Aggregation is Key

If you run this simulation for two weeks, you might generate 10,000 individual ride requests.

  • 5,000 Treatment rides
  • 5,000 Control rides

A junior Data Scientist might be tempted to run a t-test on these 10,000 rows. This is statistically invalid.

Why? Because rides happening within the same 30-minute window are correlated with one another. If it starts raining at 9:15 AM, all rides in that window are affected. They are not independent samples, so a row-level test overstates the effective sample size and understates the standard error.

To calculate the p-value correctly, we must aggregate our data up to the Unit of Randomization, which in Switchback testing is the Time Window.

Instead of comparing 10,000 rides, we compare the per-window means across the N time windows (e.g., 14 days = 336 hours = 672 windows of 30 minutes).
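
The cost of ignoring this can be illustrated with an A/A simulation, where there is no treatment effect at all and any "significant" result is a false positive. This is a rough sketch under made-up parameters (100 windows, 50 rides per window, a shared per-window shock with standard deviation 0.15); the function `false_positive_rates` is hypothetical and not part of the project code.

```python
import numpy as np
from scipy import stats

def false_positive_rates(n_sims=200, n_windows=100, rides_per_window=50, seed=0):
    """Share of A/A experiments flagged significant at the 5% level."""
    rng = np.random.default_rng(seed)
    row_hits = 0
    window_hits = 0
    for _ in range(n_sims):
        # Each window gets a shared shock (weather, events) and a random variant.
        # There is NO treatment effect: both variants behave identically.
        window_prob = np.clip(0.6 + rng.normal(0, 0.15, n_windows), 0.05, 0.95)
        is_treatment = rng.random(n_windows) < 0.5
        completed = rng.random((n_windows, rides_per_window)) < window_prob[:, None]
        completed = completed.astype(float)

        # Row-level test: treats every ride as an independent sample.
        _, p_row = stats.ttest_ind(
            completed[is_treatment].ravel(),
            completed[~is_treatment].ravel(),
            equal_var=False,
        )
        # Window-level test: one observation per unit of randomization.
        window_ocr = completed.mean(axis=1)
        _, p_window = stats.ttest_ind(
            window_ocr[is_treatment], window_ocr[~is_treatment], equal_var=False
        )
        row_hits += int(p_row < 0.05)
        window_hits += int(p_window < 0.05)
    return row_hits / n_sims, window_hits / n_sims

row_fpr, window_fpr = false_positive_rates()
print(f"Row-level false positive rate:    {row_fpr:.2f}")
print(f"Window-level false positive rate: {window_fpr:.2f}")
```

With shocks of this size, the row-level test should flag far more than the nominal 5% of these null experiments, while the window-level test should stay close to 5%. The exact figures depend on the assumed shock size and the seed.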

The Analysis Implementation

Here is how I implemented the aggregation and statistical testing in analysis.py. Note how we group by window_start before running the test.

from scipy import stats

def analyze_experiment(df):
    # 1. CRITICAL: Aggregate metrics by the Window
    # (The Unit of Randomization)
    window_metrics = df.groupby(['window_start', 'variant']).agg({
        'request_id': 'count',
        'is_completed': 'sum',
        'order_value': 'sum'
    }).reset_index()

    # Calculate Ratios per window (e.g., Order Completion Rate)
    window_metrics['ocr'] = window_metrics['is_completed'] / window_metrics['request_id']

    # 2. Separate Groups
    control_windows = window_metrics[window_metrics['variant'] == 'Control']
    treatment_windows = window_metrics[window_metrics['variant'] == 'Treatment']

    # 3. Statistical Test (Welch's t-test on Window Means)
    # We test the means of the *windows*, not the individual rides.
    t_stat, p_value = stats.ttest_ind(
        treatment_windows['ocr'],
        control_windows['ocr'],
        equal_var=False
    )

    return p_value, window_metrics

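By averaging the metrics within each window, we remove the within-window correlation and test at the level at which variants were actually assigned. The window means are far closer to the independence that Welch’s t-test assumes, although adjacent windows can still be correlated (see the carryover discussion in Section 6).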
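
For reference, this is the standard Welch statistic written in terms of window-level quantities. The symbols are introduced here for illustration: $\bar{Y}_T$ and $\bar{Y}_C$ are the mean OCR across Treatment and Control windows, $s_T^2$ and $s_C^2$ are the sample variances of the per-window OCR, and $n_T$ and $n_C$ are the numbers of windows (not rides) in each group.

```latex
t = \frac{\bar{Y}_T - \bar{Y}_C}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}\right)^2}
{\dfrac{\left(s_T^2 / n_T\right)^2}{n_T - 1} + \dfrac{\left(s_C^2 / n_C\right)^2}{n_C - 1}}
```

Because $n_T$ and $n_C$ count windows, the standard error shrinks with the number of windows rather than the number of rides, which is why window length directly limits the power of the experiment.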

5. Visualizing the Results

To make these results accessible to stakeholders, the final step of the project was piping these metrics into a Streamlit dashboard.

The dashboard visualizes:

  1. Global Lift: The percentage improvement in Order Completion Rate (OCR) and Gross Merchandise Value (GMV).
  2. Significance: A clear indicator of whether the p-value is below 0.05.
  3. Time Series: A chart showing how the metrics fluctuated over the 14-day period.
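
The first two dashboard figures can be sketched as a small summary step over window-level values. The helper names `relative_lift` and `summarize`, the sample OCR values, and the p-value of 0.03 are all hypothetical inputs chosen for illustration; the same lift function applies to any per-window metric, including GMV.

```python
from statistics import mean

def relative_lift(treatment_values, control_values):
    """Relative lift of the treatment mean over the control mean."""
    control_mean = mean(control_values)
    return (mean(treatment_values) - control_mean) / control_mean

def summarize(treatment_ocr, control_ocr, p_value, alpha=0.05):
    """Bundle the headline numbers a dashboard would display."""
    return {
        "ocr_lift_pct": round(100 * relative_lift(treatment_ocr, control_ocr), 1),
        "significant": p_value < alpha,
    }

# Hypothetical per-window OCR values and p-value.
summary = summarize(
    treatment_ocr=[0.62, 0.58, 0.60],
    control_ocr=[0.55, 0.57, 0.56],
    p_value=0.03,
)
print(summary)
```

Computing lift from window means, rather than from pooled rides, keeps the headline number consistent with the unit the significance test uses.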

6. Advanced Considerations & Limitations

While Switchback testing solves the interference problem, it introduces a new one: Carryover Effects.

If you surge prices at 9:55 AM (Treatment) to attract drivers, those drivers are likely still in the area at 10:05 AM, when the system switches back to normal prices (Control). The “Control” window benefits from the supply accumulated by the “Treatment” window.

The Fix: Sophisticated experimentation platforms implement a “Burn-in” (or Washout) Period. When calculating metrics, we drop the data from the first 5–10 minutes of every window. This allows the marketplace state to reset and stabilize before we start measuring the impact of the new variant.
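
A burn-in filter can be sketched in a few lines of pandas. This assumes each ride row carries its own timestamp in a column called `request_time` (a hypothetical name) alongside the `window_start` column used in the analysis; the 10-minute default is one end of the range mentioned above, not a recommendation.

```python
import pandas as pd

def drop_burn_in(df, burn_in_minutes=10):
    """Drop rides requested during the first `burn_in_minutes` of their window."""
    elapsed = df["request_time"] - df["window_start"]
    return df[elapsed >= pd.Timedelta(minutes=burn_in_minutes)].copy()

# Hypothetical rides spread across two 30-minute windows.
rides = pd.DataFrame({
    "request_time": pd.to_datetime([
        "2024-01-01 10:02", "2024-01-01 10:12", "2024-01-01 10:29",
        "2024-01-01 10:35", "2024-01-01 10:45",
    ]),
    "window_start": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:00", "2024-01-01 10:00",
        "2024-01-01 10:30", "2024-01-01 10:30",
    ]),
    "variant": ["Treatment"] * 3 + ["Control"] * 2,
})

kept = drop_burn_in(rides)
print(kept[["request_time", "variant"]])
```

The filter would run before the window-level aggregation, so the per-window OCR is computed only from rides observed after the marketplace has had time to settle. The trade-off is fewer rides per window, which makes each window mean noisier.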

7. Conclusion: Moving Beyond “Move Fast and Break Things.”

Building this simulation highlighted a critical shift in how we approach data science in mature tech sectors. In the early days, simple predictive modeling and “move fast” A/B testing were enough to capture low-hanging fruit.

Today, the margin for error in marketplaces is razor-thin. Algorithms that manage pricing and dispatching are too interconnected to test with naive splitting methods. We need to move toward Causal Inference engines that respect the physics of supply and demand.

The Python code I’ve shared here — specifically the shift from row-level analysis to window-aggregated analysis — isn’t just a statistical safeguard; it’s a requirement for validity. If you can’t trust your Control group, you can’t trust your Lift. And in a business operating at scale, a false positive on a pricing algorithm isn’t just a bad experiment — it’s millions in lost revenue.

This article is based on a portfolio project demonstrating Causal Inference in Python. The full source code for the simulation and analysis engine is available on my GitHub.
