How to Use Only 1 Metric in AB Tests

Last Updated on November 4, 2024 by Editorial Team

Author(s): Pavel Zapolskii

Originally published on Towards AI.

The Most Important Number, or About the Main Product Metric

Imagine a product you use every day: an online store, a streaming service, or a game. How do companies improve it? What should they add or remove, and where should they invest more resources? In an enterprise setting, these questions are answered with a main product metric.

In this article, we will discuss what it is, how it relates to KPIs and other indicators, and how it makes intensive A/B testing possible.

Defining the Main Metric or Acceptance AB Metric

The main metric is a function that measures the key value of the product. It embodies the essence of the product and its impact on the business, regardless of the sector (e-commerce, fintech, or gamedev). For the metric to work, it must possess four important properties:

Quality: Indicates how good the product is for the customer. For example, how much time users spend on the site or how often they return (all components that make up retention).
Profitability: Don’t forget about money! The product should generate revenue, for example, through ad clicks. The metric should reflect this.
Measurability: This property matters most for large corporations that are building up their data expertise. The metric must be easy to measure so that A/B tests can be run and decisions made on it.
Interpretability: The indicator must be understandable not only to data specialists but also to the business side, so it should correlate with KPIs and financial reports. If the metric is difficult to explain, it may lead to incorrect decisions.

Problems with the Key Metric

It is important to understand that the integral metric is primarily relevant for enterprise companies. Due to this specificity, a number of problems can be expected:

Multiple Testing: When looking at several indicators simultaneously, the chance of a false positive in A/B testing increases.
Low Test Density: More experiments need to fit into the same traffic, especially when only simple variance-reduction techniques like CUPED are in use.
Unclear Effect in Conflicting Situations: For example, engagement increases but profitability decreases. What should we do?

Let’s break down the last situation with an example.

Suppose we have a website where we place advertisements. If we spam it entirely with these ad banners, the revenue metric might increase: people simply cannot avoid clicking on them. However, users who find this annoying will start leaving. As a result, in the long term, the product will lose out. Therefore, it is important to maintain a balance between the profitability of a particular medium and its quality.

The main metric is also useful for maintaining a balance between product quality and monetization during monitoring.

Imagine we rolled out a patch on July 14th, and it turned out to be bad: as a result, we spammed users with offers. This happened because we forgot to conduct an A/B test. On July 20th, we noticed something was wrong when we saw that our metric went out of the confidence interval.
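A check of this kind can be sketched as follows. This is only an illustration: the daily values, the 7-day window, the 1.96 multiplier, and the function name `out_of_band_days` are all assumptions, not details from the incident described above.

```python
from statistics import mean, stdev

def out_of_band_days(values, window=7, z=1.96):
    """Return indices of days whose value falls outside
    mean +/- z * stdev of the preceding `window` days."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center, spread = mean(history), stdev(history)
        if abs(values[i] - center) > z * spread:
            alerts.append(i)
    return alerts

# Hypothetical daily metric: stable for ten days, then a sharp drop.
daily_metric = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 80, 79]
alerts = out_of_band_days(daily_metric)
print(alerts)
```

The band here is built from the trailing window only, so the first `window` days can never raise an alert.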

But you will agree that it would be impractical to monitor every indicator with confidence intervals and rolling windows: random alerts would be unavoidable.

So even for monitoring, what is needed is a single sensitive integral metric.
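To see why monitoring many indicators at once produces random alerts, here is a small calculation. It assumes the indicators are independent and each uses a 95% confidence interval. Real metrics are usually correlated, so treat the numbers as a rough illustration. The function name is hypothetical.

```python
def prob_any_false_alert(n_metrics, alpha=0.05):
    """Chance that at least one of n independent metrics leaves its
    (1 - alpha) confidence interval purely by chance."""
    return 1 - (1 - alpha) ** n_metrics

for k in (1, 5, 20):
    print(k, round(prob_any_false_alert(k), 3))
```

Under these assumptions, one metric gives a false alert 5% of the time, while twenty metrics give at least one false alert well over half the time.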

In Search of a Metric | Why is it Bad to Look at GMV?

GMV (Gross Merchandise Value) is the total value of all goods sold on the platform within a unit of time.

If we choose GMV as the key metric, we can easily come up with solutions that yield short-term results but become strategically unfavorable.

For example, suppose we have a website that sells slippers 🩴. We can pile on many random offers — from different stores, with various snippets. Immediately after that, as we expect, people will start buying more because our assortment is richer. In the first month, we will thus increase GMV. However, it will then turn out that the slippers are of poor quality. Therefore, a second conversion for the same customer is unlikely 😞

A strategically sensible solution would be to improve product quality: setting up filtering and other quality checks. Yes, we will reduce GMV. But the user won’t encounter, for example, toys 🔞 on the slipper page.

In Search of the Main Metric | Q&A

What should the sensitivity of the main metric be?

0.5–0.8x the sensitivity of a click metric (i.e., sufficiently sensitive).

What distribution is necessary?

Means are distributed normally or log-normally.

How should the metric be related to ARPU, Retention, and DAU (i.e., financial and product indicators)?

The correlation should be above 0.7 over long time horizons, so that there is no situation where profit or engagement indicators rise while the key metric falls.
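The correlation requirement can be checked with a plain Pearson coefficient. The sketch below is a minimal illustration: the weekly series are invented, and the choice of Pearson correlation (rather than, say, rank correlation) is an assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical weekly values of the main metric and of ARPU.
north_star = [1.00, 1.02, 1.05, 1.03, 1.08, 1.10, 1.12, 1.15]
arpu = [5.0, 5.1, 5.3, 5.2, 5.4, 5.6, 5.6, 5.8]

r = pearson(north_star, arpu)
print(round(r, 3), r > 0.7)
```

The same check would be repeated for Retention and DAU, ideally over a period long enough to cover several releases.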

Formulation and Training of the Main Metric

How do you find this main metric? It can be seen as a machine learning problem. We take many small metrics (clicks, time on site, transactions, and so on) and try to create a single function from them that is sensitive to changes in the product. For this, a classic method is used: linear regression.
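One way to write a linear-regression metric of this kind is shown below. The notation is assumed for illustration and is not taken from the article: m_i are the k component metrics (clicks, time on site, transactions, and so on), w_i are the weights learned by the regression, and b is an intercept.

```latex
\text{Metric} = b + \sum_{i=1}^{k} w_i \, m_i
```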

Such a sensitive metric will be called the NorthStar ✨✨✨

How to Test and Train NorthStar?

Using a dataset of experiments. Usually, it consists of 20% AA tests, 30% improving tests, and 50% degrading tests. I will explain the latter two below.

Improving Test: A test where an improvement in all key product metrics is clearly visible or a situation with average indicators at the start but high results after the release (for example, if a feature fundamentally improves the product).
Degrading Test: An experiment where we artificially remove a feature from production for a focus group to detect a decline in key indicators. For example, we reduce the quality of page loading or degrade the quality of ML models. Then we observe how dissatisfied the user becomes 😡.
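Training on such a dataset can be sketched with ordinary least squares. Everything specific here is an assumption made for the example: the synthetic data, the three components, the label encoding (+1 improving, -1 degrading, 0 for AA), and the use of `numpy.linalg.lstsq` as the fitting routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset of 100 experiments in the proportions given above:
# 20% AA tests (label 0), 30% improving (+1), 50% degrading (-1).
labels = np.array([0] * 20 + [1] * 30 + [-1] * 50, dtype=float)

# One column per component: relative change in clicks, time on site,
# transactions. The effect sizes and noise level are invented.
assumed_shift = np.array([0.02, 0.01, 0.03])
deltas = labels[:, None] * assumed_shift + rng.normal(0, 0.005, size=(100, 3))

# Ordinary least squares: find an intercept and weights such that
# the weighted sum of component changes is close to the label.
X = np.column_stack([np.ones(len(labels)), deltas])
coef, *_ = np.linalg.lstsq(X, labels, rcond=None)
intercept, weights = coef[0], coef[1:]

# Share of non-AA experiments where the fitted metric has the right sign.
fitted = X @ coef
mask = labels != 0
agreement = float(np.mean(np.sign(fitted[mask]) == labels[mask]))
print(weights, agreement)
```

The agreement here is measured on the training data only, so it says nothing yet about how the metric behaves on experiments it has not seen.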

How to verify that the metric works?

1. Use Cross-Validation

Take 80% of the experiments for training and 20% for testing. Degrading tests should come out red (bad) and improving tests green (good). Also, you need to ensure that the Z-score value is low. The Z-score measures how many standard deviations a result lies from the average.
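The red/green colouring and the Z-score can be illustrated with a two-sample z-statistic. This is a sketch under assumptions: the per-user values are invented, the 1.96 threshold is a conventional choice, and reading "low Z-score" as something expected of AA tests is an interpretation rather than a statement from the text.

```python
from math import sqrt
from statistics import mean, variance

def z_score(control, treatment):
    """Two-sample z-score for the difference in means (treatment - control)."""
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    return (mean(treatment) - mean(control)) / se

def verdict(control, treatment, threshold=1.96):
    """Colour an experiment by the sign and size of its z-score."""
    z = z_score(control, treatment)
    if z > threshold:
        return "green"
    if z < -threshold:
        return "red"
    return "grey"

# Hypothetical per-user values of the main metric.
control = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
improved = [x + 3 for x in control]              # improving test
degraded = [x - 3 for x in control]              # degrading test
aa_group = [11, 10, 10, 9, 12, 9, 10, 10, 8, 11] # AA test, same mean

print(verdict(control, improved), verdict(control, degraded),
      verdict(control, aa_group))
```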

2. Ensure the Metric is Linked to Real Business Indicators

For example, revenue or key performance indicators (KPIs). Imagine you tested the metric and everything looks excellent, but a few months later it turns out that the tests were actually moving in the opposite direction of the company’s goals 😱 Therefore, it is important that the metric correlates well with business indicators.

3. Eliminate the Risk of Overfitting

The metric should not depend too much on a single parameter. Stability is crucial: with slight changes in parameters, the key indicator should fluctuate minimally.

4. Configure Predictability (Optional)

This is an additional indicator — for ML geniuses in large corporations.

There is a method that helps improve the accuracy of A/B tests using predictors, that is, indicators that help predict the test outcome. It is important to check how well the metric can be predicted. This matters in synthetic testing, where it is not possible to clearly divide clients into different groups and we would use causal inference techniques to predict the impact of changes.
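CUPED, mentioned earlier, is one well-known method of this kind: it uses a pre-experiment covariate as the predictor and subtracts the part of the metric that the covariate explains. Whether CUPED is the exact method meant here is an assumption. The sketch below uses invented per-user values and a hypothetical function name.

```python
from statistics import mean, variance

def cuped_adjust(metric, covariate):
    """Return metric - theta * (covariate - mean(covariate)),
    with theta = cov(metric, covariate) / var(covariate)."""
    m_bar, c_bar = mean(metric), mean(covariate)
    cov = sum((m - m_bar) * (c - c_bar)
              for m, c in zip(metric, covariate)) / (len(metric) - 1)
    theta = cov / variance(covariate)
    return [m - theta * (c - c_bar) for m, c in zip(metric, covariate)]

# Hypothetical per-user values: the metric before the experiment
# (the predictor) and the same metric during the experiment.
pre = [3.0, 5.0, 2.0, 8.0, 6.0, 4.0, 7.0, 1.0]
post = [3.5, 5.2, 2.4, 8.1, 6.3, 4.2, 7.4, 1.3]

adjusted = cuped_adjust(post, pre)
print(round(variance(post), 3), round(variance(adjusted), 3))
```

The adjustment leaves the mean unchanged and shrinks the variance, and the better the predictor, the larger the reduction. A metric that is hard to predict gains little from this step.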

But be prepared: even after completing all the steps, the metric may be difficult to predict. And this is a signal that something needs to be improved.

The Ideal Model for NorthStar

1. Based on Linear Regression.

2. All Components of the Metric (clicks, time, conversions) must be strictly positive or strictly negative. For example, if improving one metric (say, an increase in clicks) enhances product quality, it is a positive component. Mark it with a plus. If a person exits the app and does not continue the conversion action — minus.

3. The Less Correlated the Components of the Metric, the Better. To better formulate NorthStar, you need to cover the entire business space of your product with a blanket of various indicators. The more distant the corners our metric sees from each other, the more accurately it reflects reality.
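The sign and correlation requirements above can be turned into a simple pre-flight check. This is a sketch only: the component values, the 0.8 correlation cap, and the function name `check_components` are assumptions chosen for the example.

```python
import numpy as np

def check_components(component_matrix, signs, max_abs_corr=0.8):
    """component_matrix: rows are observations, columns are components.
    signs: +1 or -1 per component, the direction that counts as 'better'.
    Returns the largest absolute pairwise correlation and whether it
    stays within the cap."""
    if any(s not in (1, -1) for s in signs):
        raise ValueError("each component needs a strict +1 or -1 sign")
    corr = np.corrcoef(component_matrix, rowvar=False)
    off_diagonal = np.abs(corr[~np.eye(len(signs), dtype=bool)])
    worst = float(off_diagonal.max())
    return worst, worst <= max_abs_corr

# Hypothetical observations of three components.
clicks = [1, 2, 3, 4, 5, 6]
time_on_site = [2, 1, 4, 3, 6, 5]
exits = [3, 1, 2, 3, 1, 2]
components = np.column_stack([clicks, time_on_site, exits])

worst, ok = check_components(components, signs=[1, 1, -1])
print(round(worst, 3), ok)
```

In this invented data, clicks and time on site move together closely, so the check flags them as covering nearly the same corner of the product.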

When things are running smoothly and your team has time to spare, why not let them do it, and then hit the conference to obtain the glory?

Mr. Zapolskii

So, What’s the Result?

Using the main metric has several advantages.

Intensive A/B Testing: Its sensitivity allows for intensive A/B testing, increasing test density by 5–7 times.
Easy Interpretability: This is very important for business. It can lighten the load on the analytics team and give managers the ability to make decisions based on understandable data. Moreover, it is easy to draw conclusions about the reasons for the success or failure of a feature.
Protection from Incorrect Decisions: By choosing a working option for the long term instead of “spamming the feed with banners,” we protect ourselves from wrong decisions.

Important! To set up NorthStar, the company must have an analyst who is proficient in machine learning at an advanced level.

Implementation of the NS metric can be hindered by a poorly organized KPI that does not correlate with our key indicator.

Alternatives to the Component Metric, examples

GMV — for e-commerce. But be cautious with it (as discussed above).
A popular metric, for example, in media services is Total View Time — the total viewing time. It is good for understanding how much users like the content, but it is too susceptible to seasonality and does not always provide accurate results in tests.

🦸The main product metric is a superhero that always guards the quality and profitability of your product. It helps prevent the chase for money at the expense of the user and allows effective decision-making based on data. Find your main metric, balance quality and profitability, and move forward — towards success!
