Introducing Differential Impact Detection

Wed Jul 17 2024

Liz Obermaier

Data Scientist, Statsig

Differential Impact Detection

Statsig can now automatically surface heterogeneous treatment effects across your user properties.

In experimentation, “one size fits all” is not always true. Data scientists often look for actionable insights in cases where segments of users respond to an experiment in different ways. With Differential Impact Detection, Statsig makes it easier than ever to observe cases of Heterogeneous Treatment Effects.

What are Heterogeneous Treatment Effects and why do we care?

A Heterogeneous Treatment Effect (HTE) occurs when different sub-populations in an experiment respond to the same treatment in significantly different ways.

Example of Heterogeneous Treatment Effect
Example of Homogeneous Treatment Effect

How does our feature help solve this?

There is an intrinsic trade-off in analyzing HTE. On one hand, diving deeper into subgroups can yield actionable insights. On the other hand, dividing your user base into sub-populations cuts down your experimental power, so this isn’t the most useful way to read out experiments. It also opens the door to bias introduced by cherry-picking.

Cherry picking

However, automated HTE detection can be very useful to an advanced experimentation team. For experiments where HTE detection is enabled, we developed the following procedure:

  • Investigate the top sub-populations across each user property that you specify as a “Segment of Interest”

  • For each primary metric in the experiment, determine if any sub-population has a different response to treatment

  • Automatically surface a visualization of metrics sliced by user segments where one or more sub-population behaves significantly differently from the rest of the population

  • Apply a Bonferroni correction to control for multiple comparisons (see Implementation Details at the end)

In short, this feature automatically surfaces HTE based on the user properties you deem interesting and presents the corrected p-value. What to do with this information is your decision.
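
To make the flow concrete, here is a minimal sketch of that procedure in Python. It is illustrative only, not Statsig’s internal implementation; `compare_segment_to_rest` is a hypothetical placeholder for whatever statistical comparison you plug in.

```python
# Illustrative sketch of the detection loop described above; not Statsig's
# internal code. `compare_segment_to_rest` is a hypothetical callback that
# returns a raw (uncorrected) p-value for one (property value, metric) pair.
from typing import Callable


def surface_hte(
    segments_of_interest: dict[str, list[str]],  # user property -> observed values
    primary_metrics: list[str],
    compare_segment_to_rest: Callable[[str, str, str], float],
) -> list[tuple[str, str, str, float]]:
    # Total number of comparisons, used for the Bonferroni correction.
    n = sum(len(v) for v in segments_of_interest.values()) * len(primary_metrics)
    flagged = []
    for prop, values in segments_of_interest.items():
        for value in values:
            for metric in primary_metrics:
                p = compare_segment_to_rest(prop, value, metric)
                # 0.05 / n is the "some likelihood" threshold; 0.01 / n is "high".
                if p < 0.05 / n:
                    flagged.append((prop, value, metric, p))
    return flagged
```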

Understanding HTE in more depth

There are two main reasons why one treatment would have different effects on different groups of people:

1 - Different groups of people have different reactions to the same treatment.

For example, the addition of a new and improved tutorial might be very helpful for new users but have no impact on tenured users.

Often these user properties are either self-identified by users (like language or country) or reflect their usage patterns before being enrolled in the experiment (like tenure, number of comments, or usage segments).

2 - The treatment may manifest differently in different use cases.

For example, there might be a bug in the test arm of the experience that only happens for Firefox users.

Any specific surface could have a bug, introduce a performance degradation, or display your product differently, such that the user experience is meaningfully different from the experience on other surfaces in the same treatment group. Often this occurs on a particular device, browser, OS, or version, all of which can be configured as Segments of Interest to automatically detect differences in behavior.

Getting Started on Statsig

You can set up your “Segments of Interest” at the project level; these are the user properties where heterogeneous treatment effects are automatically surfaced. We have two levels of alerting, “high likelihood” and “some likelihood,” which have different sensitivities.

Segments of Interest

Implementation Details

We use a Welch’s t-test to compare the average treatment effect for a particular user property value to the average treatment effect for all other users. We do this iteratively for each user property vs the rest of the population, so highly correlated user properties will be surfaced independently. Since the methodology is straightforward, understanding how and why a certain segment has been flagged is relatively easy.

We use a Bonferroni correction on the number of evaluations we’re making (n) to determine a p-value threshold for high likelihood (p < 0.01/n) or some likelihood (p < 0.05/n) that there are heterogeneous effects in a given user property. We do this so that this alert doesn’t become too noisy and prone to false positives when making multiple comparisons.
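
The correction itself is just a division of the significance thresholds by the number of comparisons. Here is a small sketch of how the two alert levels could be assigned (illustrative, not Statsig’s code):

```python
# Illustrative sketch of the Bonferroni-corrected alert levels described
# above. `n_comparisons` is the total number of segment/metric tests run.
def hte_alert_level(p_value: float, n_comparisons: int) -> str | None:
    if p_value < 0.01 / n_comparisons:
        return "high likelihood"
    if p_value < 0.05 / n_comparisons:
        return "some likelihood"
    return None


# With 20 comparisons, a raw p-value must be below 0.05 / 20 = 0.0025 to
# trigger even the "some likelihood" alert.
print(hte_alert_level(0.002, 20))  # -> "some likelihood"
```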
