How to conduct an A/A test (and interpret the results!)

Fri Dec 15 2023

Tyler VanHaren

Software Engineer, Statsig

Surprise—it’s all lime!

You may remember the famous Jello experiment from elementary school. It goes like this:

Students are each given a selection of different colored Jello cups. One at a time, the teacher asks them to eat one color, and then record what flavor it was.

Reds usually get labeled as strawberry or cherry. Green as lime. Blue as blue raspberry. And so on.

At the end of the experiment, the teacher reveals that it was actually a lesson in bias: all the Jello was the same flavor, just colored differently with food coloring. It’s usually a fun experiment. Being duped into eating Jello is an easy hoax to swallow, especially as a kid.

Ronny Kohavi, the former head of Microsoft’s experimentation team, does a similar version of this.

Ronny has been known to conduct a training exercise wherein he presents his group with the results of an experiment and asks everyone to make a decision on whether or not to ship the change evidenced in the experiment. Naturally, the group will read into the metrics and make compelling cases for or against.

Then he makes the big reveal: It was an A/A test all along, and both sides of the test were receiving the exact same treatment. No underlying metric lift at all.

Why A/A test

A/A testing is useful for checking whether your experimentation platform (and its underlying stats engine) is behaving as expected, for example by showing how often you’re seeing false positives.

It’s also useful for understanding your metrics better and for running more effective experiment reviews. What variance is implicit in your metrics? How correlated are they with one another? What changes are detectable?

How you should A/A test

You can run an A/A test either offline or online.

In an offline A/A test, you randomly split the user data, and analyze those two groups. In an online A/A test, you run a real test with production code and metrics, but both groups of users see the same experience.

Offline tests are nice because they can be fully automated (Statsig can do this for you using the data you send us) and give you a picture of how well-behaved your metrics are. Online tests, though, offer a much broader look at your experimentation platform—your assignment mechanism, logging of both exposures and metrics, and the underlying stats engine.
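As a rough illustration of the offline approach, here is a minimal sketch in Python. It assumes you already have one metric value per user; the data below is synthetic, and the function name `offline_aa_test` is made up for this example (it is not a Statsig API). The sketch shuffles users into two arbitrary halves and runs a plain two-sample z-test on the difference in means.

```python
import math
import random
from statistics import NormalDist, mean, variance


def offline_aa_test(values, seed=0):
    """Randomly split per-user metric values in two and run a two-sample z-test.

    Returns (difference in means, two-sided p-value).
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    group_a, group_b = shuffled[:half], shuffled[half:]

    diff = mean(group_b) - mean(group_a)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, p_value


# Hypothetical data: one metric value per user (synthetic, for illustration only).
data_rng = random.Random(42)
user_values = [data_rng.gauss(10.0, 3.0) for _ in range(2000)]

diff, p_value = offline_aa_test(user_values, seed=1)
print(f"difference in means: {diff:.4f}, p-value: {p_value:.3f}")
```

Re-running with different seeds gives a different arbitrary split each time, which is a cheap way to see how much a metric moves when nothing has changed.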

We recommend running several online tests (around 5 or 10) to gather more data. You should also be thinking about what metrics you want to measure as an experimentation program: Are they heavily skewed (revenue often is), or do they conform to a very nice normal distribution?

Interpreting A/A test results

One common misconception about A/A tests is that there will be no statistically significant results at all. After all, there was no difference between your groups.

However, any useful metric will have some variance, so when we average across an arbitrary cut, there will be some difference in means.

Looking at the results of your test, you’ll see confidence intervals representing the level of uncertainty in measuring those differences. A 95% confidence interval is constructed so that, over many repeated trials, the interval contains the true effect 95% of the time. This means that, 5% of the time, the true effect will fall outside the interval.

In an A/A test, where the true effect is zero, those cases show up as statistically significant results. They are known as false positives: the difference between the two groups appears statistically significant even though there is no true difference.

The expected false positive rate is dictated entirely by the significance level you choose: a 95% confidence interval corresponds to a significance level of 0.05, or a 5% false positive rate. If you’re using the industry-standard 95% confidence interval, you should always expect that 5% of the time, the true impact of your experiment is outside of that range. Therefore, even with a true 0% lift, the confidence interval will exclude 0 around 5% of the time, resulting in statistically significant results.
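To see this in action, the sketch below simulates a few thousand A/A tests and counts how often they come out significant. Everything here is an assumption for illustration: the metric is synthetic and normally distributed, both groups have 200 users, and the test is a simple two-sample z-test rather than any particular platform's stats engine.

```python
import math
import random
from statistics import NormalDist


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = math.fsum(xs) / n
    return m, math.fsum((x - m) ** 2 for x in xs) / (n - 1)


def aa_p_value(rng, n_per_group=200):
    """Simulate one A/A test: both groups are drawn from the same distribution."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    z = (mean_b - mean_a) / math.sqrt(var_a / n_per_group + var_b / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, n_tests=2000, seed=0):
    """Share of simulated A/A tests whose p-value falls below alpha."""
    rng = random.Random(seed)
    hits = sum(aa_p_value(rng) < alpha for _ in range(n_tests))
    return hits / n_tests


rate_05 = false_positive_rate(alpha=0.05)
rate_01 = false_positive_rate(alpha=0.01)
print(f"alpha=0.05 -> share significant: {rate_05:.3f}")
print(f"alpha=0.01 -> share significant: {rate_01:.3f}")
```

The share of significant results should hover near whichever significance level you pass in, which is the point: the threshold sets the false positive rate.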

Analyzing an A/A test shares many similarities with analyzing an A/B test, with the same underlying principles applying to both. As you collect more data points over the course of the experiment, you’ll generally receive a more accurate depiction of the true underlying effect.

However, it's important to remember that collecting more data points does not change your false positive rate. Note that a z-test’s assumptions only hold above a minimum sample size (roughly 100 users per group), so treat data collection as you would for a normal experiment.

Consider the analogy of flipping a coin. After 10 flips, seeing 7 or more heads is fairly plausible, with about a 17% chance of occurring. However, getting 70 or more heads in 100 flips is incredibly unlikely with a fair coin (around a 0.004% chance).

A more comparable outcome to 7 heads in 10 flips would be around 56 heads in 100 flips. The key takeaway is this: whether you're analyzing 100 or 1000 flips, if your threshold for significance is set to capture results occurring 5% of the time, you'll see those results 5% of the time. In other words, with a 95% confidence interval, about 5% of fair coins will appear biased, regardless of how many flips you chose to use.
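The coin-flip numbers can be checked directly from the binomial distribution. This small sketch reads "N heads" as "N or more heads" for a fair coin; the helper name `prob_at_least` is just for this example.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


print(f"P(>= 7 heads in 10 flips)   = {prob_at_least(7, 10):.4f}")
print(f"P(>= 70 heads in 100 flips) = {prob_at_least(70, 100):.6f}")
print(f"P(>= 56 heads in 100 flips) = {prob_at_least(56, 100):.4f}")
```

The first probability is exactly 176/1024, or about 17%. The third lands in the same general range, which is why 56 heads in 100 flips is the fairer comparison to 7 in 10.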

That doesn’t mean that there’s nothing to be gleaned from these results, though, especially if you’ve run many tests as suggested above. Results aggregated across multiple iterations deserve more confidence than those from just one big test. Across all those tests, the share of results that came out statistically significant should land close to that 5% mark.

Another interesting signal: if two metrics consistently show similar lifts across tests, they may be highly correlated. For example, a user who purchased more items is also likely to have added more items to their cart.

There are also a few reasons a particular metric may not behave well, and A/A tests can help catch and highlight these. For example, the sample mean of a highly skewed metric (such as revenue) needs more data to settle down than that of a binomial metric such as click-through rate. When you observe one of these metrics being statistically significant at an abnormally high rate in your tests, it’s worth considering whether to develop other proxy metrics for use in experiments.

Things to watch for

An online A/A test can reveal many issues you may experience in your full system.

You should validate that users are being assigned to test and control according to your intended split. We recommend running a chi-squared test on the number of users in each group to check that it lines up with the expected split (Statsig always runs one of these as a health check, visible under Diagnostics).

A failure here means either that assignment is not being split fairly or that assignment information is not making it back to your experimentation platform.
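For a two-group test, that chi-squared check is simple enough to sketch by hand. The function below is an illustrative assumption, not how Statsig implements its health check: it compares observed group sizes to the expected split with a one-degree-of-freedom chi-squared goodness-of-fit test. The user counts are made up.

```python
import math


def srm_p_value(control_users, test_users, expected_test_share=0.5):
    """Chi-squared goodness-of-fit test (1 degree of freedom) on group sizes.

    Returns (chi-squared statistic, p-value).
    """
    total = control_users + test_users
    expected_test = total * expected_test_share
    expected_control = total - expected_test
    chi2 = ((test_users - expected_test) ** 2 / expected_test
            + (control_users - expected_control) ** 2 / expected_control)
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for a 50/50 split.
chi2_ok, p_ok = srm_p_value(50_000, 50_300)
chi2_bad, p_bad = srm_p_value(50_000, 51_500)
print(f"50,000 vs 50,300: chi2={chi2_ok:.2f}, p={p_ok:.3f}")
print(f"50,000 vs 51,500: chi2={chi2_bad:.2f}, p={p_bad:.2g}")
```

A small imbalance like the first pair is well within chance, while the second pair produces a tiny p-value and would be worth investigating before trusting any metric results.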

In an A/A test, the observed p-values should be uniformly distributed between 0 and 1 for well-behaved metrics. In Trustworthy Online Controlled Experiments, the book he co-authored, Ronny Kohavi makes the fascinating observation that for an individual metric, a clumping of p-values around 0.32 can indicate that the metric is heavily impacted by outliers.
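One informal way to eyeball uniformity is to bucket the p-values from many A/A runs into deciles and check that the buckets are roughly equal. The sketch below does this with simulated A/A tests on a synthetic, normally distributed metric; in practice you would feed `decile_counts` the p-values from your own tests, and both function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulated_aa_p_values(n_tests=1000, n_per_group=200, seed=0):
    """Two-sided z-test p-values from simulated A/A tests on a synthetic metric."""
    rng = random.Random(seed)
    normal = NormalDist()
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        mean_a, mean_b = math.fsum(a) / n_per_group, math.fsum(b) / n_per_group
        var_a = math.fsum((x - mean_a) ** 2 for x in a) / (n_per_group - 1)
        var_b = math.fsum((x - mean_b) ** 2 for x in b) / (n_per_group - 1)
        z = (mean_b - mean_a) / math.sqrt((var_a + var_b) / n_per_group)
        p_values.append(2 * (1 - normal.cdf(abs(z))))
    return p_values


def decile_counts(p_values):
    """Count p-values in [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return counts


counts = decile_counts(simulated_aa_p_values())
for i, count in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {count}")
```

For a well-behaved metric, each bucket should hold about a tenth of the tests. A bucket that stands well above the others, such as the one containing 0.32, is the kind of clumping described above.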

Another sanity check that can help you feel comfortable that data is flowing end to end through your system is comparing the raw numbers from metrics in your test to those in other logging you already have in place. Does the number of users who were exposed to the test align with your existing expectations? Do metric values meet expectations?

Applying learnings to A/B tests

I find it useful to consider how the results of your A/A test inform what results you should be looking for in a real test.

If A/A tests frequently show lifts greater than 10% for a specific metric, question how reliably you can measure a change that is hypothesized to move that metric by only 0.5%. Changes in the rate of exceedingly rare events are much harder to measure, so consider other metrics that can serve as more sensitive indicators.

A/A tests, and their 5% of false positives, are also why we recommend keeping the list of metrics you expect to move short and to the point. If you select 20 metrics, something is likely to be statistically significant by chance alone, so keep the list of metrics you’d base a ship decision on short to avoid cherry-picking.
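To put a rough number on "likely": if we assume the metrics are independent (real metrics are often correlated, so treat this as a back-of-the-envelope figure), the chance that at least one of them is a false positive grows quickly with the length of the list.

```python
def prob_any_false_positive(n_metrics, alpha=0.05):
    """P(at least one false positive) across n independent metrics with no true effect."""
    return 1 - (1 - alpha) ** n_metrics


for n in (1, 5, 20):
    print(f"{n:>2} metrics: {prob_any_false_positive(n):.1%}")
```

Under that independence assumption, 20 metrics give roughly a 64% chance that at least one looks significant even when nothing changed.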

All in all, learn from Ronny’s exercise, and don’t be the person lauding the benefits of shipping an A/A test!
