Embrace Overlapping A/B Tests and Avoid the Dangers of Isolating Experiments

Timothy Chan
Fri Oct 08 2021
AB-TESTING SOFTWARE-DEVELOPMENT ANALYTICS EXPERIMENTATION DATA-SCIENCE
It’s Laundry! Photo by engin akyurt on Unsplash

At Statsig, I’ve had the pleasure of meeting many experimentalists from different backgrounds. How to handle simultaneous experiments comes up frequently. Many experimentalists prefer isolating experiments to avoid collisions and maintain experimental integrity. Yet my experience as a Data Scientist at Facebook (a company that ran 10k+ simultaneous experiments) tells me this worry is overblown: isolation can seriously restrict a company’s pace of experimentation and innovation, while still producing bad decisions.

A lot of people say you should wash your whites and colors separately. But if you want to wash other kinds of laundry, you have to separate the whites, light colors, darks, different fabric types, delicates, and all permutations in between. Alternatively, if you just mix them up and wash them together, you could be done with your laundry quickly. You only have to trust that the detergent works.

I’m here to tell you that for A/B testing and overlapping experiments, the detergent works.

Managing Multiple Experiments

There are several strategies for managing multiple experiments:

  1. Sequential testing — Experiments are run sequentially, one after the other. You get the full power of your userbase, but it takes more time. Sometimes you have to delay the second experiment if the first hasn’t finished.
  2. Overlapping testing — Experiments are run simultaneously over the same set of users, so different users experience all combinations of test and control. This method maintains experimental power but can reduce accuracy (as I’ll explain later, this is not a major drawback).
  3. Isolated tests — Users are split into segments, and each user is enrolled in only one experiment. Your experimental power is reduced, but the accuracy of results is maintained.
  4. A/B/n testing — Multiple treatments are tested against a shared control in a single joint experiment. Experimental power is only slightly reduced because the control group is reused, but the experiments must be launched and concluded together.
  5. Multivariate testing — This is similar to overlapping testing, but the analysis compares all combinations of test and control against each other. This maximizes experimental power and accuracy but makes the analysis more complex. It does not scale well to 3 or more tests, which already means 8 (2³) variant combinations.
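To make the overlapping approach concrete, here’s a minimal sketch (with hypothetical names; not Statsig’s actual implementation) of how an assignment service can layer experiments: each experiment hashes the user ID with its own salt, so assignments across experiments are statistically independent and every combination of test and control occurs at its natural rate.

```python
import hashlib

def assign(user_id: str, experiment_name: str, variants=("control", "test")) -> str:
    """Deterministically assign a user to a variant. The experiment name acts
    as a salt, so assignments in different experiments are independent."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user lands in independent buckets for each overlapping experiment:
print(assign("user_42", "button_color"))
print(assign("user_42", "background_color"))
```

Because the assignment is a pure function of (experiment, user), a user always sees a consistent experience within each experiment, with no coordination needed between teams.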

There are tradeoffs between accuracy, experimental power (i.e. speed), and scalability. Isolated and overlapping testing are generally preferred and are quite scalable, allowing teams to run dozens of simultaneous experiments. But between the two there is a tradeoff between speed and accuracy. The heart of the issue is that overlapping experiments maximize experimental power for every experiment, but can introduce noise as multiple experimental effects push and pull on the metrics. Additionally, the test and control experiences aren’t as concretely defined and it’s less clear what you’re measuring; this makes interpreting the experiment a little more complicated.

Furthermore, as companies scale experimentation, they quickly encounter coordination challenges. Isolated testing inevitably requires that an experiment finish to free up users before a new test can be allocated and launched. This introduces a bottleneck where teams are forced to end some experiments prematurely while delaying the launch of others. It’s worth noting that with isolated experiments, the residual effect from a previous experiment can sometimes affect the results of the next (i.e. a hangover effect).

When to Isolate Experiments

Isolating experiments is a useful tool in experimentation. There are situations where it is critical:

  1. Overlap is Technically Impossible: It’s sometimes impossible to serve multiple overlapping experiences to a single user. For example, one cannot test new ranking algorithm A (vs. control) and new ranking algorithm B (vs. control) in overlapping tests: a ranking algorithm is typically the sole decider of ordering, so a user cannot experience A and B simultaneously.
  2. Strong Interaction Effects are Expected: Some experimental effects are non-linear, i.e. 1 + 1 ≠ 2. If effect A is +1% and effect B is +1%, combining them can sometimes lead to 0%, or +5%. Such effects can skew the readout if the experiments aren’t isolated. But as I’ll address later, isolating experiments to avoid non-linear effects can itself lead to wrong decisions.
  3. Breaking the User Experience: Some combinations of experiments can break the user experience and these should be avoided. For example, a test that moves the “Save File” Button to a menu, and another test that simplifies the UI by hiding the menu can combine to confuse users.
  4. A Precise Measurement is the Goal: Sometimes getting an accurate read on an experimental effect is critical, and it’s not enough to simply know whether an effect is good or bad. At YouTube, just knowing that ads negatively affect watch time is insufficient; accurately quantifying the tradeoff is vital to understanding the business.

Interaction Effects are Often Overblown

Running overlapping experiments can increase variance and skew results, but in my experience at Facebook, strong interaction effects are rare. It is more common to find smaller non-linear results. While these can skew the final numbers, it’s rare for two experiments to collide and produce outright misleading results. Effects are generally additive, which leads to clean “ship or no ship” decisions. Even when numbers are skewed, they are in the same ballpark and result in the same decisions; overlapping experiments can be trusted to generate directionally accurate results.
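To put “generally additive” in numbers, here’s a toy calculation with made-up lifts: the interaction term is simply the gap between the combined lift and the sum of the individual lifts, and a small interaction dampens the numbers without flipping the direction.

```python
# Hypothetical measured lifts (as fractions) from two overlapping experiments.
lift_a, lift_b = 0.010, 0.010   # each experiment alone reads +1%
combined = 0.018                # users in both test groups read +1.8%

# Under purely additive effects we'd expect lift_a + lift_b = +2.0%.
interaction = combined - (lift_a + lift_b)
print(f"interaction term: {interaction:+.3f}")  # -0.002 → mild dampening
```

Both experiments still read positive, so the ship decision is unchanged; only the magnitude is slightly off.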

In all honesty, since most companies are continuously shipping features, independent isolated tests can’t tell you the true combined effect anyway. Getting an accurate read is better left to long-term holdouts. Furthermore, interaction effects that lead to broken experiences are fairly predictable. Broken experiences and strong interaction effects typically occur at the surface level, e.g. a single web page or a feature within an app. At most companies these surfaces are usually controlled by a single team, and teams are generally aware of what experiments are running and planned. I have found that interaction effects are easily anticipated and can be managed at the team level.

Isolating Experiments Slows You Down

“Our success is a function of how many experiments we do per year, per month, per week, per day.”
— Jeff Bezos

It’s well known that companies like Facebook, Google, Microsoft, Netflix, Airbnb, and Spotify run thousands of simultaneous experiments. Their success and growth come from their speed of optimization and hypothesis testing. But even with billions of users, isolating 10k experiments would average only 100k users per experiment. And as I’ve described previously, these companies are hunting for sub-percent wins (<1%): 100k users is simply not enough. Suddenly their experimental superpower (pun intended) is dramatically reduced. How do these companies do it?

At Facebook, each team independently managed their own experiments and rarely coordinated across the company. Every user, ad, and piece of content was part of thousands of simultaneous experiments. This works because strong interaction effects are quite rare and typically occur at the feature level, where they’re managed by a single team. This lets teams increase their experimental power to hundreds of millions of users, ensuring experiments run quicker and features are optimized faster.

Overlapping experiments do sacrifice precision. However, last month’s precise experimental results have probably lost their precision by now. Seasonal effects come into play, and your users are dynamically changing from external forces. And if you’ve embraced rapid optimization, your product is also changing. In almost all cases, precise and accurate measurements are just an illusion.

For the vast majority of experiments, you only need to determine whether an experiment is “good” or “not good” for your users and business goals. Ship vs. no-ship decisions are binary, and directional data is sufficient. Getting an accurate read on an experimental effect is secondary. Most non-linear effects either dampen or accentuate your effects, but generally do not affect the direction.

The Fallacy of Controlling Every Effect

The secret power of randomized controlled experiments is that randomization controls for all known and unknown confounding effects. Just as randomization controls for demographic, seasonal, behavioral, and external effects, it also controls for the effects of other experiments. Attempting to isolate experiments for the sake of isolation is, in my opinion, a false sense of control. The following quote from Lukas Vermeer sums this up well.

“Consider this: your competitor is probably also running at least one test on his site, and a) his test(s) can have a similar “noise effect” on your test(s) and b) the fact that the competition is not standing still means you need to run at least as many tests as he is just to keep up.
The world is full of noise. Stop worrying about it and start running as fast as you can to outpace the rest!”
— Lukas Vermeer, Director of Experimentation at Vistaprint

As Lukas pointed out, your competitors are surely running experiments on mutual customers and worrying about these effects will only slow you down.
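If you want to convince yourself that randomization really does control for other experiments, a quick simulation (hypothetical numbers, standard library only) shows that a large concurrent experiment adds variance to your readout but doesn’t bias it: because B’s test and control users are evenly spread across A’s groups, B’s effect cancels out of A’s lift.

```python
import random
from statistics import mean

random.seed(0)
N = 100_000

# Two independent randomizations, as per-experiment salted hashing would give.
in_a_test = [random.random() < 0.5 for _ in range(N)]
in_b_test = [random.random() < 0.5 for _ in range(N)]

# Hypothetical ground truth: experiment A adds +1.0 to the metric, while a
# concurrent experiment B adds a large +3.0, on top of a noisy baseline.
metric = [random.gauss(100, 10) + 1.0 * a + 3.0 * b
          for a, b in zip(in_a_test, in_b_test)]

# Analyze A while ignoring B entirely.
a_test = mean(m for m, a in zip(metric, in_a_test) if a)
a_ctrl = mean(m for m, a in zip(metric, in_a_test) if not a)
print(f"measured lift for A: {a_test - a_ctrl:.2f}")  # close to the true +1.0
```

Experiment B behaves like any other source of noise here: it widens the confidence interval slightly but leaves A’s estimate centered on the truth.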

Embracing Interaction Effects in Rapid Experimentation

Hypothetical Button/Background Test with a bad interaction effect

I advocate for embracing interaction effects when doing rapid experimentation. Let’s say one team is running a blue/red button test, and another is running a blue/grey background test. There is a clear interaction effect here: a blue button on a blue background is a broken experience, and both experiments will show blue underperforming. Isolating these experiments would maintain the integrity of the results and produce a clean readout. But I would argue that if both teams then decide blue is best and ship blue, you end up with a disastrous result: despite great effort, you haven’t escaped the interaction effect… in fact you fell right into the trap. Meanwhile, if you had simply run overlapping experiments, you would have ended up with a better result, even if it’s suboptimal (both teams would have avoided blue).

Best Practices for Overlapping Experiments

Top companies with a fully integrated experimentation culture generally overlap experiments by default. This unleashes their full experimental power on every experiment, allowing them to rapidly iterate. This trades off against experimental accuracy and risk of collisions. The following best practices help minimize those risks:

  1. Avoiding Severe Experimental Collisions: Some amount of risk tolerance is needed, as over-worrying about experimental collisions severely dampens speed and scale. The truth is that most experimental collisions can be anticipated in advance and managed by small teams. Adding simple safeguards and team processes can minimize this risk without compromising speed. Useful tools include sharing the experimentation roadmap and having infrastructure that supports isolation when necessary. It can also be helpful to add monitoring that detects interaction effects.
  2. Prioritize Directionality over Accuracy: In most cases, experimentation’s primary goal is to reach a ship or no-ship decision, while measurement is purely secondary. It’s far more important to know that an experiment was “good” than whether revenue was lifted by 2.0% or 2.8%. Chasing accuracy can be quite misleading.
  3. Special Strategies when Precision Matters: In cases where precision matters, try alternative strategies. Long-term experiments (e.g. holdbacks) or multivariate experimentation are great for getting precise results while keeping teams moving.
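One simple form of monitoring for interaction effects is to compare experiment A’s lift among B’s test users against its lift among B’s control users; if the gap is large, the two experiments are likely colliding. This is a hypothetical sketch (made-up function names and threshold), not a description of any particular platform’s implementation.

```python
from statistics import mean

def interaction_check(rows, threshold=0.5):
    """Flag a potential collision between two overlapping experiments.

    `rows` is a list of (in_a_test, in_b_test, metric) tuples. We compute
    A's lift separately within B's test and control populations; a large
    gap between the two lifts suggests an interaction effect."""
    def lift(subset):
        test = [m for a, b, m in subset if a]
        ctrl = [m for a, b, m in subset if not a]
        return mean(test) - mean(ctrl)

    gap = lift([r for r in rows if r[1]]) - lift([r for r in rows if not r[1]])
    return abs(gap) > threshold, gap
```

The threshold is in metric units and would need tuning (or replacing with a proper statistical test) in practice; the point is that the four test/control cells already contain everything needed to spot a collision.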

Suggested Reading:

  1. Microsoft’s Online Controlled Experiments at Scale, https://exp-platform.com/Documents/2013%20controlledExperimentsAtScale.pdf
  2. How Google Conducts More, Better, Faster Experiments, https://medium.datadriveninvestor.com/how-google-conducts-more-better-faster-experiments-3b91446cd3b5
  3. Can You Run Multiple A/B Tests at the Same Time, https://cxl.com/blog/can-you-run-multiple-ab-tests-at-the-same-time/
  4. Large Scale Experimentation at Spotify, https://www.infoq.com/news/2016/12/large-experimentation-spotify/
  5. Running Multiple A/B Tests at the Same Time: Do’s and Don’ts, https://blog.analytics-toolkit.com/2017/running-multiple-concurrent-ab-tests/

