This is achieved by adjusting the p-values and confidence intervals to account for the increase in false positive rates associated with continuous monitoring of experiments. Here we outline our approach to Sequential Testing and recommended best practices.
A common concern when running online A/B tests is the “peeking problem”, the notion that making early ship decisions as soon as statistically significant results are observed leads to inflated false positive rates. This stems from a tension between two aspects of online experimentation:
Ongoing metric updates
Unlike A/B tests conducted in fields like Psychology and Drug Testing, state-of-the-art online experimentation platforms use live data streams and can surface results immediately. These results can then be updated to reflect the most up-to-date insights as data collection continues. Naturally, we want to leverage this powerful capability to make the best decisions as early as possible.
Limitations of the underlying statistical test
In hypothesis testing, we accept a predetermined false positive rate, typically 5% (alpha = 0.05). When the p-value is less than 0.05, it’s common practice to reject the null hypothesis and attribute the observed effect to the treatment we’re testing. We do this knowing that there’s a 5% chance that a statistically significant result is actually just random noise.
However, ongoing monitoring while waiting for significance compounds the 5% false positive rate. Imagine you have a 20-sided die. If you roll it once, you’ll have a 5% (1 in 20) chance of getting a 1. But if you roll it every day for a week, the probability of getting a 1 at least once is much higher than 5%. In fact, you’ve now increased your chances to about 30% (1 − 0.95^7 ≈ 0.30).
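The die analogy can be checked with a few lines of arithmetic. The sketch below assumes each look is an independent 5% chance, exactly like a die roll; the function name is ours. In a real experiment, daily looks share cumulative data and are correlated, so the inflation is smaller than this figure, but it still exceeds the nominal 5%.

```python
def prob_at_least_one_false_positive(alpha: float, looks: int) -> float:
    """Chance of at least one false positive across `looks` independent checks."""
    # Probability that every single look stays below the threshold is (1 - alpha) ** looks.
    return 1.0 - (1.0 - alpha) ** looks


for looks in (1, 7, 14):
    rate = prob_at_least_one_false_positive(0.05, looks)
    print(f"{looks:>2} looks: {rate:.1%}")
```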
In Sequential Testing, the p-value computation changes in a way that mitigates the higher false positive rate associated with peeking. The goal is to enable early decisions without inflating that rate, by adjusting the significance threshold to raise the bar for what counts as a statistically significant result early on.
Different approaches exist for computing the adjusted p-values in Sequential Testing. At Statsig, we selected one that fits in with our A/B testing philosophy. It’s an adaptation of the Group-Sequential T-Tests for Two Means methodology described here, which meets the following requirements:
Simplicity: The calculation of Sequential Testing p-values is easy to explain and reproduce, and requires no additional setup from the end user. It’s based entirely on the number of days that have elapsed relative to the planned duration of the experiment.
No loss of power: When the target duration is reached, the Sequential Testing approach converges with the traditional A/B testing methodology. This means that if the experiment is carried out in full, we retain all of the statistical power expected based on the experiment setup.
Holistic Decision-Making: This goes hand-in-hand with the bullet point above. Our design is intended to provide high confidence when making early decisions in specific scenarios, such as a major regression in a key metric. We still expect that the majority of experiments will be carried out to completion and that decisions will be based on the full scorecard of primary and secondary metrics.
Part of designing experiments includes setting a target duration upfront. This is the number of days needed to detect the desired effect size, assuming there’s a real effect. In practice, there tend to be several metrics of interest with different variances and effect sizes, which require different sample sizes and durations. We recommend selecting a duration that yields enough power for all key metrics.
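As a rough illustration of choosing a duration that covers all key metrics, here is a minimal sketch using the textbook two-sample sample-size formula. It assumes equal group sizes, a constant number of new users per group per day, and a known standard deviation for each metric. The function name, parameters, and example numbers are hypothetical, and this is not a description of any platform's power calculator.

```python
import math
from statistics import NormalDist


def required_days(sigma: float, mde: float, daily_users_per_group: int,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Days needed to detect an absolute difference `mde` with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    # Standard per-group sample size for comparing two means.
    n_per_group = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
    return math.ceil(n_per_group / daily_users_per_group)


# Hypothetical metrics as (standard deviation, minimum detectable effect).
metrics = {"conversion": (1.0, 0.05), "revenue": (2.0, 0.2)}
days = {name: required_days(s, d, daily_users_per_group=500)
        for name, (s, d) in metrics.items()}

# Pick the longest duration so every key metric is adequately powered.
target_duration = max(days.values())
print(days, target_duration)
```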
When “peeking” at an experiment before the completion date, the confidence intervals are expanded to reflect the higher uncertainty at this point in time. If the adjusted confidence interval crosses zero, it means there’s not enough data yet to make a decision based on this metric, even if the traditional p-value is stat-sig. The adjustment decreases as the experiment progresses and goes away when the target duration is reached.
To better understand the Sequential Testing progression over the course of an experiment, it helps to think in terms of the thresholds that determine whether an effect is significant or not. These are commonly referred to as efficacy boundaries, represented by the curved lines in the diagram below.
When the Z-score calculated for a metric delta is above the upper boundary, the effect is stat-sig positive. Conversely, a Z-score below the lower boundary signifies a negative stat-sig result. Early in the experiment, the efficacy boundaries are high. Intuitively, this means that a much higher significance threshold must be crossed in order to make an early decision, when the sample size is still small. The boundaries are adjusted with each passing day. At the end of the pre-determined duration, they reach the standard Z-score for the selected significance level (e.g., 1.96 for 2-sided tests with 95% confidence intervals).
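To make the shape of the boundaries concrete, the sketch below assumes an O'Brien-Fleming-style form: the standard critical value divided by the square root of the fraction of the planned duration completed. That form is an assumption chosen for illustration, since the text only says the boundaries start high and converge to the standard Z-score. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def efficacy_boundary(days_elapsed: int, target_days: int, alpha: float = 0.05) -> float:
    """Upper efficacy boundary on the Z-score; the lower boundary is its negative."""
    if days_elapsed <= 0 or target_days <= 0:
        raise ValueError("days_elapsed and target_days must be positive")
    # Fraction of the planned duration completed, capped at 1.
    fraction = min(days_elapsed / target_days, 1.0)
    z_standard = NormalDist().inv_cdf(1 - alpha / 2)
    # Assumed boundary shape: high early on, equal to z_standard at the end.
    return z_standard / math.sqrt(fraction)


for day in (1, 3, 7, 14):
    print(f"day {day:>2} of 14: |Z| must exceed {efficacy_boundary(day, 14):.2f}")
```

Under this assumption, the bar on day 7 of a 14-day experiment is about 2.77, and it reaches 1.96 on day 14.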
All of the Sequential Testing calculations we perform are based on a simple adjustment factor: the number of completed days divided by the total expected duration. Detailed equations for adjusted p-values and confidence intervals can be found here.
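The following is one way such an adjustment could look in code, not the linked equations themselves. It assumes the critical value is widened by dividing by the square root of the adjustment factor, and that the adjusted p-value is computed from the observed Z-score scaled by the same square root. All names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def adjusted_interval_and_p(delta: float, std_err: float, days_elapsed: int,
                            target_days: int, alpha: float = 0.05):
    """Return ((low, high), p_value) under the assumed square-root adjustment."""
    # Adjustment factor: completed days over planned duration, capped at 1.
    factor = min(days_elapsed / target_days, 1.0)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2) / math.sqrt(factor)
    half_width = z_crit * std_err
    # Shrinking the observed Z-score by sqrt(factor) is equivalent to raising the bar.
    z_adjusted = abs(delta / std_err) * math.sqrt(factor)
    p_value = 2 * (1 - STD_NORMAL.cdf(z_adjusted))
    return (delta - half_width, delta + half_width), p_value


# Same observed delta and standard error, viewed halfway and at the target duration.
for day in (7, 14):
    (low, high), p = adjusted_interval_and_p(1.0, 0.4, day, 14)
    print(f"day {day:>2}: CI = ({low:.2f}, {high:.2f}), adjusted p = {p:.4f}")
```

With these inputs the unadjusted Z-score is 2.5. Halfway through, the adjusted interval still crosses zero. At the target duration the adjustment vanishes and the result matches a standard two-sided test.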
The best use of Sequential Testing is often in combination with traditional, time-bound hypothesis testing. Similar to the ideas shared by Ronny Kohavi in this post, we recommend an approach that leverages both: Sequential Testing to identify regressions early and Traditional Hypothesis Testing for full statistical power across all metrics of interest.
When Early Decisions are Beneficial
While “peeking” is sometimes frowned upon, early monitoring of tests is actually critical to getting the most value out of an experimentation program. If an experiment introduces a measurable regression, there’s no reason to wait until the end to take action. With sequential testing, we can readily distinguish between statistical noise and strong effects that are significant early on.
Another use-case for Sequential Testing is when there’s an opportunity cost to running the experiment for its full duration. For example, when withholding an improvement from a subset of users comes at significant engineering or business cost, or when ending an experiment unblocks the path for further tests.
Don’t Forget About Guardrails
It’s exciting to see a goal metric with a stat-sig effect early on. A word of caution before making an early decision: While one metric may have crossed the efficacy boundary, other metrics that appear neutral may be stat-sig at the end of the experiment. The efficacy boundary is helpful in identifying stat-sig results early, but doesn’t distinguish between no true effect and insufficient power before the target duration is reached.
Account for Weekly Seasonality
Even when all metrics of interest look great early on, it’s often advisable to wait at least 7 full days before making a decision. This is because many metrics are subject to weekly seasonality: the end users of a product often behave differently depending on the day of the week.
If a good estimate of the effect size is important, consider running the experiment to completion. For one, Sequential Testing adjusted confidence intervals are broader, so the range of likely values is larger when making an early decision (lower precision). Additionally, a larger measured effect is more likely to be statistically significant early on, even if the true effect is actually smaller. Routinely making early decisions based on positive stat-sig results could lead to systematically overestimating the impact of launched experiments (lower accuracy).
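The overestimation effect can be seen in a small simulation. This sketch draws many effect estimates around a fixed true effect at an early look and keeps only those that clear an early significance bar. The bar uses an assumed boundary (standard critical value divided by the square root of the fraction of the duration completed), and all names and numbers are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def early_winner_estimates(true_effect: float, se_at_look: float, fraction: float,
                           sims: int = 20000, alpha: float = 0.05, seed: int = 0):
    """Simulated estimates that are stat-sig positive at a single early look."""
    rng = random.Random(seed)  # seeded for reproducibility
    z_crit = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(fraction)
    estimates = (rng.gauss(true_effect, se_at_look) for _ in range(sims))
    # Keep only the runs whose Z-score clears the early boundary.
    return [e for e in estimates if e / se_at_look > z_crit]


winners = early_winner_estimates(true_effect=0.1, se_at_look=0.058, fraction=3 / 14)
if winners:
    print(f"{len(winners)} early winners, mean estimate {mean(winners):.3f} vs true effect 0.1")
```

Every estimate that clears the early bar here is necessarily larger than the true effect. That is why averaging only over early winners overstates impact.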