Sequential Testing

Sequential Testing is a statistical method used in A/B testing. Traditional A/B testing practice dictates that experiment metrics be read out only once, when the experiment has reached its target sample size.

The reason for this rule is that continuous monitoring for the purpose of decision-making inflates the false positive rate, a phenomenon known as the peeking problem. P-values fluctuate and are likely to drop in and out of significance by random chance alone, even when there is no real effect, so every additional look is another opportunity to stop on a spurious result.
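The peeking problem can be illustrated with a small simulation. The sketch below is not taken from any particular platform: the function names, the 14-day duration, and the traffic numbers are all assumptions chosen for illustration. It runs A/A experiments (no real effect), tests once per day with an unadjusted z-test, and compares how often a result is ever significant on some day against how often it is significant at the single final readout.

```python
import math
import random


def two_sided_p_value(mean_a, mean_b, n_per_group):
    """Two-sided z-test p-value for a difference in means (unit variance assumed)."""
    std_error = math.sqrt(2.0 / n_per_group)
    z = (mean_b - mean_a) / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_false_positive_rates(n_experiments=1000, days=14, users_per_day=50,
                                  alpha=0.05, seed=0):
    """Return (peeking_rate, single_look_rate) for simulated A/A experiments.

    All defaults are hypothetical values picked for illustration.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    single_look_hits = 0
    for _ in range(n_experiments):
        sum_a = sum_b = 0.0
        n = 0
        crossed_on_any_day = False
        p = 1.0
        for _day in range(days):
            # Both groups draw from the same distribution: there is no true effect.
            for _user in range(users_per_day):
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
            n += users_per_day
            p = two_sided_p_value(sum_a / n, sum_b / n, n)
            if p < alpha:
                crossed_on_any_day = True  # a "peeker" would have stopped here
        peeking_hits += crossed_on_any_day
        single_look_hits += p < alpha  # only the final readout counts
    return peeking_hits / n_experiments, single_look_hits / n_experiments


if __name__ == "__main__":
    peeking_rate, single_look_rate = simulate_false_positive_rates()
    print(f"false positive rate with daily peeking: {peeking_rate:.3f}")
    print(f"false positive rate with one readout:   {single_look_rate:.3f}")
```

Under these assumptions the single-readout rate should stay close to alpha, while the daily-peeking rate comes out noticeably higher, since any day's chance fluctuation is enough to trigger a decision.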

Sequential Testing addresses this issue by adjusting the p-values for each preliminary analysis window to compensate for the increased false positive rate associated with peeking. The goal is to enable early decision-making when there's sufficient evidence while limiting the risk of false positives.

Key Features of Sequential Testing

  1. Early Decision Making: Sequential Testing allows for early decision-making when there's sufficient evidence, while limiting the risk of false positives. This is particularly valuable in cases where there are unexpected regressions or significant opportunity costs associated with delaying the experiment decision.

  2. Mitigating the Peeking Problem: Sequential Testing compensates for the inflated false positive rate that arises when an experiment is monitored continuously and decisions are made on interim results (the "peeking problem").

  3. Adjustment Factor: Sequential Testing applies an adjustment factor determined by the number of days the experiment has been running. Once the target duration is reached, no further adjustment is applied.
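To make the adjustment factor concrete, here is a minimal sketch of one way such a factor could behave. The formula is an assumption made for illustration and is not confirmed as the one any platform uses: it scales the standard critical value by the inverse square root of the fraction of the target duration elapsed, a shape loosely similar to O'Brien-Fleming-style boundaries. The function name `adjusted_z_critical` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def adjusted_z_critical(days_elapsed, target_days, alpha=0.05):
    """Critical z-value for an interim look (illustrative formula, an assumption).

    Early looks get a larger critical value, which makes significance harder
    to reach. At or beyond the target duration the standard value is returned,
    meaning no adjustment is applied.
    """
    if days_elapsed <= 0 or target_days <= 0:
        raise ValueError("days_elapsed and target_days must be positive")
    z_standard = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Fraction of the planned duration completed, capped at 1.0.
    fraction_complete = min(days_elapsed / target_days, 1.0)
    return z_standard / math.sqrt(fraction_complete)


if __name__ == "__main__":
    for day in (1, 7, 14):
        print(day, round(adjusted_z_critical(day, target_days=14), 3))
```

With this choice the threshold is strictest on day one and relaxes toward the standard critical value as the experiment approaches its target duration. A simple scaling like this is only a sketch; properly calibrated group-sequential boundaries are computed so that the overall false positive rate matches alpha.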

Examples of Sequential Testing

Sequential Testing is particularly valuable in cases such as:

  • Unexpected Regressions: Sometimes experiments have bugs or unintended consequences that severely impact key metrics. Sequential testing helps identify these regressions early and distinguishes significant effects from random fluctuations.

  • Opportunity Cost: This arises when delaying the experiment decision may incur a significant loss, for example when a new feature needs to launch ahead of a major event or a bug fix needs to ship. If sequential testing shows an improvement in the key metrics, the decision can be made early.

Interpreting Sequential Testing Results

When Sequential Testing is enabled, an adjustment is automatically applied to results calculated before the target completion date of the experiment. The solid bar is the standard confidence interval, computed without any adjustment. The dashed line is the expanded confidence interval that results from the adjustment. If the adjusted confidence interval overlaps zero, the metric delta is not statistically significant at the moment, and the experiment should continue as planned.
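The decision rule above can be sketched in a few lines. The helper names, the input numbers, and the adjusted critical value of 2.77 are hypothetical and serve only to illustrate the comparison; how the adjusted critical value is derived depends on the platform's methodology.

```python
from statistics import NormalDist


def confidence_interval(delta, std_error, z_critical):
    """Symmetric confidence interval around an observed metric delta."""
    half_width = z_critical * std_error
    return (delta - half_width, delta + half_width)


def overlaps_zero(interval):
    """True when zero lies inside the interval (endpoints included)."""
    low, high = interval
    return low <= 0.0 <= high


def interpret_interim_result(delta, std_error, adjusted_z, alpha=0.05):
    """Compare the standard and adjusted intervals for an interim readout."""
    standard_z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    standard = confidence_interval(delta, std_error, standard_z)
    adjusted = confidence_interval(delta, std_error, adjusted_z)
    return {
        "standard_interval": standard,   # the solid bar
        "adjusted_interval": adjusted,   # the dashed line
        # Only the adjusted interval should drive an early decision.
        "early_decision_supported": not overlaps_zero(adjusted),
    }


if __name__ == "__main__":
    # Hypothetical interim readout: delta of 2.0 with standard error 0.9.
    result = interpret_interim_result(delta=2.0, std_error=0.9, adjusted_z=2.77)
    print(result)
```

In this hypothetical readout the standard interval excludes zero but the wider adjusted interval does not, which is exactly the situation where the experiment should keep running rather than be called early.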

Sequential Testing Best Practices

Sequential Testing should be used in combination with traditional, time-bound hypothesis testing. It's recommended to use Sequential Testing to identify regressions early and traditional hypothesis testing for full statistical power across all metrics of interest. It's also important to remember that even when one metric has crossed the efficacy boundary, other metrics that currently appear neutral may still turn out to be statistically significant at the end of the experiment.
