Speeding Up A/B Tests with Discipline

Imagine this: you’ve just scoped a classic A/B test on checkout conversion. Your power analysis says you need 100k transactions in each cell to spot a 1% lift. At your current traffic, that means eleven weeks of waiting: longer than a quarter, longer than your patience, longer than your PM’s roadmap, longer than the runway to your next performance review.
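
Where does a number like that come from? A back-of-the-envelope two-proportion power calculation looks roughly like this; the baseline rate and lift below are placeholders for illustration, not the figures behind the example above.

```python
from scipy.stats import norm

def samples_per_cell(p_control, p_variant, alpha=0.05, power=0.8):
    """Per-cell sample size for a two-sided, two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_power = norm.ppf(power)           # critical value for the target power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return (z_alpha + z_power) ** 2 * variance / (p_control - p_variant) ** 2

# Placeholder rates: a 5% baseline and a 1% relative lift. Plug in your own numbers.
print(f"{samples_per_cell(0.050, 0.0505):,.0f} transactions per cell")
```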

What follows is a field guide for shrinking those eleven weeks without cheating. The ideas sit on three pillars—concurrent tests, proxy metrics, less noise—and a fourth, quieter pillar about decision discipline. We’ll keep our running example in sight, showing how every move chips away at the calendar.

I. Concurrent tests shall be the default

The first lever is obvious—show the test to more users—but the highway to scale runs through concurrency. Run several experiments at once instead of queuing them up.

“Won’t the tests collide?”

Microsoft’s 2023 study, A/B Interactions: A Call to Relax, scanned hundreds of overlapping experiments and found true interaction effects to be vanishingly rare. In practice, the drag from isolating every test is far worse than the risk of a quirky interaction. So let them run shoulder-to-shoulder and monitor after the fact: use automated detectors, such as Statsig’s interaction-effect detection, to do the bookkeeping for you.
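
If you’d rather sanity-check a specific pair of overlapping tests yourself, one generic approach (a sketch, not Statsig’s detector) is to regress the metric on both assignments plus their interaction and look at the interaction term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 20_000

# One row per user: the arm they saw in each of two concurrent experiments.
df = pd.DataFrame({
    "exp_a": rng.integers(0, 2, n),   # 0 = control, 1 = treatment in test A
    "exp_b": rng.integers(0, 2, n),   # 0 = control, 1 = treatment in test B
})
# Simulated conversions: each test adds a small lift, with no true interaction.
p = 0.05 + 0.005 * df.exp_a + 0.003 * df.exp_b
df["converted"] = rng.binomial(1, p)

# The exp_a:exp_b coefficient is the kind of thing an interaction detector looks at;
# a significant value flags a collision between the two tests.
fit = smf.ols("converted ~ exp_a * exp_b", data=df).fit()
print(fit.params["exp_a:exp_b"], fit.pvalues["exp_a:exp_b"])
```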

If you have a clear reason to keep certain tests mutually exclusive, you can still place them in *layers*, but remember that every new layer cuts your traffic into thinner slices.

II. Use proxies, not your KPIs

The second lever moves the metric, not the users. Revenue is slow because only a fraction of visitors buy, and spend amounts add noise of their own. A click on “Add to cart” may fire ten times as often and is tightly correlated with purchase, effectively increasing your sample size tenfold.

Look for proxy metrics that are logically aligned with your goal but deliver signal faster and earlier, then verify the link by running a correlation analysis on historical data (a sketch follows the list below). A proxy metric is useful when:

  • It sits up-funnel from the target outcome.

  • Historical data shows a stable correlation with the downstream KPI.

  • It is less susceptible to external shocks (holidays, marketing pulses).

  • It is less noisy.
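
Here is a minimal sketch of that historical check; the synthetic frame below stands in for whatever your warehouse exposes, with one row per past experiment:

```python
import numpy as np
import pandas as pd

# One row per past experiment (or per week of history): the proxy's observed lift
# and the KPI's observed lift. Replace this synthetic frame with your warehouse pull.
rng = np.random.default_rng(2)
kpi_lift = rng.normal(0.0, 0.02, 60)                                   # revenue lift
history = pd.DataFrame({
    "revenue_lift": kpi_lift,
    "add_to_cart_lift": kpi_lift * 1.2 + rng.normal(0, 0.01, 60),      # correlated proxy
})

corr = history["add_to_cart_lift"].corr(history["revenue_lift"])
sign_agreement = ((history["add_to_cart_lift"] > 0) == (history["revenue_lift"] > 0)).mean()
print(f"correlation: {corr:.2f}, sign agreement: {sign_agreement:.0%}")
# A proxy you can trust moves in the same direction as the KPI almost every time;
# one that often moves while revenue doesn't is a vanity metric.
```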

You don’t have to abandon the revenue metric (keep it as a guardrail), but base your ship/no-ship decision on the proxy.

III. Boost signal and reduce noise with thoughtful statistics

Low noise means narrower confidence intervals, which means fewer samples required. The classics still work:

1. Covariate adjustment (CUPED & CURE)

CUPED subtracts the variation explained by a pre-experiment covariate. Statsig’s CURE generalizes that trick with lasso-style regression on arbitrary covariate data, delivering up to a 40% variance cut, even for new users (https://www.statsig.com/blog/announcing-cure).

For practitioners, CUPED and CURE are extremely powerful because you sacrifice nothing: the estimators stay consistent, and a simple linear regression buys the variance reduction. Once you’ve applied CUPED, you’re over 80% of the way there; most other statistical methods force a bias-variance trade-off. Don’t get too fancy, because the integrity and trustworthiness of the experimentation program matter. If you use a novel method and the result flips, overall trust can evaporate very quickly.
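
CUPED itself is only a few lines. A minimal sketch, assuming each user’s pre-experiment value of the same metric (say, last month’s spend) is available as the covariate:

```python
import numpy as np

def cuped_adjust(y, x):
    """Return the CUPED-adjusted metric: y minus the part explained by covariate x.

    y: in-experiment metric per user; x: pre-experiment covariate per user.
    The adjustment has mean zero, so the effect estimate stays unbiased.
    """
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Synthetic example: compare variance before and after adjustment.
rng = np.random.default_rng(0)
pre = rng.gamma(2.0, 50.0, 10_000)                 # pre-experiment spend
post = 0.7 * pre + rng.gamma(2.0, 20.0, 10_000)    # in-experiment spend, correlated
adjusted = cuped_adjust(post, pre)
print(np.var(post) / np.var(adjusted))             # >1 means the adjustment cut variance
```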

2. Winsorization and thresholding

Heavy-tailed spend metrics crack traditional variance formulas. Trimming the top 0.1% or capping at, say, $1,000 per order tames the tails. This does alter the meaning of your data, so do it carefully and document the rule and your reasoning.
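
Both flavors are one-liners; the 0.1% trim and the $1,000 cap below simply mirror the examples above:

```python
import numpy as np

def winsorize_upper(values, quantile=0.999):
    """Cap everything above the given quantile (here the 99.9th percentile)."""
    cap = np.quantile(values, quantile)
    return np.minimum(values, cap)

def cap_at(values, ceiling=1_000.0):
    """Cap at a fixed business threshold, e.g. $1,000 per order."""
    return np.minimum(values, ceiling)

rng = np.random.default_rng(1)
spend = rng.lognormal(mean=3.0, sigma=1.5, size=100_000)   # heavy-tailed spend
print(np.var(spend), np.var(winsorize_upper(spend)))       # variance drops sharply
```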

3. Stratified sampling

Start with balanced groups rather than praying randomization evens out whales and drive-by users. Stratified assignment, now clickable in Statsig, gives you symmetry on day one. (https://www.statsig.com/blog/introducing-stratified-sampling)
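
Statsig makes this a checkbox; conceptually, stratified assignment just randomizes within each stratum, something like this generic sketch (not Statsig’s implementation):

```python
import numpy as np
import pandas as pd

def stratified_assign(users: pd.DataFrame, stratum_col: str, seed: int = 42):
    """Randomize within each stratum so both arms get the same mix of user types."""
    rng = np.random.default_rng(seed)
    assignment = pd.Series(index=users.index, dtype="object")
    for _, idx in users.groupby(stratum_col).groups.items():
        shuffled = rng.permutation(idx)
        half = len(shuffled) // 2
        assignment.loc[shuffled[:half]] = "control"
        assignment.loc[shuffled[half:]] = "treatment"
    return assignment

# Example strata: pre-experiment spend tiers (whales vs. regulars vs. drive-bys).
users = pd.DataFrame({"spend_tier": ["whale"] * 100 + ["regular"] * 4_000 + ["drive_by"] * 16_000})
users["arm"] = stratified_assign(users, "spend_tier")
print(users.groupby(["spend_tier", "arm"]).size())   # counts are balanced per tier
```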

Put together, these tricks cut your noise while keeping discipline, rigor, and interpretability intact. We’re statisticians, not magicians, but this is close.

IV. Adaptive Allocation & Fast Decisions

Contextual bandits for shallow tests

When the decision rule is binary (“pick the winner, kill the loser”) and rests purely on a quantitative signal such as conversion rate (CvR), a contextual multi-armed bandit (CMAB) keeps pushing traffic toward the current champion. It won’t help a pixel-perfect calibration study, but it’s dynamite for headline copy or color tests. Statsig’s AutoTune module is one-click if you want the machinery without the math.
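
AutoTune hides the machinery, but the core idea fits in a screenful. Here’s a bare-bones Thompson-sampling loop for a plain (non-contextual) bandit, the simplest member of the family, with made-up conversion rates:

```python
import numpy as np

rng = np.random.default_rng(3)
true_cvr = [0.040, 0.045, 0.052]          # unknown in real life; simulated here
alpha = np.ones(3)                         # Beta-posterior successes per arm
beta = np.ones(3)                          # Beta-posterior failures per arm

for _ in range(50_000):                    # one loop iteration per visitor
    sampled = rng.beta(alpha, beta)        # sample a plausible CvR for each arm
    arm = int(np.argmax(sampled))          # show the arm that looks best right now
    converted = rng.random() < true_cvr[arm]
    alpha[arm] += converted                # update that arm's posterior
    beta[arm] += 1 - converted

print("traffic per arm:", (alpha + beta - 2).astype(int))  # piles up on the winner
```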

Sequential testing

Peeking at your experiment is treated as a sin because it inflates the false-positive rate over repeated looks, but it’s unrealistic to wait until an experiment matures before looking at results. We want to peek. Methods like mSPRT keep your error guarantees intact under continuous monitoring while letting you stop early when the evidence is overwhelming.
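
To make that concrete, here’s a stripped-down mSPRT monitor for a difference in means with a normal mixing prior, in the spirit of the always-valid-inference literature (e.g. Johari et al.). It’s a sketch, not any vendor’s implementation, and the variance σ² and prior width τ² are assumptions you have to supply:

```python
import numpy as np

def msprt_likelihood_ratio(mean_diff, n, sigma2, tau2):
    """Mixture likelihood ratio for H0: true difference = 0, normal mixing prior."""
    scale = sigma2 / (sigma2 + n * tau2)
    return np.sqrt(scale) * np.exp(
        n**2 * tau2 * mean_diff**2 / (2 * sigma2 * (sigma2 + n * tau2))
    )

alpha = 0.05      # desired false-positive rate under continuous peeking
sigma2 = 1.0      # per-observation variance of the paired difference (assumed known)
tau2 = 0.01       # width of the mixing prior over plausible effects (a tuning choice)

rng = np.random.default_rng(5)
diffs = rng.normal(0.05, 1.0, 200_000)   # simulated treatment-minus-control differences

running_sum = 0.0
for n, d in enumerate(diffs, start=1):
    running_sum += d
    lr = msprt_likelihood_ratio(running_sum / n, n, sigma2, tau2)
    if lr >= 1 / alpha:                  # evidence is overwhelming: safe to stop early
        print(f"stopped at n={n}")
        break
```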

Benjamini–Hochberg for multiple metrics

Separately, the BH procedure controls the false-discovery rate when you report a wide dashboard of metrics at the final read-out. It’s cheap insurance against metric-mining, but it does not handle repeated looks; combine it with mSPRT and you’ve covered the two most common routes to p-hacking: peeking over time and shopping across metrics.
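
BH is a dozen lines by hand (or one call to statsmodels’ multipletests with method="fdr_bh"); here’s a sketch over a dashboard of final-readout p-values:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of which metrics survive BH at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * (np.arange(1, m + 1) / m)      # step-up thresholds k*q/m
    passed = p[order] <= thresholds
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True                          # reject the k smallest p-values
    return mask

# Final-readout p-values across a dashboard of metrics (illustrative numbers only).
pvals = [0.001, 0.008, 0.021, 0.049, 0.12, 0.37, 0.84]
print(benjamini_hochberg(pvals))
```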

Bayesian framing

An uninformative-prior (e.g. Jeffreys prior) Bayesian readout converts your data into a probability that “variant B is better.” Bring in an informative prior only when you have credible historical evidence, and be honest about it. Bayesian inference changes interpretations, not data, so wield it as a narrative tool, not a magic wand.
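
For a conversion metric, that readout is a Beta-Binomial computation. A sketch with a flat Beta(1, 1) prior; swap in Jeffreys’ Beta(0.5, 0.5) or an informative prior if you genuinely have the evidence:

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0), draws=200_000):
    """Monte Carlo estimate of P(CvR_B > CvR_A) under independent Beta posteriors."""
    rng = np.random.default_rng(11)
    a0, b0 = prior
    post_a = rng.beta(a0 + conv_a, b0 + n_a - conv_a, draws)
    post_b = rng.beta(a0 + conv_b, b0 + n_b - conv_b, draws)
    return (post_b > post_a).mean()

# Illustrative counts, not real results.
print(prob_b_beats_a(conv_a=4_950, n_a=100_000, conv_b=5_100, n_b=100_000))
```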

V. The Last Mile: Interpretation and Judgment

Even after the stats, we still have to think. Tom Cunningham summarizes our task as experimental data scientists: given the observed effects from an experiment, 1) infer the true effect; 2) extrapolate the long-term effect. His essay is mandatory reading in my opinion.

Talk to a handful of real users, scan session replays, and ask yourself why the variant moved the metric. A sample of ten conversations may reveal more nuance than ten million rows of logs.

Don’t bend the rules

Speed isn’t about cutting corners; it’s about clearing the path. Run more tests concurrently, listen to louder signals sooner, squeeze the noise out of your data, and keep your statistical house in order. Your future self—three weeks older instead of eleven—will thank you.
