Speeding up A/B tests with discipline

Tue Jun 24 2025

A/B testing can feel like a marathon rather than a speedrun if you’re not equipped with the right tools.

Imagine this: you’ve planned the perfect A/B test for a checkout conversion improvement, but you’ll need at least 400k transactions in each cell to spot a 1% lift. At your current traffic, that means eleven weeks of waiting to gather enough data. That’s the better part of a quarter, longer than your patience, longer than your roadmap, and uncomfortably close to your next performance review.

Without the right tools, experiments drag on for 6-11 weeks, bogged down by the need for large sample sizes and rigid statistical plans.

This post is a field guide for shrinking those eleven weeks: a toolbox of smarter statistical methods that can cut timelines from months to weeks.

Run tests concurrently by default

The first lever to speed up is obvious—show the test to more users—but the way to do it is through concurrency, not headcount. Run several experiments at once instead of queuing them up.

In case you're worried about interaction effects, Microsoft’s 2023 study A/B Interactions: A Call to Relax scanned hundreds of overlapping experiments and found true interaction effects to be super rare. In practice, the delay caused by isolating every test is far worse than the risk of a quirky interaction. So let your experiments run side-by-side and monitor your results after the fact. You can use automated detectors like Statsig’s interaction-effect detection to do the bookkeeping for you.
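If you’d rather sanity-check an overlap by hand alongside the automated tooling, one rough approach is a difference-in-differences test on the four overlapping cells. The sketch below assumes you already have per-cell means and standard errors for the shared metric; the numbers and names are placeholders, not Statsig’s implementation.

```python
import numpy as np
from scipy import stats

def interaction_z_test(means, ses):
    """Rough check for an interaction between two overlapping experiments.

    means/ses are dicts keyed by (exp_a_arm, exp_b_arm) holding the metric
    mean and its standard error in each of the four overlapping cells.
    Tests whether experiment A's lift differs depending on B's arm.
    """
    lift_in_b_control = means[("treat", "control")] - means[("control", "control")]
    lift_in_b_treat = means[("treat", "treat")] - means[("control", "treat")]

    # Difference-in-differences and its standard error (cells treated as independent)
    did = lift_in_b_treat - lift_in_b_control
    se = np.sqrt(sum(s ** 2 for s in ses.values()))

    z = did / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return did, p_value

# Hypothetical cell-level summaries: conversion rate and standard error per cell.
means = {("control", "control"): 0.100, ("treat", "control"): 0.112,
         ("control", "treat"): 0.103, ("treat", "treat"): 0.114}
ses = {cell: 0.004 for cell in means}
print(interaction_z_test(means, ses))  # a tiny p-value would hint at a real interaction
```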

If you genuinely need mutually exclusive tests, you can still place them in layers, but remember that every new layer cuts your traffic into thinner slices.

Use proxies, not your KPIs

The second lever is about moving the metric, not the users. Revenue is often treated as the ultimate metric, but it's painfully slow to move because only a fraction of visitors buy, and spend amounts add further noise. A proxy metric like "clicks on 'Add to cart'" fires ten times as often and is tightly correlated with purchase, effectively giving you ten times the sample size.

Look for proxy metrics that are logically aligned with your goal but deliver signal faster and earlier, then verify them by running a correlation analysis on historical data (a quick sketch of that check follows after the list). A proxy metric is useful when:

  • It sits up-funnel from the target outcome.

  • Historical data shows a stable correlation with the downstream KPI.

  • It is less susceptible to external shocks (holidays, marketing pulses).

  • It is less noisy.

You don’t have to abandon the revenue metric: keep it as a guardrail, but make the ship/no-ship decision on the proxy.
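To make the “stable correlation” check concrete, here’s a minimal sketch against hypothetical historical aggregates; the column names and numbers are made up, and in practice you’d run this over many weeks (or past experiments) rather than a handful of days.

```python
import pandas as pd

# Hypothetical daily aggregates; in practice, pull many weeks of history
# or unit-level data from past experiments.
history = pd.DataFrame({
    "add_to_cart_clicks": [1200, 1350, 990, 1430, 1510, 1100, 1280],
    "purchases":          [130,  142,  101, 155,  160,  115,  137],
})

# A stable, high correlation over a long window is the green light to decide
# on the proxy while keeping revenue as a guardrail.
corr = history["add_to_cart_clicks"].corr(history["purchases"])
print(f"Pearson correlation: {corr:.2f}")
```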

Boost signal and reduce noise with thoughtful statistics

Low noise means narrower confidence intervals, which means fewer samples are required. The classics still work:

Covariate adjustment (CUPED & CURE)

CUPED subtracts the variation explained by a pre-experiment covariate. Statsig’s CURE generalizes that trick with lasso-style regression on arbitrary covariate data, delivering up to a 40% variance cut, even for new users.

For practitioners, CUPED and CURE are extremely powerful because you sacrifice nothing (they remain consistent estimators) while cutting variance with simple linear regressions. In practice, CUPED alone gets you over 80% of the way there; most other statistical methods force a bias-variance trade-off.
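To show how little machinery CUPED needs, here’s a minimal sketch that assumes each unit has a pre-experiment value of the same metric as its covariate. Statsig computes this for you automatically, so treat it as an illustration rather than a reference implementation.

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: remove the variance in y explained by a pre-experiment covariate x.

    y: in-experiment metric per unit; x: pre-experiment covariate per unit.
    The adjusted metric has the same expected treatment effect but lower variance.
    """
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

rng = np.random.default_rng(0)
pre = rng.normal(100, 20, size=10_000)              # pre-period spend per user
post = 0.8 * pre + rng.normal(0, 10, size=10_000)   # correlated in-experiment spend

adjusted = cuped_adjust(post, pre)
print(np.var(post), np.var(adjusted))  # the adjusted variance is roughly 3-4x smaller
```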

That said, don’t get too fancy: the integrity and trustworthiness of the experimentation program are essential. If you use a novel method and the result flips, overall trust can evaporate very quickly.

Winsorization and thresholding

Heavy-tailed spend metrics wreck traditional variance formulas. Trimming the top 0.1% or capping at, say, $1,000 per order tames the tails. Do it carefully, and document the rule so nobody cries foul.
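A minimal sketch of both rules, on hypothetical order values (the $1,000 cap and the 99.9th-percentile option are just examples; pick and document your own threshold):

```python
import numpy as np

def winsorize(values, cap=None, upper_quantile=0.999):
    """Cap heavy-tailed spend at a fixed dollar amount or an upper quantile.

    Whichever rule you choose, apply it identically to both arms and write it down.
    """
    limit = cap if cap is not None else np.quantile(values, upper_quantile)
    return np.minimum(values, limit)

orders = np.array([12.0, 48.0, 25.0, 9000.0, 31.0, 18.0])  # hypothetical order values
print(winsorize(orders, cap=1_000.0))  # the $9,000 whale is capped at $1,000
```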

Stratified sampling

Start with balanced groups rather than praying randomization evens out whales and drive-by users. Stratified assignment, now available in Statsig, gives you symmetry on day one.
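For intuition, here’s a toy sketch of 50/50 assignment within strata; the segment labels are hypothetical, and Statsig’s stratified assignment handles the real-world details (hashing, persistence, imbalanced strata) for you.

```python
import numpy as np
import pandas as pd

def stratified_assign(users: pd.DataFrame, stratum_col: str, seed: int = 7) -> pd.DataFrame:
    """Split users 50/50 within each stratum so both arms start with the same mix,
    instead of hoping randomization evens out whales and drive-by users."""
    rng = np.random.default_rng(seed)
    users = users.copy()
    users["arm"] = "control"
    for _, idx in users.groupby(stratum_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        users.loc[shuffled[: len(shuffled) // 2], "arm"] = "treatment"
    return users

# Hypothetical user table; 'segment' could come from historical spend tiers.
users = pd.DataFrame({
    "user_id": range(8),
    "segment": ["whale", "regular", "regular", "drive-by",
                "whale", "drive-by", "regular", "regular"],
})
print(stratified_assign(users, "segment").sort_values("segment"))
```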

Put together, these tricks cut your noise while keeping discipline, rigor, and interpretability intact. We’re statisticians, not magicians, but this is close.

Adaptive allocation & fast decisions

Contextual bandits for shallow tests

When the decision rule is binary (“pick the winner, kill the loser”) and based purely on quantitative data such as conversion rate, a contextual multi-armed bandit (CMAB) keeps pushing traffic toward the current champion. It won’t help a pixel-perfect calibration study, but it’s dynamite for headline copy or color tests. Statsig’s AutoTune module gives you the machinery without the math, in one click.
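AutoTune does the heavy lifting in practice; to show the core mechanic, here’s a toy, non-contextual Thompson-sampling allocator on a binary conversion metric. The arm count, conversion rates, and traffic volume are all made up.

```python
import numpy as np

class ThompsonAllocator:
    """Toy Thompson-sampling allocator for a binary metric such as conversion rate.

    Not Statsig's AutoTune -- just the core idea: arms that convert better
    receive more traffic as evidence accumulates.
    """
    def __init__(self, n_arms: int, seed: int = 0):
        self.successes = np.ones(n_arms)  # Beta(1, 1) priors
        self.failures = np.ones(n_arms)
        self.rng = np.random.default_rng(seed)

    def choose_arm(self) -> int:
        samples = self.rng.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def record(self, arm: int, converted: bool) -> None:
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

# Simulate two headline variants with (made-up) true conversion rates of 9% and 11%.
true_cvr = [0.09, 0.11]
bandit = ThompsonAllocator(n_arms=2, seed=1)
for _ in range(20_000):
    arm = bandit.choose_arm()
    bandit.record(arm, bandit.rng.random() < true_cvr[arm])
print(bandit.successes + bandit.failures - 2)  # most traffic ends up on the better arm
```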

Sequential testing

Peeking at your experiment is treated as a sin because it inflates the false positive rate, but it’s unrealistic to only look at results after an experiment matures. We want to peek. Sequential methods like mSPRT keep the Type I error rate intact under continuous monitoring while letting you stop early when the evidence is overwhelming.
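For intuition, here’s a minimal sketch of an mSPRT on per-unit treatment-minus-control differences, assuming roughly normal noise, a known variance, and a hand-picked mixing variance tau². Production implementations (including Statsig’s) estimate the variance from the data and handle the details far more carefully.

```python
import numpy as np

def msprt_p_values(diffs, sigma2, tau2=0.1):
    """Always-valid p-values for a stream of per-unit (treatment - control) differences.

    diffs: paired differences observed so far; sigma2: variance of one difference;
    tau2: variance of the normal mixing distribution over the effect size.
    The p-value sequence only ever decreases, so you may peek at every step.
    """
    p, p_values = 1.0, []
    for n in range(1, len(diffs) + 1):
        mean_d = np.mean(diffs[:n])
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            (n ** 2 * tau2 * mean_d ** 2) / (2 * sigma2 * (sigma2 + n * tau2))
        )
        p = min(p, 1.0 / lam)
        p_values.append(p)
    return np.array(p_values)

rng = np.random.default_rng(3)
diffs = rng.normal(0.05, 1.0, size=5_000)  # small true lift, noisy units
p_seq = msprt_p_values(diffs, sigma2=1.0)
stop_at = int(np.argmax(p_seq < 0.05)) if (p_seq < 0.05).any() else None
print(stop_at)  # the first sample size at which you could safely stop
```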

Benjamini–Hochberg for multiple metrics

Separately, the Benjamini–Hochberg (BH) procedure controls the false-discovery rate when you report a wide dashboard of metrics at the final read-out. It’s cheap insurance against metric-mining, but it doesn’t handle repeated looks; combine it with mSPRT and you’ve covered the two most common flavors of p-hacking.
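The BH procedure itself fits in a few lines. This sketch takes a dashboard of final read-out p-values (hypothetical numbers) and flags which metrics survive FDR control at 5%:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of metrics to flag, controlling the FDR at alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest k such that the k-th smallest p-value is below k/m * alpha.
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    k = int(np.max(np.nonzero(passed)[0])) + 1 if passed.any() else 0
    significant = np.zeros(m, dtype=bool)
    significant[order[:k]] = True
    return significant

# Final read-out p-values for a dashboard of metrics (hypothetical numbers).
dashboard = [0.001, 0.012, 0.030, 0.048, 0.240, 0.700]
print(benjamini_hochberg(dashboard))  # only the first two survive at alpha = 0.05
```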

Bayesian framing

A Bayesian readout with a non-informative prior (e.g., a Jeffreys prior) converts your data into a probability that “variant B is better”. Bring an informative prior only when you have credible historical evidence, and be honest about it. Bayesian inference changes the interpretation, not the data, so wield it as a narrative tool, not a magic wand.
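With a Jeffreys prior and a binary metric, the posterior is a Beta distribution, so “probability that B is better” is one Monte Carlo simulation away. The counts below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical counts: conversions and exposures per arm.
conv_a, n_a = 480, 10_000   # control
conv_b, n_b = 540, 10_000   # variant B

# Jeffreys prior Beta(0.5, 0.5) -> Beta posterior for each conversion rate.
post_a = rng.beta(0.5 + conv_a, 0.5 + n_a - conv_a, size=200_000)
post_b = rng.beta(0.5 + conv_b, 0.5 + n_b - conv_b, size=200_000)

print(f"P(variant B is better) = {np.mean(post_b > post_a):.3f}")
```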

The last mile: Interpretation & judgment

Even after all the stats, we still have to think. Tom Cunningham summarizes our tasks as experimental data scientists: given the observed effects from an experiment, 1) infer the true effect, and 2) extrapolate the long-term effect. His essay is mandatory reading for serious product experimentation professionals.

Talk to a handful of real users, scan session replays, and ask yourself why the variant moved the metric. A sample of ten conversations may reveal more nuance than ten million rows of logs.

Don’t bend the rules

Speed isn’t about cutting corners; it’s about clearing the path. Run more tests concurrently, listen to louder signals sooner, squeeze the noise out of your data, and keep your statistical house in order. Your future self—three weeks older instead of eleven—will thank you.

Looking for a smarter way to ship?

Statsig combines enterprise-grade feature flags with your product metrics, helping you ship fast without breaking things.