A/B tests fail most often before the first visitor even lands: the sample is wrong. Too small and the read swings around like a weather vane. Too big and you burn time and traffic for marginal gains. The trick is choosing a size that reflects real behavior, not noise, so decisions hold up when repeated.
This guide shows how to right‑size tests. Expect plain rules, quick checks, and links to deeper reads from CXL, HBR, and Statsig’s own perspectives.
A right-sized sample gives stable estimates instead of random swings. That means you judge practical lift, not chance. As the folks at CXL put it, statistical power is the core idea behind getting to confident, repeatable calls CXL.
Undersized tests inflate the Type II error rate: real wins slip by undetected. Missed lifts hurt engagement and revenue, which is why teams size tests around a minimum detectable effect, or MDE, and set clear tradeoffs up front Mida.
There are a few knobs that move sample size in predictable ways:
Higher baseline rates boost power for a given relative lift, reducing the users you need SplitMetrics.
A smaller MDE demands more users to see that subtle change PMC.
A stricter alpha threshold increases the sample required for the same power CXL.
If a result looks weak, do not guess. Check achieved power before calling it inconclusive, so you know whether the test could have found the lift you care about in the first place Statsig.
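Here is a minimal sketch of that check in Python, assuming a two-proportion comparison and statsmodels; the baseline, planned MDE, and per-arm counts are made-up numbers:

```python
# Achieved-power check for a finished two-proportion test (illustrative numbers).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # control conversion rate
mde_rel = 0.05           # the 5% relative lift you cared about up front
target = baseline * (1 + mde_rel)
n_per_arm = 8_000        # users each arm actually received

# Cohen's h effect size for the lift you intended to detect
effect = proportion_effectsize(target, baseline)

achieved_power = NormalIndPower().power(
    effect_size=effect,
    nobs1=n_per_arm,
    alpha=0.05,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Achieved power for the planned MDE: {achieved_power:.2f}")
# Well under 0.8 means the null read is weak evidence, not a verdict.
```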
Statistical power is the chance your test detects a real effect. Alpha controls false alarms; power controls misses. Aim near 80 percent power for most product decisions, and nudge higher only when the stakes are huge Mida, PMC.
What actually pushes power up or down:
Sample size: more users, more power SplitMetrics.
MDE: smaller effects require more data to detect CXL.
Alpha: stricter thresholds need larger samples for the same power Mida.
Base rate: higher baselines make detection easier Statsig.
Turn that into action. Set MDE and alpha first, then pick a power target. Estimate sample size with a trusted calculator like Statsig’s, which handles common A/B setups cleanly Statsig. If timelines choke, either raise the MDE or extend the run; both are valid, and CXL’s primer walks through the tradeoffs CXL.
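If you want a back-of-the-envelope cross-check of whatever calculator you use, the classic normal-approximation formula for two proportions is easy to run; the baseline and MDE below are placeholders:

```python
# Back-of-the-envelope users-per-arm for a two-proportion test (placeholder inputs).
from scipy.stats import norm

baseline = 0.10            # control conversion rate
mde_rel = 0.05             # minimum detectable effect, relative (5%)
alpha = 0.05               # two-sided false-positive rate
power = 0.80               # chance of catching a true lift of MDE size

p1 = baseline
p2 = baseline * (1 + mde_rel)

z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
z_power = norm.ppf(power)

# Standard normal-approximation formula for equal-sized arms
n_per_arm = ((z_alpha + z_power) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2
print(f"Users per arm: {n_per_arm:,.0f}")
```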
Lock a baseline and a realistic MDE
Pull the control rate from recent data and check seasonality before you commit. New to the basics of test design or power? The HBR refresher on A/B testing and CXL’s guide to power are helpful primers HBR, CXL.
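A tiny sketch of that lock-in step, turning a recent control rate and a relative MDE into the absolute rates a calculator expects; the counts are placeholders:

```python
# Turn a recent control rate and a relative MDE into absolute target rates (placeholders).
baseline = 2_450 / 24_800                 # conversions / users from recent control data
mde_rel = 0.05                            # smallest relative lift worth shipping

target = baseline * (1 + mde_rel)
print(f"Baseline {baseline:.3%} -> target {target:.3%} "
      f"(absolute MDE {target - baseline:.3%})")
```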
Set alpha, power, and the test side
Decide one-sided or two-sided based on the decision you plan to make. Defaults of 0.05 alpha and 0.8 power are standard unless risk suggests otherwise Mida, SplitMetrics. One-sided tests need fewer users, but only use them when a single direction matters for the decision.
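Here is a quick look at the user cost of that choice, using statsmodels with the same placeholder baseline and MDE as above:

```python
# How the test side changes users per arm (same placeholder baseline and MDE).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.105, 0.10)   # 5% relative lift on a 10% baseline
solver = NormalIndPower()

for side in ("two-sided", "larger"):          # "larger" = one-sided, lift only
    n = solver.solve_power(effect_size=effect, alpha=0.05, power=0.8,
                           ratio=1.0, alternative=side)
    print(f"{side:>9}: {n:,.0f} users per arm")
# One-sided needs fewer users, but a significant drop would be treated as a null.
```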
Match the method to the metric
Use a difference-in-means test for continuous metrics and a proportion test for conversion rates, so the method matches the decision metric. Skip Mann‑Whitney U for means; it answers a different question and often misleads in A/B work Analytics Toolkit.
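For the proportion case, a minimal two-proportion z-test sketch with statsmodels; the conversion counts are invented:

```python
# Two-proportion z-test on conversion counts (invented counts).
from statsmodels.stats.proportion import proportions_ztest

conversions = [820, 760]       # treatment, control
users = [8000, 8000]           # users per arm

z_stat, p_value = proportions_ztest(count=conversions, nobs=users,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```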
Run the math in a credible calculator
Use a standard formula or a trusted tool, then validate the inputs. Cross-check with a stats primer and Statsig’s walkthroughs to make sure the model aligns with your outcome and split PMC, Statsig how‑to, Statsig power overview.
Do quick sensitivity checks before launch
Here is what to poke at:
Shift MDE, alpha, and power; watch how the users-per-arm count moves (see the sweep sketch after this list).
Compare one-sided vs two-sided; only pick one-sided if a result in the wrong direction would not change your decision.
Recompute achieved power after a null; it clarifies whether the test was truly inconclusive CXL, r/AskStatistics.
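One way to run that sweep, assuming the same two-proportion setup and statsmodels; the grid of values is just an example:

```python
# Sensitivity sweep: how users per arm moves with MDE, alpha, and power.
from itertools import product
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10
solver = NormalIndPower()

for mde_rel, alpha, power in product((0.03, 0.05, 0.10), (0.05, 0.01), (0.8, 0.9)):
    effect = proportion_effectsize(baseline * (1 + mde_rel), baseline)
    n = solver.solve_power(effect_size=effect, alpha=alpha, power=power,
                           alternative="two-sided")
    print(f"MDE {mde_rel:>4.0%}  alpha {alpha}  power {power}: {n:>9,.0f} per arm")
```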
Pressure-test timelines against traffic
Estimate days to reach sample at your current split. If the plan stalls, cut variants or raise the MDE; both routes are common and practical Mida. Community threads often share useful calculators and feedback on edge cases r/datascience, r/statistics.
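A rough timeline check might look like this; the traffic, allocation, and per-arm target are made-up numbers:

```python
# Days to hit the sample target at current traffic and split (made-up numbers).
import math

n_per_arm = 57_000             # from the sizing step above
arms = 2
daily_eligible_users = 9_000
experiment_allocation = 0.50   # share of eligible traffic sent to the test

users_per_day = daily_eligible_users * experiment_allocation
days = math.ceil(n_per_arm * arms / users_per_day)
print(f"Roughly {days} days to fill {arms} arms of {n_per_arm:,} users each")
# If that blows past the decision deadline, raise the MDE or drop a variant.
```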
Every team hits traffic walls. When sample is tight, optimize for decisions you can stand behind, not theoretical perfection.
Here are the fastest levers:
Raise the MDE and chase bigger wins; larger target effects need fewer users at the same power CXL.
Extend the test; fast reads are nice, correct reads are better Mida.
Cut variants so each arm gets more traffic; HBR’s playbook backs this kind of focus HBR.
Stabilize variance with history where appropriate. Combine recent control data with current metrics using consistent definitions and windows, then recheck sample needs. This reduces noise and tightens estimates without magic tricks PMC, r/AskStatistics.
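One common way to do this is a CUPED-style adjustment, which uses each user's pre-experiment metric as a covariate; the sketch below assumes you have that pre-period value per user, and the data is simulated purely for illustration:

```python
# CUPED-style variance reduction (simulated data; assumes a pre-period metric per user).
import numpy as np

rng = np.random.default_rng(0)
pre = rng.gamma(2.0, 5.0, size=10_000)            # pre-experiment metric per user
post = pre * 0.8 + rng.normal(0, 3, size=10_000)  # in-experiment metric, correlated with pre

theta = np.cov(post, pre)[0, 1] / np.var(pre)     # coefficient of post regressed on pre
adjusted = post - theta * (pre - pre.mean())      # CUPED-adjusted metric

print(f"Variance before: {post.var():.1f}, after: {adjusted.var():.1f}")
# Lower variance means fewer users for the same MDE and power.
```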
A few advanced knobs also help:
Pick one-sided tests when only one direction would change the decision; you save users per arm.
Rebalance traffic toward control if risk is high, then account for the unequal split in the calculator (see the sketch after this list).
Set alpha and power intentionally; a lower alpha or higher power can be worth the extra days for high-stakes launches Statsig, SplitMetrics.
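Here is a sketch of sizing with an unequal split via the ratio argument in statsmodels; the 70/30 split toward control is just an example:

```python
# Sizing with an unequal split (e.g., 70% control / 30% treatment; example numbers).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.105, 0.10)
ratio = 30 / 70                      # treatment users per control user

n_control = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, ratio=ratio,
                                         alternative="two-sided")
n_treatment = n_control * ratio
print(f"Control: {n_control:,.0f}  Treatment: {n_treatment:,.0f}")
# Unequal splits need more total users than 50/50 at the same power.
```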
Validate the plan in a calculator that supports your setup, including unequal splits and multiple groups. Community threads highlight common gotchas and handy tools to sanity-check the math r/datascience, r/statistics. Statsig’s perspectives also walk through sample sizing end to end with clear examples Statsig.
The right sample size is a decision tool, not a vanity metric. Set the MDE and alpha first, target around 80 percent power, and validate the plan against your traffic and timeline. When in doubt, check achieved power before declaring a draw.
For a deeper dive, try these favorites: CXL’s guide to power CXL, HBR’s A/B refresher HBR, the open primer on power and effect size PMC, and Statsig’s posts on power and sample sizing with calculators you can use today Statsig power, Statsig sizing.
Hope you find this useful!