Carry-over effects: Previous test impacts

Mon Jun 23 2025

Ever run an A/B test where your results looked too good to be true? Or maybe they made absolutely no sense? There's a good chance you've stumbled into one of experimentation's trickiest problems: carry-over effects.

Here's the thing - when users experience one version of your product, that experience doesn't just vanish when they see the next version. It sticks around, influencing how they interact with whatever comes next. And if you're not careful, these lingering effects can completely wreck your experiment results.

Understanding carry-over effects in experiments

Let's get specific about what we're dealing with. Carry-over effects happen when exposure to one treatment changes how people respond to the next treatment. It's like trying to taste wine after eating something spicy - that first experience colors everything that follows.

This is especially problematic in within-subject designs, where the same users see multiple versions of your product. Think about it: if someone struggles through a confusing checkout flow in version A, they're going to approach version B differently. Maybe they're more cautious. Maybe they've learned workarounds. Either way, you're not getting a clean read on version B's actual performance.

The usual suspects behind carry-over effects are pretty predictable:

  • Learning: Users figure out your interface and get better at using it over time

  • Fatigue: They get tired or bored after multiple interactions

  • Adaptation: They change their behavior based on what they've already experienced

The cognitive testing community on Reddit has been wrestling with this forever. Take matrix reasoning tests - once you've solved a few puzzles, you start recognizing patterns. Your improved performance on the next test isn't necessarily because the test is easier; it's because you've gotten better at that type of thinking.

Speech therapists face similar challenges. When they teach communication skills in therapy sessions, they need to know if those skills transfer to real-world conversations. But measuring that transfer? That's where things get messy.

The impact of carry-over effects on experimental validity

Here's where things get serious. Carry-over effects don't just add noise to your data - they systematically bias your results. As statisticians point out, these effects compromise your experiment's internal validity. You think you're measuring the impact of your new feature, but you're actually measuring a mix of your feature plus whatever baggage users are carrying from their previous experiences.

Clinical trials have dealt with this problem for decades. Imagine testing two medications where the first one stays in someone's system for weeks. Even with a washout period, you can't be sure the second medication's effects are purely its own. The same principle applies to your product experiments, just with different timescales.

Microsoft's experimentation team discovered that online experiments face unique challenges. Cookie churn is a big one - when users clear cookies or switch devices, they might get re-randomized into different experiment groups. Suddenly, someone who learned your interface in the control group is experiencing the treatment as if they're a new user. Your "clean" randomization isn't so clean anymore.

The psychometric testing world offers another cautionary tale. Give someone the same IQ test twice, and they'll almost always score higher the second time. Not because they got smarter overnight, but because practice effects are real and powerful.

Strategies to mitigate carry-over effects in experimental design

So what can you actually do about this? Good news: researchers have developed several effective strategies over the years.

Counterbalancing is your first line of defense. Instead of showing all users condition A then condition B, you randomize the order. Half see A then B, half see B then A. This way, any carry-over effects get distributed across both conditions instead of systematically favoring one. It's not perfect, but it's way better than nothing.
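
To make that concrete, here's a minimal sketch of deterministic counterbalancing. The assign_order helper, the experiment name, and the hashing scheme are illustrative assumptions rather than any particular platform's API; the point is simply that each user gets a stable, roughly 50/50 ordering.

```python
import hashlib

def assign_order(user_id: str, experiment: str = "checkout_test") -> list[str]:
    """Deterministically counterbalance treatment order for a user.

    Half of users see A then B, the other half B then A, so carry-over
    gets spread across both orderings instead of always hitting the
    second condition.
    """
    # Hash the user ID so the assignment is stable across sessions.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return ["A", "B"] if int(digest, 16) % 2 == 0 else ["B", "A"]

# Example: assign_order("user_42") returns the same ordering every time for that user.
```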

Washout periods can help, but they're tricky to get right. In theory, you wait long enough between treatments for the effects to wear off. But how long is long enough? For a UI change, maybe a few days. For a fundamental workflow change, maybe weeks. You need to understand your specific context.
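
If you do use a washout, it helps to enforce it in code rather than by convention. Here's a toy sketch with a made-up helper and a washout_days value you'd tune to your own context:

```python
from datetime import datetime, timedelta

def eligible_for_next_treatment(last_exposure: datetime,
                                now: datetime,
                                washout_days: int = 7) -> bool:
    """True only once the washout window has fully elapsed.

    The right washout_days is a judgment call: a few days for a small
    UI tweak, possibly weeks for a fundamental workflow change.
    """
    return now - last_exposure >= timedelta(days=washout_days)

# Example: eligible_for_next_treatment(datetime(2025, 6, 1), datetime(2025, 6, 12)) -> True
```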

Here's where things get practical. At Statsig, we've seen teams successfully use these approaches:

  • Run between-subjects designs when possible (different users see different versions)

  • Use statistical models that explicitly account for treatment order

  • Monitor for signs of carry-over in your metrics

  • Design shorter experiments to minimize accumulation of effects

Statistical tests for carry-over effects exist, but they're not magic bullets. Mixed-effects models can help separate true treatment effects from carry-over, but they require careful setup and interpretation.
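
For illustration, here's what that kind of model could look like with statsmodels. The data layout (one row per user per period, with columns user_id, metric, treatment, and order) is an assumption for the sketch, not something prescribed above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per user per period, with a
# continuous metric, the treatment shown ("A"/"B"), and the order in
# which the user saw it (1 = first, 2 = second).
df = pd.read_csv("exposures.csv")

# A random intercept per user absorbs within-user correlation, while the
# order term soaks up systematic first-vs-second differences (learning,
# fatigue) so they don't get attributed to the treatment itself.
model = smf.mixedlm("metric ~ C(treatment) + C(order)", data=df,
                    groups=df["user_id"])
result = model.fit()
print(result.summary())
```

A significant order term is evidence that carry-over is present; it tells you to interpret the treatment estimate with extra care, not that the problem is solved.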

The debate around practice effects in cognitive testing reminds us that not everyone agrees on how serious these effects are. Some argue they're overblown. But when you're making million-dollar decisions based on experiment results, it's better to be cautious.

Best practices for interpreting results in the presence of carry-over effects

Let's talk about what to do when you suspect carry-over effects are messing with your results.

First, look for the warning signs. Plot your metrics over time. Do you see gradual shifts that don't align with your experiment timeline? Are users who've been in the experiment longer showing different patterns than newcomers? These temporal patterns often reveal carry-over effects.
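
A quick way to eyeball those patterns, assuming a hypothetical daily export with date, variant, and metric columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export: one row per user per day, with the variant they
# saw and the metric you care about.
df = pd.read_csv("daily_metrics.csv", parse_dates=["date"])

# Daily mean per variant. Lines that drift in ways that don't line up
# with your launch timeline hint at carry-over (or novelty) effects.
daily = df.groupby(["date", "variant"])["metric"].mean().unstack("variant")
daily.plot(marker="o", title="Daily metric by variant")
plt.ylabel("Mean metric")
plt.show()
```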

Smart experiment design starts before you launch. Think through the user journey: where might previous experiences influence future behavior? Build in checkpoints to monitor for these effects. The teams at Statsig often help customers design experiments that preemptively address these concerns.

When you find carry-over effects, be honest about them. Don't bury the finding in a footnote. Call it out clearly: "Users who experienced the old checkout flow showed 15% lower conversion in the new flow, likely due to learned behaviors." Your stakeholders need the full picture to make good decisions.

Here's a practical framework for ongoing monitoring:

  1. Check for order effects in your analysis

  2. Compare early vs. late cohorts in your experiment (see the sketch after this list)

  3. Look for unexpected interactions between experiments

  4. Document any suspicions for future reference
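
As a rough sketch of step 2 (the column names and the median-split rule are assumptions for the example, not a prescribed method), you can compare the treatment lift for early versus late enrollees; a big gap between the two is a red flag worth documenting:

```python
import pandas as pd
from scipy import stats

# Hypothetical per-user export: when they enrolled, which variant they
# got, and the metric of interest.
df = pd.read_csv("experiment_users.csv", parse_dates=["enrollment_date"])

# Median split into early vs. late enrollees.
cutoff = df["enrollment_date"].median()
df["cohort"] = (df["enrollment_date"] > cutoff).map({False: "early", True: "late"})

for cohort, grp in df.groupby("cohort"):
    treat = grp.loc[grp["variant"] == "treatment", "metric"]
    ctrl = grp.loc[grp["variant"] == "control", "metric"]
    t, p = stats.ttest_ind(treat, ctrl, equal_var=False)
    print(f"{cohort}: lift={treat.mean() - ctrl.mean():.3f}, t={t:.2f}, p={p:.3f}")

# Sharply different lifts in the two cohorts suggest carry-over, novelty,
# or interaction with another experiment - document it either way.
```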

Working with experienced analysts makes a huge difference here. They've seen these patterns before and can spot subtle signs you might miss. Statistical expertise isn't just nice to have - it's essential for navigating these complexities.

Closing thoughts

Carry-over effects are one of those experimentation challenges that never fully go away. You can minimize them, account for them, and work around them, but they're always lurking in the background, ready to muddy your results.

The key is staying vigilant. Build carry-over considerations into your experiment design from day one. Use the mitigation strategies we've discussed. And when you do see signs of these effects, don't panic - adjust your analysis and interpretation accordingly.

Remember: perfect experiments don't exist. But understanding carry-over effects - and planning for them - gets you a lot closer to results you can actually trust.

Hope you find this useful!
