A/B testing helps you create the best version of a product tailored for your customers. E-commerce applications are inherently primed to implement A/B tests because the teams running these applications are already heavily metrics-driven and track conversion at every point. Yet more e-commerce customers ask us every day, “What do we test?”
Let’s set the basic context with the most common metrics in e-commerce and then get into some playbooks on what to test.
The most common metrics for e-commerce businesses are conversions, average order value, frequency of purchase, customer lifetime value, and customer acquisition cost.
Conversions span the entire customer journey, from search to cart, cart to checkout, checkout to purchase, and first purchase to repeat purchase.
Average order value (AOV) is a function of converting customer interest to intent, say through personalized recommendations, featured deals and promotions, shipping incentives, and so on.
Purchase frequency is a function of a positive customer experience, reinforced by historical familiarity, trust, and loyalty incentives.
Average order value and frequency of purchase determine the customer lifetime value (CLV), which sets the upper bound for customer acquisition cost (CAC).
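To make the relationship between these metrics concrete, here is a minimal sketch of the arithmetic. The gross margin and retention period are assumptions added for illustration (the list above names only AOV and purchase frequency), and every number is hypothetical.

```python
def customer_lifetime_value(aov, orders_per_year, years_retained, gross_margin):
    """Simplified CLV: margin earned per order times the expected number of orders.

    gross_margin and years_retained are illustrative assumptions, not part of
    any standard definition referenced in this article.
    """
    return aov * gross_margin * orders_per_year * years_retained


# Hypothetical inputs for illustration only.
clv = customer_lifetime_value(
    aov=60.0, orders_per_year=4, years_retained=3, gross_margin=0.25
)

# Spending more than CLV to acquire a customer loses money, so CLV caps CAC.
max_cac = clv
print(f"CLV: ${clv:.2f}, CAC ceiling: ${max_cac:.2f}")
```

In this simplified model, raising either AOV or purchase frequency raises the ceiling on what you can afford to spend on acquisition.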
For experiments in e-commerce, conversion rates are often the primary metrics that determine the success or failure of an experiment. A statistically significant improvement in conversion marks the experiment as a good candidate to roll out to all users. This is because (a) conversion is actionable and sensitive to actions that a small team can test, and (b) improving conversion directionally improves output business metrics such as total gross merchandise value (GMV) that aren’t as actionable at the team level.
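As a rough illustration of what "statistically significant improvement in conversion" can mean in practice, the sketch below runs a two-sided two-proportion z-test using only the Python standard library. This is one common textbook approach, not necessarily the method any particular experimentation platform uses, and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and test (b).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical counts: 10,000 users per group.
lift, z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
significant = p_value < 0.05  # 0.05 is a conventional, assumed threshold
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}, significant={significant}")
```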
AOV and purchase frequency often serve as guardrail metrics to ensure that the team doesn’t over-index on short-term conversions at the expense of long-term customer sentiment and purchase behavior. Application performance also provides common guardrail metrics such as page load time or error and crash rates.
Borrowing from Booking.com, the first approach is to validate whether every update to the application has the expected impact. This method of ‘testing every atomic change’ is so effective that Booking.com enjoys conversion rates 2–3x higher than the industry average. Stuart Frisby, Director of Design at Booking.com, explains their approach:
Almost every product change is wrapped in a controlled experiment. From entire redesigns and infrastructure changes to the smallest bug fixes, these experiments allow us to develop and iterate on ideas safer and faster by helping us validate that our changes to the product have the expected impact on the user experience. If it can be a test, test it. If we can’t test it, we probably don’t do it.
Booking.com also runs “non-inferiority tests” to identify any regressions in guardrail metrics such as error rates and customer support inquiries. For example, when they introduced the “Print Receipt” feature, they ran an A/B test to measure the impact of the new feature on Customer Support calls. The experiment suggested a 0.78% increase, less than the pre-defined threshold of 2%, marking this experiment a success.
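A non-inferiority check like this can be sketched as a one-sided test that the guardrail metric has not grown by more than the allowed margin. The code below is a minimal illustration, not Booking.com's actual method: it assumes the 2% threshold is a relative margin on the support-call rate, uses a normal approximation, and runs on invented counts chosen only to mirror a 0.78% relative increase.

```python
from math import sqrt
from statistics import NormalDist


def is_non_inferior(events_c, n_c, events_t, n_t, rel_margin=0.02, alpha=0.05):
    """One-sided test of H0: rate_t >= (1 + rel_margin) * rate_c.

    Rejecting H0 means the test group's rate is credibly below the allowed
    ceiling, i.e. the change is non-inferior on this guardrail.
    Returns (non_inferior, z statistic).
    """
    p_c = events_c / n_c
    p_t = events_t / n_t
    ceiling = 1 + rel_margin
    diff = p_t - ceiling * p_c
    se = sqrt(p_t * (1 - p_t) / n_t + ceiling**2 * p_c * (1 - p_c) / n_c)
    z = diff / se
    z_crit = NormalDist().inv_cdf(alpha)  # negative critical value
    return z < z_crit, z


# Hypothetical counts: 2,000,000 users per group, 0.78% relative increase.
ok, z = is_non_inferior(events_c=100_000, n_c=2_000_000,
                        events_t=100_780, n_t=2_000_000)
print(f"non-inferior={ok}, z={z:.2f}")
```

Note that tight relative margins need large samples: with the same rates and far fewer users, the observed increase would still sit under 2%, but the test could not rule out a larger true regression.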
The second approach is to set a top-down direction based on an essential, unchanging customer need. As Jeff Bezos said about Amazon.com, “We don’t make money when we sell things. We make money when we help customers to make purchase decisions.”
“Working backwards” from an aspirational vision while staying relentless about course-correcting is a playbook that Amazon has perfected. Perhaps what sets Amazon apart is its ability to embrace failure as organizational learning, which makes the company’s cultural traits heavily path-dependent. Bezos has explained this in some detail:
You really can’t accomplish anything important if you aren’t stubborn on vision. But you need to be flexible about the details because you gotta be experimental to accomplish anything important, and that means you’re gonna be wrong a lot. You’re gonna try something on your way to that vision, and that’s going to be the wrong thing, you’re gonna have to back up, take a course correction, and try again. Most large organizations embrace the idea of invention, but are not willing to suffer the string of failed experiments necessary to get there.
A key aspect of this playbook is to ask what’s the smallest big step you can take to test the riskiest assumption of your vision. Ideally, this experimental step will generate measurable results that either meaningfully validate your assumption or pointedly surprise you with an insight that changes your assumption. For example, if you’re testing product pricing and assume that customers always prefer lower prices, an experiment may reveal that below a certain price range your customers begin to lose trust in your product. Not surprisingly, there is a lot of room to experiment with pricing in e-commerce!
The second level of this playbook is to recognize behavioral characteristics that help users achieve their objectives. In one example, adding a customer testimonial improved credibility with the users and increased conversion rate by 35%.
The third level of this playbook includes tactical steps to remove unwanted friction. Any action that requires the user to slow down adds a point of friction. If it doesn’t add value to the user at some point, it’s unwanted friction. In one example, reducing input fields to only what’s necessary (and adding security certification with improved button copy) increased revenue per order by 56%.
Poor presentation of information can also add unwanted friction. In another example, structuring product information and highlighting a single CTA increased conversion rate by 58%.
Removing unwanted friction is an ongoing, iterative effort. One of the best books that have helped me identify and address unwanted friction in e-commerce applications is Don’t Make Me Think by Steve Krug. It’s a short and delightful read!
The third approach focuses on growth. For example, Pinterest’s dedicated growth team focuses on conversion, turning prospective users into active users. To improve conversion, they come up with ideas for improvements, use experiments to measure the change, and analyze results before rolling out the change to all users.
Pinterest initially set up a bottom-up approach where individual team members were tasked with coming up with new ideas, but found that team members didn’t know how to come up with high-quality ideas. Their recent Experiment Idea Review (EIR) process now requires team members to actively build the skills for generating high-quality ideas and measures their performance based on these ideas.
For example, the EIR process requires team members to clearly outline the problem, hypothesis, opportunity size, and expected impact of their proposed experiment in a document. Team leads review these documents ahead of a team review to spot any gaps and further flesh out these ideas. After a review, the team greenlights promising proposals and adds them to a backlog. With each experiment, the growth team builds more resources and improves their skills to raise the bar for the next set of ideas.
While this is admittedly the least concrete approach, think of it as a meta-approach to build the clock that tells the time rather than simply telling the time when someone asks. Leading by example and hiring thoughtful growth leaders may be the most meaningful takeaways here.
What’s best for you depends on your leaders, your organizational culture, and how deeply your organization cares about incorporating data in decision making. At Statsig, we help e-commerce organizations of all sizes bootstrap their experimentation, whether it is in service of their culture, vision, or growth.
But every approach begins with running the first experiment.
If you’re already using feature flags to ship software, you can run an A/B test with no additional effort: see Statsig’s smart feature gates to kick off an A/B test within minutes.
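To see why a feature flag and an A/B test are so closely related, consider how a partial rollout typically decides who sees a feature. The sketch below is a generic illustration of deterministic, hash-based assignment; it is not Statsig's API or implementation, and the function name, experiment name, and bucket count are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, rollout_pct: float = 50.0) -> str:
    """Deterministically place a user in 'test' or 'control'.

    Hashing the experiment name together with the user ID keeps assignments
    stable across sessions and independent between experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < rollout_pct * 100 else "control"


# Hypothetical usage: gate a new checkout flow for half of users.
variant = assign_variant("user-42", "new_checkout_flow")
print(variant)
```

Because the same user always lands in the same group, logging the assigned variant alongside conversion events is enough to compare the two groups later.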
The good news about getting started is that it automatically generates data that fuels more new ideas for growth and experimentation.
Want to chat more about your e-commerce application and find ideas to experiment with in your business? Join the conversation on the Statsig Slack channel.