The Causal Roundup is a biweekly review of industry-leading work in causality. From experimentation to causal inference, we share work from teams who are building the future of product decision making. In this edition, we focus on processes and infrastructure, the force multipliers for every team.
A key aspect of building an experimentation culture is standardizing how different teams execute and interpret experiments. Earlier this summer, Booking explained how they get to the heart of the issue: Running bad experiments is just a very expensive and convoluted way to make unreliable decisions.
What really matters to us is not how many product decisions are made, nor how fast decisions are made, but how good those decisions are.
They define a three-point rubric to assess each decision based on design, execution, and shipping criteria. Design establishes the fundamentals by ensuring that teams pay attention to the power of an experiment and record the outcomes they expect. Execution ensures that teams don’t compromise on the duration of the experiment. Shipping formalizes the go/no-go decision based on pre-established shipping criteria.
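To make the design and execution criteria concrete, here is a minimal sketch of the kind of power calculation a team might run before launch. It is not Booking's method: it assumes a two-sided z-test on a conversion rate with equal-sized groups, and the function names, parameters, and traffic figures are all hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect a relative lift in a
    conversion rate with a two-sided z-test (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


def planned_duration_days(n_per_group, daily_users, groups=2):
    """Days needed to reach the required sample, assuming steady traffic
    that is split evenly across the groups."""
    return math.ceil(n_per_group * groups / daily_users)


# Hypothetical inputs: 10% baseline conversion, 5% relative lift, 20,000 users/day.
n = sample_size_per_group(baseline_rate=0.10, relative_lift=0.05)
days = planned_duration_days(n, daily_users=20_000)
print(f"{n} users per group, about {days} days")
```

Writing the planned duration down before launch is what lets an execution check ask whether the experiment actually ran as long as the design called for.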
At a tactical level, ratings for these three dimensions track the performance of a team and department over time. More importantly, at a strategic level, the team responsible for the experimentation platform measures how they’re influencing the quality of company-wide decisions and how their customers (internal teams) are using the platform tools. This enables them to constantly improve their tooling to better serve company-wide objectives around the quality of product decisions.
While Booking.com has shared a lot of awesome work on product experimentation, this is the only instance we’ve seen in the wild of a company setting and raising the bar on their decision making process.
Before going into our next story, it’s worth recounting four broad patterns that we see for serving measurable properties of a system to users:
With me so far? Now on to the story…
As Airbnb scaled, their leaders found that different teams consuming the same application data reported different numbers for simple business metrics. And there was no easy way to know which number was correct. To create one source of truth and use it everywhere, the team built a metrics platform, Minerva.
Define metrics once, and use them everywhere
This metrics platform serves the top two needs I mentioned above: reporting and analytics. Reporting for experimentation, however, differs from common reporting and analytics use cases because metrics are only a starting point. We must first join these metrics with user assignment data from the experiment and then compute summary statistics for analysis. Minerva supplies the “raw events” to Airbnb’s Experimentation Reporting Framework (ERF) and joins the raw event data with the assignment data; ERF then calculates the summary statistics. This mirrors how many of our customers record experiment event data, as well as metrics from their data warehouse, into Statsig to analyze their experiments. Cool!
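The join-then-summarize flow described above can be sketched in a few lines. This is an illustration only, not how Minerva, ERF, or Statsig implement it: the data shapes, the function name, and the choice of a normal-approximation confidence interval on the difference in means are all assumptions.

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, variance


def summarize_experiment(assignments, events, alpha=0.05):
    """Join raw metric events to experiment assignments, then compute
    per-group means and a confidence interval on their difference.

    assignments: dict of user_id -> "control" or "test" (hypothetical shape)
    events: iterable of (user_id, metric_value) rows (hypothetical shape)
    """
    # Step 1: the join. Keep only events from assigned users, summed per user.
    per_user = defaultdict(float)
    for user_id, value in events:
        if user_id in assignments:
            per_user[user_id] += value

    # Step 2: assigned users with no events still count, with a value of 0.
    groups = defaultdict(list)
    for user_id, group in assignments.items():
        groups[group].append(per_user.get(user_id, 0.0))

    # Step 3: summary statistics (each group needs at least two users).
    control, test = groups["control"], groups["test"]
    delta = mean(test) - mean(control)
    std_err = math.sqrt(variance(test) / len(test) + variance(control) / len(control))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "control_mean": mean(control),
        "test_mean": mean(test),
        "delta": delta,
        "ci": (delta - z * std_err, delta + z * std_err),
    }


# Tiny made-up example: "u9" was never assigned, so its event is dropped.
assignments = {"u1": "control", "u2": "control", "u3": "test", "u4": "test"}
events = [("u1", 1.0), ("u1", 2.0), ("u3", 5.0), ("u4", 3.0), ("u9", 100.0)]
print(summarize_experiment(assignments, events))
```

Counting assigned users who never fired an event as zeros matters: dropping them would inflate both group means and can bias the comparison.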
Looking further into Airbnb’s data management infrastructure (and this is the fun part!)… Minerva is Airbnb’s metric store that serves the first two needs, and Zipline is Airbnb’s feature store that serves the last two needs. There’s significant overlap between the two, particularly in performing long-running offline computations. So I was tickled when I heard about Ziperva, the new converged data store that’s enjoying a successful alpha at Airbnb. Unifying and scaling data management across the company: let’s put a pin in that and come back to it in a future edition! 📍
At Statsig, we talk a lot about experiments. This next one is a bit off the beaten track: it is truly about saving lives.
To determine the efficacy of portable air filters (HEPA filters¹) in clearing SARS-CoV-2, a U.K. team installed these filters in two fully occupied COVID-19 wards: a general ward and an ICU. The team collected air samples from these wards for a week with the air filters switched on, and then for two weeks with the filters turned off.
They found SARS-CoV-2 particles in the air when the filter was off but not when it was on. Surprisingly, the team didn’t find many viral particles in the air of the ICU ward, even when the filter there was off. The study is a quick read! If only we could run more experiments, we would answer a lot more questions 😀
I hope you’re as excited as I am about the ever-growing number of teams employing experimentation and uncovering the true causes of user behavior to improve their product decisions. Every day, we hear from growth teams that really get the value of experimentation. As you scale your growth team, hit us with your questions and we’ll do everything we can to share the best tools, processes, and infrastructure to set you up for success, whether it’s with Statsig or not. Join our Slack channel to learn from other growth teams who are cracking new ways to grow their business every day.
¹ HEPA, or high-efficiency particulate air, filters blow air through a fine mesh that catches extremely small particles.