The Causal Roundup is a biweekly review of industry-leading work in causality. From experimentation to causal inference, we share work from teams who are building the future of product decision-making. In this week’s edition, we focus on processes and infrastructure - the force multipliers for every team.
A key aspect of building an experimentation culture is standardizing how different teams execute and interpret experiments. Earlier this summer, Booking.com explained how they get to the heart of the issue: running bad experiments is just a very expensive and convoluted way to make unreliable decisions.
What really matters to us is not how many product decisions are made, nor how fast decisions are made, but how good those decisions are.
They define a three-point rubric to assess each decision based on design, execution, and shipping criteria. Design establishes the fundamentals by ensuring that teams pay attention to the power of an experiment and record the outcomes they expect. Execution ensures that teams don’t compromise on the duration of the experiment. Shipping formalizes the go/no-go decision based on pre-established shipping criteria.
At a tactical level, ratings for these three dimensions track the performance of a team and department over time. More importantly, at a strategic level, the team responsible for the experimentation platform measures how they’re influencing the quality of company-wide decisions and how their customers (internal teams) are using the platform tools. This enables them to constantly improve their tooling to better serve company-wide objectives around the quality of product decisions.
While Booking.com has shared a lot of awesome work on product experimentation, this is the only instance we’ve seen in the wild of a company setting and raising the bar on its decision-making process.
Before going into our next story, it’s worth recounting four broad patterns that we see for serving measurable properties of a system to users:
Offline Metrics for Reporting and Experimentation: These are computed for a given period and feed into regularly generated reports or experiment analysis. Aside from the pressure of delivering reports on time, the system serving offline metrics generally bears no latency-based constraints.
Interactive Analytics for Exploration: When data analysts or scientists want to roll up their sleeves to explore the data, they use predefined dimension cuts via a dashboard or an interactive query interface that returns data within a few seconds.
Feature Backfill for Model Training: Computing features for model training is similar to computing offline metrics with one additional constraint: point-in-time correctness for features that must be historically accurate. For example, a model may use a feature that counts a given user’s logins over the five minutes ending at 11pm on a day a month ago.
Feature Serving for Online Inference: When machine learning models use live features to construct the user experience in real-time, say offering recommendations on where to stay in a city, this requires a row-oriented storage layout that can serve read latencies on the order of ~10 ms.
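Point-in-time correctness is the subtle one in that list. Here’s a minimal sketch of what it means, using hypothetical login events and a made-up `login_count_5min` helper: when backfilling a feature "as of" a past timestamp, only events that had already happened by that timestamp may be counted.

```python
from datetime import datetime, timedelta

# Toy login events for one user (hypothetical data): (user_id, timestamp).
events = [
    ("u1", datetime(2021, 10, 1, 22, 56)),
    ("u1", datetime(2021, 10, 1, 22, 58)),
    ("u1", datetime(2021, 10, 1, 23, 30)),
]

def login_count_5min(events, user_id, as_of):
    """Point-in-time correct count: only events strictly before `as_of`,
    within the 5-minute window ending at `as_of`, are visible."""
    start = as_of - timedelta(minutes=5)
    return sum(1 for uid, ts in events
               if uid == user_id and start <= ts < as_of)

# Backfilling the feature "as of" 11pm sees only the first two logins;
# the 11:30pm event hadn't happened yet, so it must not leak in.
print(login_count_5min(events, "u1", datetime(2021, 10, 1, 23, 0)))  # → 2
```

A backfill that naively counted all three events would leak future information into training data, which is exactly the bug point-in-time correctness exists to prevent.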
With me so far? Now on to the story…
As Airbnb scaled, their leaders found that different teams consuming the same application data reported different numbers for simple business metrics. And there was no easy way to know which number was correct. To create one source of truth and use it everywhere, the team built a metrics platform, Minerva.
Define metrics once, and use them everywhere
This metrics platform serves the top two needs I mentioned above: reporting and analytics. However, unlike common reporting and analytics use cases, reporting for experimentation is unique because metrics are only a starting point. We must first join these metrics with user assignment data from the experiment and then compute summary statistics for analysis. Minerva supplies the “raw events” to Airbnb’s Experimentation Reporting Framework (ERF), which joins the raw event data with the assignment data and then calculates the summary statistics. This mirrors how many of our customers record experiment event data, along with metrics from their data warehouse, into Statsig to analyze their experiments. Cool!
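The join-then-summarize step is simple enough to sketch. This is a minimal, hypothetical example (not Airbnb’s or Statsig’s actual pipeline): raw metric events are joined with assignment data, users who weren’t assigned to the experiment drop out, and per-variant summary statistics come out the other end.

```python
import statistics

# Hypothetical experiment assignment data: user -> variant.
assignments = {"u1": "control", "u2": "control", "u3": "test", "u4": "test"}

# Raw metric events, e.g. supplied by a metrics platform: (user, value).
events = [("u1", 3.0), ("u2", 5.0), ("u3", 6.0), ("u4", 8.0), ("u9", 99.0)]

# Step 1: join metric events with assignment data; users not in the
# experiment (like u9) are excluded from the analysis.
by_variant = {}
for user, value in events:
    variant = assignments.get(user)
    if variant is not None:
        by_variant.setdefault(variant, []).append(value)

# Step 2: compute summary statistics per variant for experiment analysis.
summary = {
    variant: {"n": len(vals),
              "mean": statistics.mean(vals),
              "stdev": statistics.stdev(vals)}
    for variant, vals in by_variant.items()
}
print(summary)
```

In production these are distributed joins over billions of rows, but the logical shape - join on user, group by variant, summarize - is the same.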
Looking further into Airbnb’s data management infrastructure (and this is the fun part!)… Minerva is Airbnb’s metric store that serves the first two needs, and Zipline is Airbnb’s feature store that serves the last two needs. There’s significant overlap between the two, particularly in performing long-running offline computations. So I was tickled when I heard about Ziperva, the new converged data store that’s enjoying a successful alpha at Airbnb. Unifying and scaling data management across the company: let’s put a pin in that and come back to it in a future edition! 📍
At Statsig, we talk a lot about experiments. A bit off the beaten track, this experiment is truly about saving lives.
To determine the efficacy of portable air filters (HEPA filters¹) in clearing SARS-CoV-2, a U.K. team installed these filters in two fully occupied COVID-19 wards: a general ward and an ICU. The team collected air samples from these wards for a week with the air filters switched on, and then for two weeks with the filters turned off.
They found SARS-CoV-2 particles in the air when the filter was off but not when it was on. Also surprisingly, the team didn’t find many viral particles in the air of the ICU ward, even when the filter there was off. Here, it’s a quick read! If only we could run more experiments, we would answer a lot more questions 😀
This year’s Nobel prize for Economics went to David Card, Joshua Angrist, and Guido Imbens for their contributions to the analysis of causal relationships using natural experiments. While Card has analyzed key societal questions such as the impact of immigration and minimum wages on employment and jobs, Angrist and Imbens have developed new methods to show that natural experiments are rich sources of knowledge to answer such societal questions. Are you there yet on going all in on causal relationships and experimentation?!
LinkedIn explains their end-to-end explainability system, Intellige, which answers the critical “so what?” questions about machine learning model predictions. While the current state of the art in model explainability identifies the top features that influence model predictions, Intellige offers the rationale behind model predictions to make them actionable for users.
Teads, a global media platform, talks about their A/B testing analysis framework and the infrastructure behind it. It (a) performs pre-aggregation of logs (Spark), (b) runs a query engine (Amazon Athena), and (c) publishes results on a dashboard. An improvement over their previous analysis tools (Jupyter Notebooks, BigQuery), there’s a lot to like here about building a reliable architecture for analysis, even if it’s a bit lighter on the user assignment component (check out the bigger picture if you’re assessing building your own platform vs. buying a service).
Netflix has a lovely blog post on building intuition for statistical significance by flipping coins. Explaining p-value in simple language is no picnic but they make it interesting!
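In the same coin-flipping spirit, here’s a minimal simulation of the idea (my own sketch, not code from the Netflix post): the p-value is how often pure chance produces a result at least as extreme as the one you observed.

```python
import random

random.seed(0)

def simulated_p_value(observed_heads, n_flips, n_trials=100_000):
    """One-sided p-value by simulation: how often does a fair coin
    produce at least `observed_heads` heads in `n_flips` flips?"""
    hits = 0
    for _ in range(n_trials):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            hits += 1
    return hits / n_trials

# A fair coin lands 60+ heads out of 100 only about 3% of the time,
# so observing 60 heads is decent evidence the coin may be biased.
print(simulated_p_value(60, 100))
```

If the simulated frequency is below your significance threshold (say 0.05), you reject the “fair coin” null hypothesis - the same logic behind declaring an experiment’s lift statistically significant.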
I hope you’re as excited as I am about an ever-growing number of teams employing experimentation and uncovering the true causes of user behavior to improve their product decisions. Every day, we hear from growth teams that really get the value of experimentation. As you scale your growth team, hit us with your questions and we’ll do everything we can to share the best tools, processes, and infrastructure to set you up for success, whether it’s with Statsig or not. Join our Slack channel to also learn from other growth teams who’re cracking new ways to grow their business every day.
¹ HEPA, or high-efficiency particulate air, filters blow air through a fine mesh that catches extremely small particles.
Thanks to our support team, our customers can feel like Statsig is a part of their org and not just a software vendor. We want our customers to know that we're here for them.
Migrating experimentation platforms is a chance to cleanse tech debt, streamline workflows, define ownership, promote democratization of testing, educate teams, and more.
Calculating the right sample size means balancing the level of precision desired, the anticipated effect size, the statistical power of the experiment, and more.
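As a taste of that balancing act, here’s a standard two-sample z-test approximation for per-group sample size (a textbook formula sketched in stdlib Python; the function name and defaults are my own, not from the linked article):

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, sd, alpha=0.05, power=0.8):
    """Approximate observations needed per group to detect an absolute
    difference `effect_size` in a metric with standard deviation `sd`,
    using the two-sample z-test formula n = 2 * ((z_a + z_b) * sd / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sd / effect_size) ** 2
    return math.ceil(n)

# Detecting a 1-point lift in a metric with sd=10, at 80% power
# and alpha=0.05, needs about 1,570 users per group.
print(sample_size_per_group(effect_size=1.0, sd=10.0))  # → 1570
```

The trade-offs fall right out of the formula: halving the detectable effect quadruples the required sample, and demanding more power or a stricter alpha pushes `n` up as well.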
The term 'recency bias' has been all over the statistics and data analysis world, stealthily skewing our interpretation of patterns and trends.
A lot has changed in the past year. New hires, new products, and a new office (or two!). GB Lee tells the tale alongside pictures and illustrations:
A deep dive into CUPED: Why it was invented, how it works, and how to use CUPED to run experiments faster and with less variance.
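The core CUPED trick fits in a few lines. This is my own minimal sketch on synthetic data (not code from the deep dive): subtract the part of the experiment metric that is predictable from pre-experiment data, which shrinks variance without shifting the mean.

```python
import random

random.seed(1)

# Synthetic data: `pre` is each user's pre-experiment metric, `post` the
# in-experiment metric; users active before tend to stay active.
pre = [random.gauss(10, 3) for _ in range(1000)]
post = [x + random.gauss(2, 1) for x in pre]

# CUPED adjustment: theta = cov(pre, post) / var(pre), then
# Y_cuped = Y - theta * (X - mean(X)).
mx = sum(pre) / len(pre)
my = sum(post) / len(post)
theta = (sum((x - mx) * (y - my) for x, y in zip(pre, post))
         / sum((x - mx) ** 2 for x in pre))
adjusted = [y - theta * (x - mx) for x, y in zip(pre, post)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Same mean, much lower variance -> smaller effects detectable sooner.
print(variance(post), variance(adjusted))
```

Because the correction term has mean zero, the adjusted metric keeps the same expected value while the noise inherited from pre-experiment differences between users is removed - that’s where the “run experiments faster” payoff comes from.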