The Causal Roundup #2

Anu Sharma
Wed Oct 13 2021

Processes and Infrastructure

The Causal Roundup is a biweekly review of industry-leading work in causality. From experimentation to causal inference, we share work from teams who are building the future of product decision-making. In this week’s edition, we focus on processes and infrastructure: the force multipliers for every team.

Raising the bar on product decisions 💪

A key aspect of building an experimentation culture is standardizing how different teams execute and interpret experiments. Earlier this summer, Booking.com explained how they get to the heart of the issue: running bad experiments is just a very expensive and convoluted way to make unreliable decisions.

What really matters to us is not how many product decisions are made, nor how fast decisions are made, but how good those decisions are.

They define a three-point rubric to assess each decision based on design, execution, and shipping criteria. Design establishes the fundamentals by ensuring that teams pay attention to the power of an experiment and record the outcomes they expect. Execution ensures that teams don’t compromise on the duration of the experiment. Shipping formalizes the go/no-go decision based on pre-established shipping criteria.

At a tactical level, ratings for these three dimensions track the performance of a team and department over time. More importantly, at a strategic level, the team responsible for the experimentation platform measures how they’re influencing the quality of company-wide decisions and how their customers (internal teams) are using the platform tools. This enables them to constantly improve their tooling to better serve company-wide objectives around the quality of product decisions.

While Booking.com has shared a lot of awesome work on product experimentation, this is the only instance we’ve seen in the wild of a company setting and raising the bar on their decision-making process.

Standardizing Data Consumption 🛤

Before going into our next story, it’s worth recounting four broad patterns that we see for serving measurable properties of a system to users:

  1. Offline Metrics for Reporting and Experimentation: These are computed for a given period and feed into regularly generated reports or experiment analysis. Aside from the pressure of delivering reports on time, the system serving offline metrics generally has no strict latency constraints.
  2. Interactive Analytics for Exploration: When data analysts or scientists want to roll up their sleeves to explore the data, they use predefined dimension cuts via a dashboard or an interactive query interface that returns data within a few seconds.
  3. Feature Backfill for Model Training: Computing features for model training is similar to computing offline metrics, with one additional constraint: point-in-time correctness, since features must be historically accurate. For example, a model may use a user’s login count over the previous 5 minutes, evaluated as of 11pm a month ago (see the sketch after this list).
  4. Feature Serving for Online Inference: When machine learning models use live features to construct the user experience in real time, say offering recommendations on where to stay in a city, they require a row-oriented storage layout that can serve reads with latencies on the order of ~10ms.
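
To make the point-in-time constraint in pattern 3 concrete, here’s a minimal Python sketch. The event log, feature, and function names are all hypothetical; this illustrates the idea, not Zipline’s (or any feature store’s) actual API.

```python
from datetime import datetime, timedelta

# Hypothetical raw event log: (user_id, event_time) pairs for login events.
login_events = [
    ("u1", datetime(2021, 9, 13, 22, 57)),
    ("u1", datetime(2021, 9, 13, 22, 59)),
    ("u1", datetime(2021, 9, 13, 23, 30)),  # happens after the as-of time below
    ("u2", datetime(2021, 9, 13, 22, 58)),
]

def logins_last_5_min(user_id: str, as_of: datetime) -> int:
    """Point-in-time-correct feature: logins in the 5 minutes before `as_of`.

    For backfill we may only count events that had already happened at `as_of`;
    using later events would leak future information into model training.
    """
    window_start = as_of - timedelta(minutes=5)
    return sum(
        1
        for uid, ts in login_events
        if uid == user_id and window_start <= ts <= as_of
    )

# Backfilling the feature "as of 11pm a month ago" for user u1.
print(logins_last_5_min("u1", datetime(2021, 9, 13, 23, 0)))  # -> 2 (the 23:30 login is excluded)
```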

With me so far? Now on to the story…

As Airbnb scaled, their leaders found that different teams consuming the same application data reported different numbers for simple business metrics. And there was no easy way to know which number was correct. To create one source of truth and use it everywhere, the team built a metrics platform, Minerva.

Define metrics once, and use them everywhere

This metrics platform serves the first two needs I mentioned above: reporting and analytics. However, reporting for experimentation differs from common reporting and analytics use cases because metrics are only a starting point: we must first join these metrics with user assignment data from the experiment and then compute summary statistics for analysis. Minerva supplies the “raw events” to Airbnb’s Experimentation Reporting Framework (ERF), which joins the raw event data with the assignment data and then calculates the summary statistics. This is exactly the pattern many of our customers follow: they record experiment event data, as well as metrics from their data warehouse, in Statsig to analyze their experiments. Cool!
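
If you’re curious what that join-then-summarize step looks like, here’s a minimal Python sketch. The data and names are made up for illustration; this is the general shape of the computation, not Minerva’s, ERF’s, or Statsig’s actual code.

```python
from statistics import mean, stdev

# Hypothetical assignment data: which experiment group each user was assigned to.
assignments = {"u1": "control", "u2": "control", "u3": "treatment", "u4": "treatment"}

# Hypothetical raw metric events from the metric store: (user_id, bookings).
metric_events = [("u1", 1), ("u2", 0), ("u3", 2), ("u4", 1)]

# 1. Join the metric events with the assignment data.
by_group = {}
for user_id, value in metric_events:
    group = assignments.get(user_id)
    if group is not None:  # drop users who weren't in the experiment
        by_group.setdefault(group, []).append(value)

# 2. Compute per-group summary statistics for the experiment analysis.
for group, values in by_group.items():
    print(group, "n =", len(values), "mean =", mean(values), "stdev =", stdev(values))
```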

Looking further into Airbnb’s data management infrastructure (and this is the fun part!)… Minerva is Airbnb’s metric store and serves the first two needs, while Zipline is Airbnb’s feature store and serves the last two. There’s significant overlap between the two, particularly in performing long-running offline computations. So I was tickled to hear about Ziperva, the new converged data store that’s enjoying a successful alpha at Airbnb. Unifying and scaling data management across the company: let’s put a pin in that and come back to it in a future edition! 📍

Experiments Save Lives 

At Statsig, we talk a lot about experiments. A bit off the beaten track, this experiment is truly about saving lives.

To determine the efficacy of portable air filters (HEPA filters¹) in clearing SARS-CoV-2, a U.K. team installed these filters in two fully occupied COVID-19 wards: a general ward and an ICU. The team collected air samples from these wards for a week with the air filters switched on, and then for two weeks with the filters turned off.

They found SARS-CoV-2 particles in the air when the filter was off but not when it was on. Also surprisingly, the team didn’t find many viral particles in the air of the ICU ward, even when the filter there was off. Here, it’s a quick read! If only we could run more experiments, we would answer a lot more questions 😀

Elsewhere in causal land…

  • This year’s Nobel Prize in Economics went to David Card, Joshua Angrist, and Guido Imbens for their contributions to the analysis of causal relationships using natural experiments. While Card has analyzed key societal questions such as the impact of immigration and minimum wages on employment, Angrist and Imbens have developed new methods showing that natural experiments are rich sources of knowledge for answering such societal questions. Are you ready to go all in on causal relationships and experimentation yet?!
  • LinkedIn explains their end-to-end explainability system, Intellige, which answers the critical “so what?” questions about machine learning model predictions. While the current state of the art in model explainability identifies the top features that influence model predictions, Intellige offers the rationale behind model predictions to make them actionable for users.
  • Teads, a global media platform, talks about their A/B testing analysis framework and the infrastructure behind it. It (a) pre-aggregates logs (Spark), (b) runs a query engine (Amazon Athena), and (c) publishes results on a dashboard. It’s an improvement over their previous analysis tools (Jupyter Notebooks, BigQuery), and there’s a lot to like here about building a reliable architecture for analysis, even if it’s a bit lighter on the user assignment component (check out the bigger picture if you’re assessing building your own platform vs. buying a service).
  • Netflix has a lovely blog post on building intuition for statistical significance by flipping coins. Explaining p-values in simple language is no picnic, but they make it interesting! (See the short coin-flip sketch after this list.)
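
The simulation below isn’t from the Netflix post; it’s just a minimal Python sketch of the same coin-flip intuition: under the null hypothesis of a fair coin, how often does a result look at least as extreme as the one we observed? The observed count and trial sizes are made up.

```python
import random

random.seed(0)
observed_heads = 60      # hypothetical observed result out of 100 flips
flips, trials = 100, 10_000

# Simulate the null hypothesis: a fair coin flipped 100 times, over and over.
at_least_as_extreme = sum(
    1
    for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(flips)) >= observed_heads
)

# The one-sided p-value is the fraction of fair-coin runs that look at least
# as extreme as what we observed (about 0.03 here).
print("one-sided p-value ≈", at_least_as_extreme / trials)
```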

I hope you’re as excited as I am about the ever-growing number of teams employing experimentation and uncovering the true causes of user behavior to improve their product decisions. Every day, we hear from growth teams that really get the value of experimentation. As you scale your growth team, hit us with your questions and we’ll do everything we can to share the best tools, processes, and infrastructure to set you up for success, whether it’s with Statsig or not. Join our Slack channel to also learn from other growth teams who are cracking new ways to grow their business every day.

[1] HEPA (high-efficiency particulate air) filters blow air through a fine mesh that catches extremely small particles.

