Balancing scale, cost, and performance in experimentation systems

Tue Feb 11 2025

Pushpendra Nagtode

Data Engineer, Statsig

Yuzheng Sun, PhD

Data Scientist, Statsig

A/B testing is easy to start but challenging to scale without a well-designed data platform.

Costs can rise quickly because joining user metrics with exposure logs, a critical step in A/B testing, is expensive.

Beyond cost, a poorly designed system can be error-prone and difficult to debug. As the vendor of an experimentation platform hosting over 70,000 active experiments and processing anywhere from thousands to trillions of events, we’ve learned how important a trustworthy system is.

This post presents key observations from designing an elastic and efficient experimentation system (EEES):

  • Cost: An analysis of major cost components and effective strategies to reduce costs.

  • Design: Separation of metric definitions from logging to maintain log integrity and enable end-to-end data traceability.

  • Technologies: Our transition from Databricks to Google BigQuery and in-house solutions, including motivations and trade-offs.

Cost

Handling vast data volumes requires big data technologies like Databricks, Snowflake, and Spark. Managing costs on these platforms is challenging, so we developed strategies to keep them in check.

Observability is the first step. We built a dashboard and alerting system on BigQuery to identify pipeline bottlenecks and allocate resources efficiently. We optimize the cost-performance balance by analyzing both cost and runtime metrics and aligning them with our Service Level Agreements (SLAs).
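As a minimal sketch of this kind of cost observability, the snippet below aggregates recent BigQuery job costs by a per-pipeline label using the INFORMATION_SCHEMA jobs view. The `pipeline` label, the `acme-experimentation` project id, and the per-TiB rate are illustrative assumptions, not our actual configuration.

```python
# Sketch: daily cost observability over BigQuery jobs, grouped by a "pipeline" label.
# Assumes jobs are tagged with that label when submitted; rate and project are placeholders.
from google.cloud import bigquery

ON_DEMAND_USD_PER_TIB = 6.25  # assumption: on-demand analysis pricing

QUERY = """
SELECT
  DATE(creation_time) AS run_date,
  (SELECT value FROM UNNEST(labels) WHERE key = 'pipeline') AS pipeline,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
  SUM(total_slot_ms) / 1000 / 3600 AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY run_date, pipeline
ORDER BY tib_billed DESC
"""

def report_pipeline_costs(client: bigquery.Client) -> None:
    """Print the costliest pipelines over the last week so spikes are easy to spot."""
    for row in client.query(QUERY).result():
        est_cost = (row.tib_billed or 0) * ON_DEMAND_USD_PER_TIB
        print(f"{row.run_date} {row.pipeline}: {row.tib_billed:.2f} TiB (~${est_cost:,.0f}), "
              f"{row.slot_hours:.0f} slot-hours")

if __name__ == "__main__":
    report_pipeline_costs(bigquery.Client(project="acme-experimentation"))
```

A dashboard built on a query like this makes it obvious which pipelines dominate spend, which is what makes targeted optimization possible.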

Backfilling data due to bugs is costly. Bugs are inevitable, but their fallout can be contained: we introduced processes that catch data quality issues early, minimizing the need for backfills. Our custom orchestrator also improved task management, reducing the cost of duplicate runs.
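One way an orchestrator can avoid duplicate runs is to key every task on the pipeline and partition it materializes and skip work that already succeeded. The sketch below illustrates that idea with hypothetical names and an in-memory registry; it is not our orchestrator's actual design, which would use a durable store.

```python
# Sketch: idempotent task keys to prevent duplicate (and costly) pipeline runs.
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict

@dataclass(frozen=True)
class TaskKey:
    pipeline: str       # e.g. "exposures" or "metrics"
    partition: date     # the daily partition this run materializes

class Orchestrator:
    def __init__(self) -> None:
        self._completed: Dict[TaskKey, bool] = {}

    def run_once(self, key: TaskKey, task: Callable[[], None]) -> bool:
        """Run the task unless this exact partition already succeeded."""
        if self._completed.get(key):
            print(f"skip {key}: already materialized")
            return False
        task()
        self._completed[key] = True
        return True

# Usage: a retried backfill of the same partition becomes a no-op.
orch = Orchestrator()
key = TaskKey("metrics", date(2025, 2, 10))
orch.run_once(key, lambda: print("computing metrics for 2025-02-10"))
orch.run_once(key, lambda: print("computing metrics for 2025-02-10"))  # skipped
```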

Resource allocation and orchestration: Most results must be computed daily, which causes resource spikes. We plan resource allocation ahead of time and collaborate with providers to ensure capacity is available.

On BigQuery, we separate compute reservations by company size to maintain performance. On Dataproc, we use spot nodes and distribute workloads to ensure node availability.
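The routing itself can be as simple as bucketing each tenant's workload by volume. The sketch below is illustrative only; the thresholds and pool names are invented, and the point is just that the largest tenants get isolated capacity so they cannot starve everyone else.

```python
# Sketch: route each customer's daily jobs to a compute pool sized for them.
# Thresholds and pool names are hypothetical, not Statsig's actual configuration.
def pick_compute_pool(daily_events: int) -> str:
    if daily_events > 10_000_000_000:
        return "reservation-xl"      # dedicated capacity for the largest tenants
    if daily_events > 100_000_000:
        return "reservation-large"
    return "reservation-shared"      # everyone else shares a baseline pool

assert pick_compute_pool(50_000_000) == "reservation-shared"
assert pick_compute_pool(25_000_000_000) == "reservation-xl"
```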

The key lesson for reducing cost is that it is hard to predict ex ante, but with proper observability and a focus on the largest cost components, we can keep driving cost down over time. As supply-chain optimization theory suggests, cost rewards predictability.

Design

Pipelines for computing experimental results include several key components:

  • Streaming platform: This platform ingests raw exposures and events, ensuring all incoming data is captured in real-time and stored in a raw data layer for further processing.

  • Imports: When users have events stored in their own data warehouses, pipelines import this data into the raw data layer, creating a unified data source.

  • Exposures pipeline: Responsible for computing both initial and cumulative exposures for experiments, it aggregates data to track user exposure to different experimental conditions over time.

  • Metrics pipeline: Computes various metrics, such as funnel conversions and performance indicators, based on user-level raw event data. This transforms raw data into actionable metrics for evaluating experimental outcomes.

  • Final experimental results: Computes results by joining metrics with exposures, letting us assess the impact of each experimental condition and derive insights (a minimal sketch of this join follows the list).
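Here is a minimal PySpark sketch of that final step: joining exposures with user-level metrics and aggregating per variant. The table and column names (experiment_id, variant, user_id, metric_value) are assumptions for illustration; a real pipeline would also compute variances, confidence intervals, and cumulative exposure windows.

```python
# Sketch: per-variant experiment results from an exposures x user-metrics join.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("experiment-results").getOrCreate()

exposures = spark.table("raw.exposures")             # experiment_id, variant, user_id, first_exposure_ts
user_metrics = spark.table("derived.user_metrics")   # user_id, metric_name, metric_value

results = (
    exposures
    .join(user_metrics, "user_id", "left")           # keep exposed users who logged no events
    .groupBy("experiment_id", "variant", "metric_name")
    .agg(
        F.countDistinct("user_id").alias("units"),
        F.mean("metric_value").alias("mean"),
        F.stddev("metric_value").alias("stddev"),
    )
)
results.write.mode("overwrite").saveAsTable("derived.experiment_results")
```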

The key learning is to separate raw data (logging) from derived data (metrics). Put all logging in one place: it serves as the source of truth and is fundamental to everything downstream.

Centralize metric definitions and the pipelines that generate metrics. This prevents data quality from deteriorating over time.
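A sketch of what centralized, declarative metric definitions can look like is below. The schema, metric names, and SQL rendering are illustrative assumptions; the point is that metrics are declared once and derived from raw events rather than logged directly.

```python
# Sketch: metrics declared in one place and compiled into the queries the
# metrics pipeline runs. All names and filters here are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    source_event: str          # raw event the metric is derived from
    aggregation: str           # e.g. "count", "sum", "mean"
    value_field: Optional[str] = None

METRICS = [
    MetricDefinition("checkout_conversions", source_event="checkout", aggregation="count"),
    MetricDefinition("revenue", source_event="purchase", aggregation="sum", value_field="amount_usd"),
]

def to_sql(metric: MetricDefinition) -> str:
    """Render one metric definition into a user-level aggregation query."""
    expr = "COUNT(*)" if metric.aggregation == "count" else f"{metric.aggregation.upper()}({metric.value_field})"
    return (f"SELECT user_id, '{metric.name}' AS metric_name, {expr} AS metric_value "
            f"FROM raw.events WHERE event_name = '{metric.source_event}' GROUP BY user_id")

for m in METRICS:
    print(to_sql(m))
```

Because every metric flows through the same definition-to-SQL path, a fix or rename happens in one place instead of drifting across ad hoc queries.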

Technologies

Transitioning from thousands to trillions of events offered insights into big data technologies. We initially used Databricks for its ease of use but faced challenges as we grew. The setup became complex, necessitating a more robust solution.

We moved to BigQuery, which offered improvements with its serverless model and SQL interface. However, pipeline bottlenecks and cost increases emerged. We implemented cost observability and separate compute reservations, saving 50% in costs while meeting SLAs.

Realizing the need for a more advanced solution, we revisited Spark with Apache Iceberg. Iceberg’s Storage Partition Join feature was promising for our resource-intensive pipelines. Migrating to Spark with Iceberg reduced costs by 50% while maintaining SLAs, achieving an optimal cost-performance balance.
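For readers evaluating the same path, here is a hedged PySpark sketch of the settings that, per the Spark and Iceberg documentation as we understand them, enable storage-partitioned joins; both tables must share the same partitioning (for example, bucketing on user_id) for the shuffle to be avoided. Table names are illustrative.

```python
# Sketch: enabling Iceberg's storage-partitioned join so the exposures/metrics
# join can skip the shuffle. Both sides are assumed to be Iceberg tables
# bucketed on user_id with identical partition specs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spj-join")
    .config("spark.sql.sources.v2.bucketing.enabled", "true")
    .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
    .config("spark.sql.requireAllClusterKeysForCoPartition", "false")
    .config("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled", "true")
    .config("spark.sql.iceberg.planning.preserve-data-grouping", "true")
    .getOrCreate()
)

# With matching bucket specs, Spark joins corresponding buckets directly
# instead of shuffling either table.
exposures = spark.table("iceberg.derived.exposures")
user_metrics = spark.table("iceberg.derived.user_metrics")
joined = exposures.join(user_metrics, "user_id")
```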


To summarize:

The architecture of an EEES is dynamic and will continue to evolve with new technologies and further optimizations.

We’re sharing our learnings to help others avoid costly mistakes, but the more important takeaway is to build your system to be flexible, observe it closely, and be ready to make changes.
