Costs can rise quickly because of the join between user metrics and exposure logs, a critical yet expensive step in A/B testing.
Beyond cost, a poorly designed system can be error-prone and difficult to debug. As an experimentation platform vendor hosting over 70,000 active experiments and processing anywhere from thousands to trillions of events, we’ve learned the importance of a trustworthy system.
This paper presents key observations in designing an elastic and efficient experimentation system (EEES):
Cost: An analysis of major cost components and effective strategies to reduce costs.
Design: Separation of metric definitions from logging to maintain log integrity and enable end-to-end data traceability.
Technologies: Our transition from Databricks to Google BigQuery and in-house solutions, including motivations and trade-offs.
Handling such vast data volumes requires big data technologies like Databricks, Snowflake, and Spark, and managing their cost is challenging. We developed several strategies to address this.
Observability is the first step. We created a dashboard and alerting system on BigQuery to identify pipeline bottlenecks and allocate resources efficiently. We optimize the cost-performance balance by analyzing both cost and performance metrics against our Service Level Agreements (SLAs).
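To make the observability idea concrete, here is a minimal sketch of the kind of cost-attribution query this implies, using BigQuery's INFORMATION_SCHEMA job metadata. The "pipeline" label, region, and time window are illustrative assumptions, not our exact setup.

```python
# Sketch: surface the most expensive pipeline stages from BigQuery job metadata.
# Assumes jobs are tagged with a "pipeline" label; label name and region are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'pipeline') AS pipeline,
  COUNT(*)                                        AS jobs,
  SUM(IFNULL(total_bytes_billed, 0)) / POW(1024, 4) AS tib_billed,
  SUM(IFNULL(total_slot_ms, 0)) / (1000 * 3600)     AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY pipeline
ORDER BY tib_billed DESC
"""

for row in client.query(QUERY).result():
    print(f"{row.pipeline or 'untagged':<30} {row.tib_billed:8.2f} TiB  {row.slot_hours:8.1f} slot-hours")
```

Ranking pipelines this way is what lets us focus optimization effort on the largest cost drivers first.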
Backfilling data because of bugs is costly. Bugs are inevitable, but costly backfills can be prevented: we introduced processes that surface data quality issues early, minimizing the need for backfills. Our custom orchestrator also improved task management, reducing the cost of duplicate runs.
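The sketch below shows the kind of early data-quality gate we mean: a cheap check on a raw partition before any downstream results are computed from it, so bad data is caught before it forces a backfill. The table, columns, and thresholds are hypothetical.

```python
# Hypothetical pre-promotion check: validate a day's raw partition before
# downstream pipelines run, so bad data never triggers an expensive backfill.
from google.cloud import bigquery

client = bigquery.Client()

def partition_looks_healthy(table: str, ds: str, min_rows: int, max_null_rate: float) -> bool:
    """Cheap sanity checks on one date partition; thresholds are illustrative."""
    sql = f"""
      SELECT
        COUNT(*) AS row_count,
        COUNTIF(user_id IS NULL) / GREATEST(COUNT(*), 1) AS null_user_rate
      FROM `{table}`
      WHERE DATE(event_timestamp) = DATE '{ds}'
    """
    row = next(iter(client.query(sql).result()))
    return row.row_count >= min_rows and row.null_user_rate <= max_null_rate

# Only kick off downstream exposure/metric computation if the gate passes.
if partition_looks_healthy("analytics.raw_events", "2024-01-01",
                           min_rows=1_000_000, max_null_rate=0.001):
    pass  # trigger downstream tasks in the orchestrator
```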
Resource allocation and orchestration: Most results need to be computed daily, which causes resource spikes. We plan resource allocation ahead of time and collaborate with our providers to ensure capacity is available.
On BigQuery, we separate compute reservations by company size to maintain performance. On Dataproc, we use spot nodes and distribute workloads to ensure node availability.
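As a hypothetical illustration of the reservation idea: tenants can be bucketed by size and their jobs routed to projects backed by separate BigQuery reservations, so one very large customer cannot starve smaller ones. The tier boundaries and project names below are assumptions, not our production values.

```python
# Hypothetical routing of jobs to size-tiered compute pools. Each project here is
# assumed to have its own BigQuery reservation assignment (configured out of band),
# so heavy tenants cannot starve light ones; tiers and names are illustrative.
from google.cloud import bigquery

RESERVATION_PROJECTS = {
    "small":  "exp-results-small",   # shared reservation for small customers
    "medium": "exp-results-medium",
    "large":  "exp-results-large",   # dedicated slots for the heaviest tenants
}

def client_for(company_daily_events: int) -> bigquery.Client:
    """Pick the compute pool for a tenant based on its event volume."""
    if company_daily_events < 10_000_000:
        tier = "small"
    elif company_daily_events < 1_000_000_000:
        tier = "medium"
    else:
        tier = "large"
    return bigquery.Client(project=RESERVATION_PROJECTS[tier])

# Jobs submitted through this client queue and bill against that tier's reservation.
client = client_for(company_daily_events=250_000_000)
```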
The key learning on cost is that it is hard to predict ex ante, but with proper observability and by identifying the largest cost drivers, we can keep reducing it over time. As supply chain optimization theory suggests, cost rewards predictability.
Pipelines for computing experimental results include several key components:
Streaming platform: This platform ingests raw exposures and events, ensuring all incoming data is captured in real-time and stored in a raw data layer for further processing.
Imports: When users have events stored in their own data warehouses, pipelines import this data into the raw data layer, creating a unified data source.
Exposures pipeline: Responsible for computing both initial and cumulative exposures for experiments, it aggregates data to track user exposure to different experimental conditions over time.
Metrics pipeline: Computes various metrics, such as funnel conversions and performance indicators, based on user-level raw event data. This transforms raw data into actionable metrics for evaluating experimental outcomes.
Final experimental results: Computes results by joining metrics against exposures, enabling us to assess the impact of different experimental conditions and derive insights (a minimal sketch of this step follows the list).
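Here is a minimal PySpark sketch of that last step: joining each user's first exposure to their metric values and aggregating per variant. The table names, columns, and aggregations are illustrative assumptions rather than our production schema.

```python
# Minimal sketch of the final results step: join each user's first exposure to
# their metric values and aggregate per variant. Names and schema are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("experiment-results").getOrCreate()

# exposures: one row per (experiment_id, user_id) with the assigned variant
exposures = spark.table("exp.first_exposures").select(
    "experiment_id", "user_id", "variant", "first_exposure_ts"
)

# metrics: one row per (user_id, metric_name, ds) with a numeric value
metrics = spark.table("exp.user_daily_metrics")

results = (
    exposures
    .join(metrics, "user_id")
    # only count metric values observed after the user was first exposed
    .where(F.col("ds") >= F.to_date("first_exposure_ts"))
    .groupBy("experiment_id", "variant", "metric_name")
    .agg(
        F.countDistinct("user_id").alias("users"),
        F.sum("value").alias("total"),
        F.avg("value").alias("mean"),
        F.stddev("value").alias("stddev"),
    )
)
results.write.mode("overwrite").saveAsTable("exp.experiment_results")
```

This join is the expensive step mentioned at the start: every metric row must meet every exposure row, which is why the storage and partitioning choices discussed later matter so much.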
The key learning is to separate raw data (logging) from derived data (metrics). Keep all logging in one place, since the logs are the source of truth and everything else is derived from them.
Centralize metric definitions and the pipelines that generate metrics; this prevents data quality from deteriorating over time.
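As an illustration of what centralizing definitions can look like, the hypothetical sketch below declares metrics once as data and compiles per-user queries from those declarations, instead of letting each team hand-write its own. The schema and helper are assumptions, not our actual format.

```python
# Hypothetical centralized metric registry: metrics are declared once, as data,
# and the metrics pipeline compiles them into queries. The schema is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str            # stable identifier used in results and dashboards
    source_table: str    # raw event table (the source of truth, never mutated)
    value_expr: str      # SQL expression evaluated per event
    aggregation: str     # how per-user values roll up: "sum", "max", "avg", ...
    filter_expr: str = "TRUE"

METRIC_REGISTRY = {
    m.name: m
    for m in [
        MetricDefinition(
            name="checkout_conversion",
            source_table="raw.events",
            value_expr="IF(event_name = 'checkout', 1, 0)",
            aggregation="max",
        ),
        MetricDefinition(
            name="revenue",
            source_table="raw.purchases",
            value_expr="amount_usd",
            aggregation="sum",
            filter_expr="status = 'completed'",
        ),
    ]
}

def user_level_query(metric: MetricDefinition, ds: str) -> str:
    """Compile one registered metric into a per-user aggregation query."""
    return f"""
      SELECT user_id, {metric.aggregation}({metric.value_expr}) AS value
      FROM {metric.source_table}
      WHERE ds = '{ds}' AND {metric.filter_expr}
      GROUP BY user_id
    """
```

Because every metric reads only from the raw layer and goes through one compiler, a definition change is applied consistently everywhere and results remain traceable back to the logs.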
Scaling from thousands to trillions of events taught us a great deal about big data technologies. We initially chose Databricks for its ease of use but faced challenges as we grew: the setup became complex, necessitating a more robust solution.
We moved to BigQuery, which offered improvements with its serverless model and SQL interface. However, pipeline bottlenecks and cost increases emerged. We implemented cost observability and separate compute reservations, saving 50% in costs while meeting SLAs.
Realizing the need for a more advanced solution, we revisited Spark with Apache Iceberg. Iceberg’s Storage Partition Join feature was promising for our resource-intensive pipelines. Migrating to Spark with Iceberg reduced costs by 50% while maintaining SLAs, achieving an optimal cost-performance balance.
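Below is a hedged sketch of enabling Iceberg's storage-partitioned join in Spark, based on the configuration keys documented for recent Spark 3.x and Iceberg releases; verify the exact keys and defaults against the versions you run. The idea is that when both tables are partitioned the same way on the join key (e.g. bucketed by user_id), Spark can join matching partitions directly and skip the shuffle. Catalog name, warehouse path, and table names are illustrative.

```python
# Sketch of enabling Iceberg storage-partitioned joins (SPJ) so exposure/metric
# joins avoid a full shuffle. Config keys follow the Spark/Iceberg docs for
# Spark 3.4+ / Iceberg 1.2+; verify against your versions. Names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spj-experiment-join")
    # Iceberg catalog wiring (illustrative catalog name and warehouse path).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.exp", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.exp.type", "hadoop")
    .config("spark.sql.catalog.exp.warehouse", "gs://my-bucket/warehouse")
    # Storage-partitioned join: report and preserve partitioning from the scan...
    .config("spark.sql.sources.v2.bucketing.enabled", "true")
    .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
    .config("spark.sql.iceberg.planning.preserve-data-grouping", "true")
    .config("spark.sql.requireAllClusterKeysForCoPartition", "false")
    # ...and keep Spark from broadcasting instead of using the co-partitioned join.
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

# Both tables are assumed to be bucketed on user_id, e.g. created with
# PARTITIONED BY (bucket(512, user_id)), so matching buckets join directly.
exposures = spark.table("exp.db.first_exposures")
metrics = spark.table("exp.db.user_daily_metrics")
joined = exposures.join(metrics, "user_id")
```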
The architecture of an EEES is dynamic and will continue to evolve with new technologies and further optimizations.
We’re sharing our learnings to help others avoid costly mistakes, but the more important takeaway is to build your system to be flexible, observe it closely, and be ready to make changes.