Our experimentation and analytics platform ingests petabytes of raw data, processes it in real time and in batch, and delivers insights to thousands of companies, from startups to tech giants such as OpenAI, Atlassian, Flipkart, and Figma.
But handling 100+ PB of data daily, supporting trillions of events per day, and optimizing for cost and performance doesn’t come easy.
This article dives into the architectural decisions, scaling challenges, and lessons learned in building one of the largest, most efficient experimentation platforms in the industry.
Here are some numbers to further illustrate the complexity of the system:
• 100+ PB of data processed daily across multiple data pipelines and analytics queries
• 2M+ queries served per day from ETL jobs, ad-hoc queries, and console analytics
• 2,000+ companies’ data processed daily, ranging from small datasets to trillions of rows
• 10+ PB total storage footprint, optimized with strict data retention policies
Building a massive data platform comes with serious challenges—from handling unstructured data and scaling efficiently to managing spiky workloads and ensuring data quality.
Here are some of the key hurdles:
Our platform’s flexibility allows users to log events in JSON format and configure various metrics in the console. Consequently, our system must dynamically parse trillions of raw events based on diverse user configurations to compute experimental results (a simplified sketch of this config-driven parsing follows the list of challenges below).
Over the past year, our data volumes have increased twentyfold. Scaling efficiently while keeping costs under control remains a significant challenge.
Customers send data through various means, including real-time SDK events, scheduled and on-demand direct imports from customer warehouses across time zones, and blob storage like S3. Managing orchestration across these different sources and thousands of companies is a complex problem requiring a robust scheduling and execution framework.
Our workloads are spiky, with roughly 70% of compute demand occurring in the early morning Pacific time, when we compute experimental results for our major Pacific-time customers. This pattern limits our ability to purchase compute resources upfront at discounted prices.
Detecting anomalies and setting appropriate thresholds is challenging when dealing with thousands of companies whose data volumes range from a few rows to trillions. For example, some customers’ volumes drop 70% over weekends, while others spike on weekends compared to weekdays.
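To make the first challenge concrete, here is a minimal sketch of what config-driven event parsing can look like. The event shape, config fields, and function names are illustrative assumptions, not Statsig’s actual schema or code.

```python
import json

# Hypothetical metric configurations a customer might define in the console.
# Field names are illustrative only.
METRIC_CONFIGS = [
    {"name": "checkout_revenue", "event": "purchase", "value_field": "amount_usd", "aggregation": "sum"},
    {"name": "search_count", "event": "search", "value_field": None, "aggregation": "count"},
]

def parse_event(raw: str, configs: list[dict]) -> list[tuple[str, float]]:
    """Turn one raw JSON event into zero or more (metric_name, value) pairs,
    driven entirely by the customer-supplied configuration."""
    event = json.loads(raw)
    results = []
    for cfg in configs:
        if event.get("event_name") != cfg["event"]:
            continue
        if cfg["aggregation"] == "count":
            results.append((cfg["name"], 1.0))
        elif cfg["value_field"] is not None:
            value = event.get("metadata", {}).get(cfg["value_field"])
            if value is not None:
                results.append((cfg["name"], float(value)))
    return results

# Example: one raw SDK event fanned out into metric updates.
raw_event = '{"event_name": "purchase", "user_id": "u1", "metadata": {"amount_usd": 42.5}}'
print(parse_event(raw_event, METRIC_CONFIGS))  # [('checkout_revenue', 42.5)]
```

At trillions of events per day, the hard part is not the parsing itself but doing this fan-out efficiently for every customer’s distinct configuration.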
Statsig employs a dual approach to data ingestion, catering to both real-time and batch processing needs:
Real-time events: Utilizing our in-house streaming infrastructure, events generated via SDKs are ingested in real time. This setup ensures immediate data availability for time-sensitive analyses.
Batch imports: We have established dedicated batch import pipelines that connect natively to customer data warehouses. These pipelines facilitate the efficient transfer of data into the Statsig data lake, hosted on Google Cloud Storage (GCS).
Our Extract, Transform, Load (ETL) processes are built upon BigQuery, Apache Spark and Apache Iceberg. The ETL pipelines are designed to compute experimental results, accommodating the specific time zones of our customers by running on hourly or daily schedules.
Processed data is stored in our data warehousing solutions, combining the strengths of Google BigQuery and Apache Iceberg. This hybrid storage approach enables efficient querying and analysis, supporting both high-performance analytics and large-scale data processing needs.
Access to processed data is facilitated through multiple interfaces:
Statsig Console: A user-friendly platform where customers and internal teams can interact with data, configure experiments, and monitor outcomes.
Real-time metric explorer: This tool provides immediate insights into key metrics, allowing for dynamic exploration and analysis.
Ad-hoc queries: For more customized analyses, users can perform ad-hoc queries, enabling deep dives into specific data subsets as needed.
This architecture ensures that Statsig’s data platform remains robust, scalable, and responsive to the diverse needs of our users.
When Statsig first launched, our data infrastructure was built on Databricks (Azure), a choice that worked well in the early days when we had a small set of customers.
Databricks provided a scalable, managed environment for data processing, but as our data volumes grew, inefficiencies started to emerge. The increasing scale of experiments and analytics placed significant pressure on both cost and performance, prompting us to rethink our architecture.
To address these challenges, we migrated to BigQuery, drawn by its scalability, ease of use, and fully managed nature. BigQuery allowed us to quickly analyze large datasets and significantly simplified our ETL processes. However, as adoption increased, so did costs.
The default settings made it easy to run expensive queries, leading to escalating cloud expenses. To manage this, we built custom tooling that optimized reservations and workload separation, reducing costs by 60% while maintaining performance.
As our data volume continued to expand—processing trillions of rows across thousands of customers—BigQuery began to show limitations in large-scale joins, particularly due to shuffle inefficiencies. Queries that required extensive data movement struggled with performance bottlenecks, leading us to explore alternative solutions.
We revisited Apache Iceberg with Spark and conducted a proof-of-concept (POC) using Iceberg’s new partitioned join features. The results were promising, significantly improving performance for massive jobs and reducing computational overhead.
Today, our platform operates in a hybrid model, leveraging BigQuery for analytical workloads and Spark/Iceberg for large-scale data processing.
This architecture allows us to optimize for both cost and performance, ensuring that each workload runs in the most efficient environment.
As we continue to scale, we remain committed to evaluating and adopting emerging technologies that enhance the efficiency, reliability, and cost-effectiveness of our data platform.
As the Statsig data platform scaled to process 100+ petabytes of data daily, we faced significant technical challenges in performance, cost efficiency, and system flexibility.
To address these, we developed custom solutions tailored to our specific needs. Below are some of the most impactful optimizations and engineering decisions that enabled us to scale effectively.
Initially, our pipelines were built using dbt, which served well in the early stages but proved too rigid as our data infrastructure evolved. The need for greater flexibility across SQL, Python, and Spark led us to develop SBT (Statsig Builder Tool)—a custom code generation framework that supports multi-language workflows.
SBT optimizes query execution, cost efficiency, and ETL performance, allowing us to build modular and efficient data pipelines while reducing operational overhead.
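SBT is an internal tool and its interface is not described in this post, so the snippet below is a purely hypothetical sketch of what a multi-language pipeline definition in a framework like it might look like. Every name and parameter here is invented for illustration.

```python
# Hypothetical sketch only: SBT's real API is internal and not shown in this post.
from dataclasses import dataclass, field

@dataclass
class PipelineNode:
    name: str
    language: str                      # "sql", "python", or "spark"
    depends_on: list = field(default_factory=list)
    body: str = ""

def sql_node(name, query, depends_on=()):
    return PipelineNode(name, "sql", list(depends_on), query)

def spark_node(name, job, depends_on=()):
    return PipelineNode(name, "spark", list(depends_on), job)

# A toy DAG mixing SQL and Spark steps, similar in spirit to what a
# code-generation framework could emit, validate, and schedule.
nodes = [
    sql_node("daily_exposures", query="SELECT ... FROM raw_events WHERE ..."),
    spark_node("join_metrics", job="jobs/join_metrics.py", depends_on=["daily_exposures"]),
    sql_node("experiment_results", query="SELECT ...", depends_on=["join_metrics"]),
]
```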
Our data platform ingests data from multiple sources, including real-time event streaming, scheduled batch imports, and ad-hoc customer uploads.
Traditional workflow orchestrators like Airflow and Dagster lacked the flexibility to handle these diverse ingestion methods efficiently.
To overcome this, we built a custom orchestration system that supports multi-timezone scheduling, workload separation, and dynamic resource allocation. This has enabled us to scale seamlessly across different workloads while maintaining cost efficiency.
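As a small illustration of the multi-timezone piece, the sketch below computes each company’s next local-time run and converts it to UTC for a central scheduler. The company names, time zones, and the 2:00 AM run time are placeholder assumptions.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

# Illustrative sketch: decide when each company's daily result computation
# should run, in that company's own time zone. Names and zones are made up.
COMPANIES = {
    "acme_us": "America/Los_Angeles",
    "acme_in": "Asia/Kolkata",
    "acme_eu": "Europe/Berlin",
}

def next_daily_run(company: str, run_at: time = time(2, 0)) -> datetime:
    """Return the next 2:00 AM local-time run for a company, expressed in UTC."""
    tz = ZoneInfo(COMPANIES[company])
    now_local = datetime.now(tz)
    candidate = now_local.replace(hour=run_at.hour, minute=run_at.minute,
                                  second=0, microsecond=0)
    if candidate <= now_local:
        candidate += timedelta(days=1)
    return candidate.astimezone(ZoneInfo("UTC"))

for company in COMPANIES:
    print(company, next_daily_run(company).isoformat())
```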
BigQuery provides a highly scalable and easy-to-use environment for running large-scale analytical workloads. However, as our data volume grew, cost efficiency and performance optimization became a key focus. We invested significant effort in fine-tuning our jobs and data models to achieve the best possible cost-performance ratio.
One of the most critical aspects of this optimization was understanding how BigQuery’s autoscaler works and correlating query patterns with GCP billing data. By building a detailed query lineage from the GCP billing dashboard, we identified how different queries contributed to costs and pinpointed inefficiencies.
To further optimize performance, we iteratively refined our data models, ensuring the right partitioning and clustering strategies were in place. This helped reduce unnecessary full table scans and improve query execution times.
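For readers unfamiliar with these levers, here is a minimal sketch of creating a day-partitioned, clustered table with the google-cloud-bigquery client. The project, table, and column names are placeholders, not our actual data model.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder schema; table and column names are illustrative only.
schema = [
    bigquery.SchemaField("company_id", "STRING"),
    bigquery.SchemaField("metric_name", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("value", "FLOAT"),
]

table = bigquery.Table("my-project.analytics.metric_events", schema=schema)
# Partition by day so daily ETL scans only the partitions it needs...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
# ...and cluster on the columns most queries filter by, to prune data within a partition.
table.clustering_fields = ["company_id", "metric_name"]

client.create_table(table, exists_ok=True)
```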
Additionally, we conducted multiple iterations to identify performance bottlenecks in our ETL jobs, redesigning them to align better with BigQuery’s execution engine and storage model.
Through these optimizations, we significantly improved query performance, reduced storage and compute costs, and enhanced overall system efficiency.
Maintaining high data quality at scale is critical for ensuring reliable analytics and experimentation.
To detect anomalies and prevent bad data from propagating through our pipelines, we built a custom data quality framework that seamlessly integrates with our data pipeline DAGs.
Our framework is designed to automatically validate incoming data, apply custom rules tailored to each dataset, and either alert or block pipeline execution in case of quality issues. Over multiple iterations, we have fine-tuned thresholds and validation mechanisms based on company-specific requirements, ensuring a balance between false positives and genuine anomalies.
One of the key aspects of this system is its modular design, allowing us to easily add and modify validation rules as our data evolves. By integrating these checks directly into our ETL workflows, we ensure that bad data is caught before it impacts downstream analytics and experimentation results.
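Below is a minimal sketch of what one such modular check could look like; the rule shape, thresholds, and dataset names are illustrative assumptions rather than our actual framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    passed: bool
    blocking: bool
    message: str

def row_count_check(name: str, expected: float, tolerance: float,
                    blocking: bool = True) -> Callable[[int], CheckResult]:
    """Flag a dataset whose row count deviates too far from a baseline
    (for example, a rolling average computed upstream)."""
    def check(actual: int) -> CheckResult:
        deviation = abs(actual - expected) / max(expected, 1)
        passed = deviation <= tolerance
        return CheckResult(name, passed, blocking,
                           f"rows={actual}, baseline={expected}, deviation={deviation:.0%}")
    return check

# Per-dataset rules: a customer with known weekend dips gets a looser, non-blocking threshold.
checks = {
    "company_a.events": row_count_check("row_count", expected=1_000_000, tolerance=0.3),
    "company_b.events": row_count_check("row_count", expected=50_000, tolerance=0.7, blocking=False),
}

result = checks["company_a.events"](420_000)
if not result.passed:
    action = "block pipeline" if result.blocking else "alert on-call"
    print(f"[data-quality] {action}: {result.message}")
```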
This framework has been instrumental in maintaining data integrity, reducing the operational burden on on-call engineers, and improving trust in our platform’s data reliability.
Through continuous iteration, we have optimized our anomaly detection algorithms and validation processes, making data quality a proactive, automated safeguard rather than a reactive firefighting effort.
Many of our customers run experiments that span several months, generating enormous datasets that require continuous processing.
Instead of scanning months of data repeatedly, we adopted a cumulative computation model that stores incremental results, drastically reducing query execution times and storage costs.
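The idea is easiest to see with running aggregates. The sketch below keeps cumulative (count, sum, sum of squares) per experiment variant and folds in only the newest day’s partial aggregates; the state shape and numbers are illustrative, not our production model.

```python
from dataclasses import dataclass

@dataclass
class MetricState:
    """Cumulative sufficient statistics for one metric in one experiment variant."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def merge_day(self, day_n: int, day_sum: float, day_sum_sq: float) -> "MetricState":
        # Fold in one day's partial aggregates instead of rescanning all history.
        return MetricState(self.n + day_n, self.total + day_sum, self.total_sq + day_sum_sq)

    @property
    def mean(self) -> float:
        return self.total / self.n if self.n else 0.0

    @property
    def variance(self) -> float:
        if self.n < 2:
            return 0.0
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)

# Day 1 state, then day 2's partials merged in; means and variances stay computable
# for the experiment readout without touching day 1's raw events again.
state = MetricState().merge_day(day_n=10_000, day_sum=52_000.0, day_sum_sq=300_000.0)
state = state.merge_day(day_n=9_500, day_sum=49_400.0, day_sum_sq=288_000.0)
print(round(state.mean, 3), round(state.variance, 3))
```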
BigQuery offers exceptional query performance, but without proactive cost management, expenses can quickly spiral out of control.
To ensure efficient resource utilization, we built a cost observability framework that allows us to:
• Track cost per company and workload, enabling precise chargeback models (see the sketch after this list)
• Identify anomalies and inefficiencies in query execution and storage usage
• Optimize query routing by dynamically adjusting workloads to different BigQuery reservations based on compute needs
• Conduct regular “war room” sessions and cost-focus weeks to continuously refine our optimization strategies
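As a sketch of the first item, per-workload cost attribution can be built on query labels and BigQuery’s `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view. The label key, region qualifier, and the choice of slot-hours as the cost proxy are assumptions for illustration, not a description of our internal tooling.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Roll up yesterday's slot usage and bytes billed by a hypothetical "company" query label.
sql = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'company') AS company,
  SUM(total_slot_ms) / 1000 / 3600 AS slot_hours,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY company
ORDER BY slot_hours DESC
"""

for row in client.query(sql).result():
    print(row.company, round(row.slot_hours, 1), "slot-hours,",
          round(row.tib_billed, 2), "TiB billed")
```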
One of the most expensive operations in distributed big data computing is data shuffling, particularly when joining massive datasets; in one of our jobs, we join datasets exceeding 100 terabytes (about 2 trillion rows).
To mitigate shuffle inefficiencies, we implemented Iceberg storage-partitioned joins, which significantly reduce the network and compute overhead associated with large-scale joins. This improvement has allowed us to process massive datasets efficiently without incurring excessive costs.
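Roughly, storage-partitioned joins require both Iceberg tables to be partitioned compatibly on the join key (for example, bucketed) and a few Spark SQL flags to be enabled. The sketch below reflects Spark 3.4-era settings; the flag names may differ across versions, and the catalog and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch: enable storage-partitioned joins (SPJ) between two Iceberg tables that
# share a compatible partition spec, e.g. bucket(512, user_id). Catalog setup for
# the Iceberg "lake" catalog is omitted; flag names reflect Spark 3.4-era SPJ settings.
spark = (
    SparkSession.builder
    .appName("spj-sketch")
    .config("spark.sql.sources.v2.bucketing.enabled", "true")
    .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
    .config("spark.sql.requireAllClusterKeysForCoPartition", "false")
    .getOrCreate()
)

exposures = spark.table("lake.exposures")   # partitioned by bucket(512, user_id)
metrics = spark.table("lake.metrics")       # partitioned by bucket(512, user_id)

# With SPJ, Spark can join matching partitions directly instead of shuffling
# both sides across the cluster.
joined = exposures.join(metrics, "user_id")
joined.writeTo("lake.joined_results").createOrReplace()
```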
To further enhance cost efficiency in Spark workloads, we leverage spot nodes on Google Cloud Dataproc, Google’s managed Spark service.
By dynamically allocating low-cost, preemptible instances, we achieve substantial cost reductions without sacrificing performance. This optimization has been a game-changer, enabling us to run large-scale data transformations at a fraction of the cost compared to on-demand instances.
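For context, the relevant knob in Dataproc is the secondary worker group’s preemptibility. The sketch below uses the google-cloud-dataproc Python client; the project, region, machine types, and worker counts are placeholders, and the exact field and enum names should be checked against the current API.

```python
from google.cloud import dataproc_v1

# Sketch: a Dataproc cluster whose secondary workers run on Spot capacity.
# Project, region, names, and sizes are placeholders for illustration.
region = "us-west1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",
    "cluster_name": "spark-etl-spot",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-8"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-16"},
        # The bulk of the compute comes from cheap, preemptible Spot instances.
        "secondary_worker_config": {
            "num_instances": 100,
            "preemptibility": "SPOT",
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # wait for cluster creation to finish
```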
Scaling the Statsig Data Platform has required constant iteration, architectural innovation, and deep cost optimization strategies.
By building custom orchestration tools, optimizing query execution, and leveraging efficient data storage and processing techniques, we have created a highly scalable and cost-effective experimentation platform.
As we continue to push the limits of big data infrastructure, we remain focused on exploring new technologies and refining our approach to deliver the best possible performance at scale.