How Statsig’s data platform processes hundreds of petabytes daily

Wed Feb 12 2025

Pushpendra Nagtode

Data Engineer, Statsig

At Statsig, we believe in data-driven decision-making at scale.

Our experimentation and analytics platform ingests petabytes of raw data, processes it in real time and in batch, and delivers insights to thousands of companies, from startups to tech giants, including OpenAI, Atlassian, Flipkart, and Figma.

But handling 100+ PB of data daily, supporting trillions of events per day, and optimizing for cost and performance doesn't come easy.

This article dives into the architectural decisions, scaling challenges, and lessons learned in building one of the largest, most efficient experimentation platforms in the industry.

The scale of the Statsig data platform

Here are some numbers to further illustrate the complexity of the system:

100+ PB of data processed daily across multiple data pipelines and analytics queries

2M+ queries served per day from ETL jobs, ad-hoc queries, and console analytics

2,000+ companies’ data processed daily, ranging from small datasets to trillions of rows

10+ PB total storage footprint, optimized with strict data retention policies

Charts: data processed by date; queries served by date

Key challenges

Building a massive data platform comes with serious challenges—from handling unstructured data and scaling efficiently to managing spiky workloads and ensuring data quality.

Here are some of the key hurdles:

Unstructured data

Our platform’s flexibility allows users to log events in JSON format and configure various metrics in the console. Consequently, our system must dynamically parse trillions of raw events based on diverse user configurations to compute experimental results.
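
To make this concrete, here is a minimal sketch of deriving a metric from raw JSON events with PySpark. The event shape and the metric configuration below are hypothetical illustrations, not Statsig's actual schema or configuration format.

```python
# Minimal sketch: deriving one metric from raw JSON events with PySpark.
# The event schema and `metric_config` are hypothetical illustrations.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("parse-raw-events").getOrCreate()

event_schema = T.StructType([
    T.StructField("company_id", T.StringType()),
    T.StructField("user_id", T.StringType()),
    T.StructField("event_name", T.StringType()),
    T.StructField("timestamp", T.TimestampType()),
    T.StructField("metadata", T.MapType(T.StringType(), T.StringType())),
])

# Hypothetical per-company metric configuration from the console.
metric_config = {"name": "checkout_revenue", "event": "purchase", "value_key": "revenue_usd"}

# One JSON object per line; bucket path is illustrative.
raw = spark.read.text("gs://example-bucket/raw-events/2025-02-12/")
events = raw.select(F.from_json("value", event_schema).alias("e")).select("e.*")

metric = (
    events
    .where(F.col("event_name") == metric_config["event"])
    .withColumn("value", F.col("metadata")[metric_config["value_key"]].cast("double"))
    .groupBy("company_id", "user_id")
    .agg(F.sum("value").alias(metric_config["name"]))
)
metric.show()
```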

Scaling with cost efficiency

Over the past year, our data volumes have increased twentyfold. Scaling efficiently while keeping costs under control remains a significant challenge.

Multiple data ingestion channels

Customers send data through various means, including real-time SDK events, scheduled and on-demand direct imports from customer warehouses across time zones, and blob storage like S3. Managing orchestration across these different sources and thousands of companies is a complex problem requiring a robust scheduling and execution framework.

Spiky workloads

Our workloads are spiky: roughly 70% of our compute demand lands in the early morning PST, when we compute experimental results for our largest PST-based customers. This pattern limits our ability to purchase compute resources upfront at discounted prices.

Maintaining data quality

Detecting anomalies and setting appropriate thresholds is challenging when dealing with thousands of companies whose data volumes range from a few rows to trillions. For example, some customers' volumes drop 70% over weekends, while others spike on weekends relative to weekdays.

Statsig’s data architecture

Diagram: Statsig's data architecture

Data ingestion

Statsig employs a dual approach to data ingestion, catering to both real-time and batch processing needs:

  • Real-time events: Events generated via our SDKs are ingested in real time through our in-house streaming infrastructure. This setup ensures immediate data availability for time-sensitive analyses.

  • Batch imports: We have established dedicated batch import pipelines that connect natively to customer data warehouses. These pipelines facilitate the efficient transfer of data into the Statsig data lake, hosted on Google Cloud Storage (GCS).
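
As a rough illustration of the batch path, the sketch below pulls a one-day slice from a customer warehouse (BigQuery here, via the spark-bigquery connector) and lands it in a GCS-backed lake. The table name, bucket path, and partition column are illustrative, not our actual layout.

```python
# Minimal sketch of a batch import step: pull a day's slice from a customer
# warehouse and write it to the GCS-backed data lake. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-import").getOrCreate()

source = (
    spark.read.format("bigquery")
    .option("table", "customer_project.analytics.events")   # hypothetical source table
    .option("filter", "event_date = '2025-02-12'")           # push down a single-day slice
    .load()
)

(
    source.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("gs://example-datalake/imports/customer_x/events/")  # hypothetical lake path
)
```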

ETL processing (Spark + Iceberg)

Our Extract, Transform, Load (ETL) processes are built upon BigQuery, Apache Spark and Apache Iceberg. The ETL pipelines are designed to compute experimental results, accommodating the specific time zones of our customers by running on hourly or daily schedules.

Data warehousing (BigQuery + Iceberg)

Processed data is stored in our data warehousing solutions, combining the strengths of Google BigQuery and Apache Iceberg. This hybrid storage approach enables efficient querying and analysis, supporting both high-performance analytics and large-scale data processing needs.

Data serving and analytics

Access to processed data is facilitated through multiple interfaces:

  • Statsig Console: A user-friendly platform where customers and internal teams can interact with data, configure experiments, and monitor outcomes.

  • Real-time metric explorer: This tool provides immediate insights into key metrics, allowing for dynamic exploration and analysis.

  • Ad-hoc queries: For more customized analyses, users can perform ad-hoc queries, enabling deep dives into specific data subsets as needed.

This architecture ensures that Statsig’s data platform remains robust, scalable, and responsive to the diverse needs of our users.

Evolution of the Statsig data platform

When Statsig first launched, our data infrastructure was built on Databricks (Azure), a choice that worked well in the early days when we had a small set of customers.

Databricks provided a scalable, managed environment for data processing, but as our data volumes grew, inefficiencies started to emerge. The increasing scale of experiments and analytics placed significant pressure on both cost and performance, prompting us to rethink our architecture.

To address these challenges, we migrated to BigQuery, drawn by its scalability, ease of use, and fully managed nature. BigQuery allowed us to quickly analyze large datasets and significantly simplified our ETL processes. However, as adoption increased, so did costs.

The default settings made it easy to run expensive queries, leading to escalating cloud expenses. To manage this, we built custom tooling that optimized reservations and workload separation, reducing costs by 60% while maintaining performance.

As our data volume continued to expand—processing trillions of rows across thousands of customers—BigQuery began to show limitations in large-scale joins, particularly due to shuffle inefficiencies. Queries that required extensive data movement struggled with performance bottlenecks, leading us to explore alternative solutions.

We revisited Apache Iceberg with Spark and conducted a proof-of-concept (POC) using Iceberg's new storage-partitioned join support. The results were promising, significantly improving performance for massive jobs and reducing computational overhead.

Today, our platform operates in a hybrid model, leveraging BigQuery for analytical workloads and Spark/Iceberg for large-scale data processing.

This architecture allows us to optimize for both cost and performance, ensuring that each workload runs in the most efficient environment.

As we continue to scale, we remain committed to evaluating and adopting emerging technologies that enhance the efficiency, reliability, and cost-effectiveness of our data platform.

Lessons learned (and scaling strategies) for the Statsig data platform

As the Statsig data platform scaled to process 100+ petabytes of data daily, we faced significant technical challenges in performance, cost efficiency, and system flexibility.

To address these, we developed custom solutions tailored to our specific needs. Below are some of the most impactful optimizations and engineering decisions that enabled us to scale effectively.

Custom code generation with Statsig Builder Tool (SBT)

Initially, our pipelines were built using dbt, which served well in the early stages but proved too rigid as our data infrastructure evolved. The need for greater flexibility across SQL, Python, and Spark led us to develop SBT (Statsig Builder Tool)—a custom code generation framework that supports multi-language workflows.

SBT optimizes query execution, cost efficiency, and ETL performance, allowing us to build modular and efficient data pipelines while reducing operational overhead.
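
Since SBT is internal, the snippet below is a purely hypothetical illustration of the general idea behind templated code generation, not SBT's actual design or API.

```python
# Purely hypothetical illustration of templated SQL generation; SBT's actual
# design and API are internal to Statsig and not shown here.
from string import Template

DAILY_METRIC_SQL = Template("""
SELECT company_id,
       DATE(timestamp) AS ds,
       SUM($value_expr) AS $metric_name
FROM $source_table
WHERE DATE(timestamp) = '$ds'
GROUP BY company_id, ds
""")

def render_daily_metric(metric_name: str, value_expr: str, source_table: str, ds: str) -> str:
    """Render a parameterized daily-metric query for one pipeline step."""
    return DAILY_METRIC_SQL.substitute(
        metric_name=metric_name, value_expr=value_expr, source_table=source_table, ds=ds
    )

print(render_daily_metric("checkout_revenue", "CAST(value AS FLOAT64)",
                          "lake.raw_events", "2025-02-12"))
```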

Custom orchestration for multi-source ingestion

Our data platform ingests data from multiple sources, including real-time event streaming, scheduled batch imports, and ad-hoc customer uploads.

Traditional workflow orchestrators like Airflow and Dagster lacked the flexibility to handle these diverse ingestion methods efficiently.

To overcome this, we built a custom orchestration system that supports multi-timezone scheduling, workload separation, and dynamic resource allocation. This has enabled us to scale seamlessly across different workloads while maintaining cost efficiency.
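
As a simplified illustration of the multi-timezone scheduling piece, the sketch below decides when a company's local day has closed before triggering its daily run. The company registry and trigger condition are hypothetical.

```python
# Minimal sketch of multi-timezone scheduling: trigger a company's daily ETL
# once its local day has closed. The registry and trigger logic are hypothetical.
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

COMPANIES = {  # hypothetical registry: company -> reporting time zone
    "acme": "America/Los_Angeles",
    "globex": "Asia/Kolkata",
}

def due_daily_runs(now_utc: datetime) -> list[tuple[str, str]]:
    """Return (company, local_date) pairs whose local day just closed."""
    runs = []
    for company, tz in COMPANIES.items():
        local_now = now_utc.astimezone(ZoneInfo(tz))
        closed_day = (local_now - timedelta(days=1)).date()
        # Trigger shortly after local midnight so the full day of events has landed.
        if local_now.hour == 0:
            runs.append((company, closed_day.isoformat()))
    return runs

print(due_daily_runs(datetime.now(timezone.utc)))
```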

BigQuery-specific optimization

BigQuery provides a highly scalable and easy-to-use environment for running large-scale analytical workloads. However, as our data volume grew, cost efficiency and performance optimization became a key focus. We invested significant effort in fine-tuning our jobs and data models to achieve the best possible cost-performance ratio.

One of the most critical aspects of this optimization was understanding how BigQuery’s autoscaler works and correlating query patterns with GCP billing data. By building a detailed query lineage from the GCP billing dashboard, we identified how different queries contributed to costs and pinpointed inefficiencies.

To further optimize performance, we iteratively refined our data models, ensuring the right partitioning and clustering strategies were in place. This helped reduce unnecessary full table scans and improve query execution times.
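
For illustration, this is the kind of partitioning and clustering strategy we mean, expressed as DDL submitted through the google-cloud-bigquery client; the dataset, table, and column names are made up.

```python
# Minimal sketch of a partitioned and clustered BigQuery table definition.
# Dataset, table, and column names are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS analytics.exposure_events (
  company_id STRING,
  user_id STRING,
  experiment_id STRING,
  ts TIMESTAMP
)
PARTITION BY DATE(ts)                   -- prune scans to the dates a query actually needs
CLUSTER BY company_id, experiment_id    -- co-locate rows that are filtered together
"""
client.query(ddl).result()
```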

Additionally, we conducted multiple iterations to identify performance bottlenecks in our ETL jobs, redesigning them to align better with BigQuery’s execution engine and storage model.

Through these optimizations, we significantly improved query performance, reduced storage and compute costs, and enhanced overall system efficiency.

Data quality framework

Maintaining high data quality at scale is critical for ensuring reliable analytics and experimentation.

To detect anomalies and prevent bad data from propagating through our pipelines, we built a custom data quality framework that seamlessly integrates with our data pipeline DAGs.

Our framework is designed to automatically validate incoming data, apply custom rules tailored to each dataset, and either alert or block pipeline execution in case of quality issues. Over multiple iterations, we have fine-tuned thresholds and validation mechanisms based on company-specific requirements, ensuring a balance between false positives and genuine anomalies.

One of the key aspects of this system is its modular design, allowing us to easily add and modify validation rules as our data evolves. By integrating these checks directly into our ETL workflows, we ensure that bad data is caught before it impacts downstream analytics and experimentation results.
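
A minimal sketch of such a modular rule is shown below: it flags a dataset whose daily row count deviates from its trailing average by more than a company-specific threshold. The thresholds and the alert/block hooks are hypothetical.

```python
# Minimal sketch of a modular data quality rule; thresholds and hooks are hypothetical.
from dataclasses import dataclass
from statistics import mean

@dataclass
class RowCountRule:
    company_id: str
    max_deviation: float  # e.g. 0.7 tolerates a 70% weekend drop for some companies

    def check(self, todays_rows: int, trailing_rows: list[int]) -> bool:
        """Return True if today's volume is within the allowed deviation."""
        baseline = mean(trailing_rows)
        if baseline == 0:
            return todays_rows == 0
        deviation = abs(todays_rows - baseline) / baseline
        return deviation <= self.max_deviation

rule = RowCountRule(company_id="acme", max_deviation=0.5)
if not rule.check(todays_rows=1_200_000, trailing_rows=[4_800_000, 5_000_000, 4_900_000]):
    # In the real pipeline this would alert or block downstream DAG tasks.
    print("Row count anomaly detected: blocking downstream tasks")
```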

This framework has been instrumental in maintaining data integrity, reducing the operational burden on on-call engineers, and improving trust in our platform’s data reliability.

Through continuous iteration, we have optimized our anomaly detection algorithms and validation processes, making data quality a proactive, automated safeguard rather than a reactive firefighting effort.

Cumulative design for long-running experiments

Many of our customers run experiments that span several months, generating enormous datasets that require continuous processing.

Instead of scanning months of data repeatedly, we adopted a cumulative computation model that stores incremental results, drastically reducing query execution times and storage costs.
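
Conceptually, the cumulative model folds one new day of deltas into the running per-user totals rather than rescanning the whole experiment window. The sketch below shows the idea with illustrative paths and column names.

```python
# Minimal sketch of cumulative computation: merge one day of deltas into the
# running per-user totals. Paths and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cumulative-metrics").getOrCreate()

cumulative = spark.read.parquet("gs://example-lake/cumulative/exp_123/2025-02-11/")
daily = spark.read.parquet("gs://example-lake/daily/exp_123/2025-02-12/")

updated = (
    cumulative.unionByName(daily, allowMissingColumns=True)
    .groupBy("company_id", "experiment_id", "user_id")
    .agg(F.sum("metric_value").alias("metric_value"))
)

updated.write.mode("overwrite").parquet("gs://example-lake/cumulative/exp_123/2025-02-12/")
```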

Cost observability and optimization

BigQuery offers exceptional query performance, but without proactive cost management, expenses can quickly spiral out of control.

To ensure efficient resource utilization, we built a cost observability framework that allows us to:

  • Track cost per company and workload, enabling precise chargeback models

  • Identify anomalies and inefficiencies in query execution and storage usage

  • Optimize query routing by dynamically adjusting workloads to different BigQuery reservations based on compute needs

  • Conduct regular “war room” sessions and cost-focus weeks to continuously refine our optimization strategies
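
A simplified version of the cost-attribution piece can be built on BigQuery's INFORMATION_SCHEMA job metadata, as sketched below. It assumes queries are tagged with a `company` label and uses an assumed on-demand price per TiB rather than actual billing data; our production framework also joins against GCP billing exports.

```python
# Minimal sketch of per-company cost attribution from BigQuery job metadata.
# Assumes queries carry a `company` label; the price per TiB is an assumption.
from google.cloud import bigquery

client = bigquery.Client()
PRICE_PER_TIB = 6.25  # assumed on-demand analysis price; adjust for your contract

sql = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'company') AS company,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
GROUP BY company
ORDER BY tib_billed DESC
"""

for row in client.query(sql).result():
    tib = row.tib_billed or 0
    print(f"{row.company or 'untagged'}: {tib:.2f} TiB, approx ${tib * PRICE_PER_TIB:.2f}")
```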

Iceberg Storage Partition Joins for scalable data processing

One of the most expensive operations in distributed big data computing is data shuffling, particularly when joining massive datasets, as in one of our jobs where we join datasets exceeding 100 terabytes (2 trillion rows).

To mitigate shuffle inefficiencies, we implemented Iceberg Storage Partition Joins, which significantly reduce the network and compute overhead associated with large-scale joins. This improvement has allowed us to process massive datasets efficiently without incurring excessive costs.
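
For reference, the sketch below shows the Spark session settings that enable storage-partitioned joins with Iceberg (Spark 3.3+ and a recent Iceberg release). It assumes both tables are partitioned or bucketed on the join keys in the same way; the catalog and table names are illustrative.

```python
# Minimal sketch of enabling storage-partitioned joins with Spark + Iceberg.
# Assumes both tables share the same storage partitioning on the join keys.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spj-join")
    # Let Spark exploit the tables' existing storage partitioning...
    .config("spark.sql.sources.v2.bucketing.enabled", "true")
    .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
    # ...and have Iceberg report that partitioning to the planner.
    .config("spark.sql.iceberg.planning.preserve-data-grouping", "true")
    .config("spark.sql.requireAllClusterKeysForCoPartition", "false")
    .getOrCreate()
)

exposures = spark.table("lake.db.exposures")  # e.g. partitioned by bucket(N, user_id)
events = spark.table("lake.db.events")        # partitioned the same way

# With matching storage partitioning, this join can avoid a full shuffle.
joined = exposures.join(events, ["company_id", "user_id"])
joined.write.mode("overwrite").saveAsTable("lake.db.joined_results")
```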

Optimizing Spark performance with spot nodes

To further enhance cost efficiency in Spark workloads, we leverage spot nodes on Google Cloud Dataproc, Google’s managed Spark service.

By dynamically allocating low-cost, preemptible instances, we achieve substantial cost reductions without sacrificing performance. This optimization has been a game-changer, enabling us to run large-scale data transformations at a fraction of the cost of on-demand instances.
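
As an illustrative sketch (not our exact cluster setup), this is how preemptible secondary workers can be requested through the google-cloud-dataproc client; the project, region, and sizing values are placeholders.

```python
# Minimal sketch of a Dataproc cluster whose secondary workers are preemptible.
# Project, region, machine types, and instance counts are placeholders.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "example-project",
    "cluster_name": "etl-spot-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-8"},
        "worker_config": {"num_instances": 4, "machine_type_uri": "n2-standard-16"},
        # Secondary workers carry most of the burst capacity at preemptible prices.
        "secondary_worker_config": {
            "num_instances": 40,
            "preemptibility": dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "example-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```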

The results

Scaling the Statsig Data Platform has required constant iteration, architectural innovation, and deep cost optimization strategies.

By building custom orchestration tools, optimizing query execution, and leveraging efficient data storage and processing techniques, we have created a highly scalable and cost-effective experimentation platform.

As we continue to push the limits of big data infrastructure, we remain focused on exploring new technologies and refining our approach to deliver the best possible performance at scale.
