Aamodit Acharya
Data Science Intern, Statsig

How we created Count Distinct in Statsig Cloud

Thu Aug 28 2025

When I joined Statsig, I spent my first week reading through customer requests. Almost immediately, a pattern jumped out to me. Teams kept asking some version of the same question: how many unique things did a user interact with over time?

Unique artists in the first 7 days. Unique brands per buyer. Unique features used after onboarding. The need was clear. Distinct counts had to work in Experiments and Pulse with a single, reliable definition.

If you have asked “How many unique songs did a listener play last week?” or “How many unique users bought from each brand?”, you needed a distinct count. Our new Count Distinct metric on Statsig Cloud gives you that answer quickly and consistently.

With Count Distinct, you can turn raw events into signals like:

  • Distinct artists listened per user

  • Distinct SKUs purchased per user

  • Distinct search queries issued per user

  • Distinct repositories pushed per user

  • Distinct merchants paid per user

To do so, define the event and the unique field to count, choose a time window, add dimensions (optional), and you're done. The same definition then flows into Experiments and Pulse.

The pattern behind the questions

Breadth matters. Distinct counts capture variety and exploration. A streaming app can measure catalog breadth per user. A marketplace can see how many brands a buyer considered. A SaaS product can track how many projects an account actually uses.

Simple to set up. Create a metric in the UI, choose the event and property, and ship.

One definition everywhere. Same answer in Experiments and Pulse.

Built for decisions. Use Count Distinct to evaluate changes that should increase discovery or variety, not just total volume.

Why now

Our cloud pipelines were hyper-optimized to compute on top of daily aggregates. That design is perfect for sums and averages, where you can add yesterday to today. Distinct counts break that assumption because you must dedupe across days.

If you simply add daily distincts, you overcount. We needed a representation that carries forward what has already been seen while staying small and fast. Sketches give exactly that.

The tool that made it click: sketches

In data science lingo, a sketch is a probabilistic summary that answers certain queries approximately, very fast, and with little memory. The flow is simple: create, merge, extract.

Create. Build a sketch per user per day. This sketch holds that day’s distinct values.

Merge. Combine sketches to dedupe across a timeframe.

  • Mon: viewed {A, B}

  • Tue: viewed {B, C}

  • Wed: viewed {A}

    If you summed daily distincts you would get 2 + 2 + 1 = 5.

    Merging the three sketches yields {A, B, C}, which is 3.

Extract. Turn the binary sketch into a numeric estimate of the distinct count. That number then feeds mean, variance, and confidence interval calculations.

This pattern makes multi-day reads fast and consistent.
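The Mon/Tue/Wed example above can be sketched in a few lines, with plain Python sets standing in for real sketches. This is a toy illustration: production sketches like HyperLogLog are probabilistic and far more compact, but the create/merge/extract shape is the same.

```python
# Toy create/merge/extract flow with exact sets standing in for sketches.
# Real sketches (e.g. HyperLogLog) are probabilistic, but the shape is the same.

def create(day_events):
    """Create: build a per-day 'sketch' of that day's distinct values."""
    return set(day_events)

def merge(sketches):
    """Merge: combine daily sketches, deduping across the window."""
    out = set()
    for s in sketches:
        out |= s
    return out

def extract(sketch):
    """Extract: turn the sketch into a number for downstream stats."""
    return len(sketch)

days = [["A", "B"], ["B", "C"], ["A"]]        # Mon, Tue, Wed views
daily = [create(d) for d in days]

naive_sum = sum(extract(s) for s in daily)    # 2 + 2 + 1 = 5, overcounts
merged = extract(merge(daily))                # {A, B, C} -> 3

print(naive_sum, merged)  # 5 3
```

Summing daily distincts double-counts B and A; merging first dedupes them, which is exactly why the sketch has to carry forward what has already been seen.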

How I built it

I wanted the customer experience to be simple and the pipeline to be reliable. Here is how I got there.

Prototype where the data already lives

A lot of our upstream modeling for Statsig metric pipelines runs in BigQuery, so the first version lived there. I produced daily per-user sketches for each metric, then merged them for group results. Reads were fast, long rollups stayed stable, and checking against exact sample counts was straightforward.

Why I moved to Spark

Many of our downstream models and checks run in Spark, so building Count Distinct there made the flow simpler and easier for debugging. It was also a deliberate step in our initiative to move all pipelines to Spark.

Moreover, BigQuery and Spark store sketches differently, so I built a native Spark version that matches BigQuery’s results, using a small set of Spark helpers and UDFs to create, merge, and estimate from the sketches.

Making the switch without changing the customer experience

The main hurdle was sketch portability. Sketches produced in BigQuery are not directly readable in Spark, and the two engines expose different functions.

To keep results the same for customers, I wrote engine-specific wrappers and UDFs in Spark that mirror BigQuery behavior. I validated them on shared samples until the two paths agreed within the expected error.

Our bridge between engines

  • I kept the core model in Spark SQL and stored each day’s sketch as a base64 string in Parquet on GCS so it can safely move through BigQuery tables when needed.

  • On the Spark side, I decode that field back into a native sketch and continue merges and extraction with the Spark UDFs and helpers.

  • The wrappers hide engine differences — so the definition stays stable and the numbers match.

This preserved the daily merge pattern and avoided surprises during the shift to Spark.
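The portability trick reduces to a short round trip. The payload below is a made-up stand-in for a serialized sketch, not BigQuery's or Spark's actual wire format; the point is only that arbitrary binary survives as base64 text.

```python
import base64

# Hypothetical serialized sketch bytes. Real engines emit an opaque binary
# blob; this placeholder just illustrates the round trip.
sketch_bytes = b"\x00\x01\xfe\xff example-sketch-payload"

# Encode to base64 text so the sketch can live in a string column and
# travel safely through Parquet on GCS and BigQuery tables.
encoded = base64.b64encode(sketch_bytes).decode("ascii")

# On the Spark side, decode back to bytes before handing the blob to the
# sketch-merging UDFs.
decoded = base64.b64decode(encoded)

assert decoded == sketch_bytes  # lossless round trip
```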

Keep it incremental and dependable

Distinct counts do not follow a simple running total. To keep jobs predictable and backfills straightforward, I store one sketch per user per day and merge at rollup time. That keeps experiment reads steady and Pulse views consistent.
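One way to picture that layout, again with exact sets as stand-ins for sketches: keep one sketch keyed by (user, day), merge only at rollup time, and a backfill just replaces the affected day's sketch before the next merge.

```python
# One 'sketch' (here an exact set) per user per day; merge at rollup time.
store = {
    ("user_1", "2025-08-25"): {"A", "B"},
    ("user_1", "2025-08-26"): {"B", "C"},
    ("user_1", "2025-08-27"): {"A"},
}

def rollup(store, user, days):
    """Merge a user's daily sketches across the requested window."""
    merged = set()
    for day in days:
        merged |= store.get((user, day), set())
    return len(merged)

window = ["2025-08-25", "2025-08-26", "2025-08-27"]
before = rollup(store, "user_1", window)   # 3 distinct items

# A backfill touches only the corrected day's sketch; the rollup simply
# recomputes from the new inputs, with no running total to unwind.
store[("user_1", "2025-08-26")] = {"B", "C", "D"}
after = rollup(store, "user_1", window)    # 4 distinct items

print(before, after)  # 3 4
```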

During testing

I saw near-zero deltas at small cardinalities because the sketch stayed in sparse mode, which acts like a short guest list where you write down every unique name and get an exact count. As volume grew, the algorithm crossed its internal threshold into dense mode, which is more like switching to a compact tally card that is fast and space-efficient. At that point, the estimate picked up a small, predictable error that converged to what the documentation describes.
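That sparse-to-dense behavior can be shown with a minimal HyperLogLog-style sketch. This is an illustrative implementation, not the algorithm BigQuery or Spark actually ships: it stays exact in a small set (the guest list), then flips to registers past a threshold (the tally card) and accepts a small, bounded error.

```python
import hashlib
import math

class ToySketch:
    """Illustrative HLL-style sketch: exact 'sparse' set, then dense registers."""

    def __init__(self, p=12, sparse_limit=512):
        self.p = p                    # precision: 2^p registers in dense mode
        self.m = 1 << p
        self.sparse_limit = sparse_limit
        self.sparse = set()           # guest list: exact while small
        self.registers = None         # tally card: set once we go dense

    def add(self, value):
        if self.registers is None:
            self.sparse.add(value)
            if len(self.sparse) > self.sparse_limit:
                self._go_dense()
        else:
            self._add_dense(value)

    def _go_dense(self):
        self.registers = [0] * self.m
        for v in self.sparse:
            self._add_dense(v)
        self.sparse = None

    def _add_dense(self, value):
        h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        w = 64 - self.p
        rest = h & ((1 << w) - 1)                # remaining bits
        rank = w - rest.bit_length() + 1 if rest else w + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        if self.registers is None:
            return float(len(self.sparse))       # sparse mode: exact count
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:        # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw

s = ToySketch()
for i in range(100):
    s.add(f"item_{i}")
small_est = s.estimate()   # still sparse: exactly 100.0

for i in range(10_000):
    s.add(f"item_{i}")
dense_est = s.estimate()   # dense: close to 10,000, within a few percent
```

Below the threshold the answer is exact; past it, the estimate carries the small, predictable error described in the sketch-library docs.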

Precision was a clear tradeoff between cost and accuracy. I ran simulations across precision levels and spoke with customers about how they plan to use the metric, then chose a default that keeps early windows near-exact while preserving fast reads. The result is a setting that matches real usage and helps teams get the most out of the feature.

What I learned

  • Speed has a cost. Sketches are fast to read, but they are computationally intensive to build and merge. They work best in a hyper-optimized environment that keeps compute close to the data and minimizes joins. That setup makes writes heavier and reads trivial.

  • Small counts look exact. For small cardinalities relative to precision, the sketch delta was near zero because the sketch stays in sparse mode and keeps an exact count. Once it flips to dense mode, the error rate rises to what the Spark and BigQuery docs describe.

  • Precision is a knob. Higher precision means more compute, slower pipelines, and larger sketches, but better accuracy. Lower precision means faster reads, fewer resources, and smaller storage, but more error.

  • Portability matters. Keeping the sketch as a binary blob lets it move between Spark and BigQuery.

  • Room to grow. Sketches open the door. Our Count Distinct approach gives Statsig a fast path to percentile-based metrics.
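The precision knob follows directly from HyperLogLog's published standard error, roughly 1.04 / sqrt(2^p) for precision p: each extra precision bit doubles the register count but only shaves the error by a factor of sqrt(2).

```python
import math

def hll_rel_error(p):
    """Approximate HLL relative standard error for precision p (m = 2^p registers)."""
    return 1.04 / math.sqrt(1 << p)

# More registers -> more memory and compute, but smaller error.
for p in (10, 12, 14, 16):
    print(f"p={p:2d}  registers={1 << p:6d}  ~error={hll_rel_error(p):.2%}")
```

At p=12, for example, the error is around 1.6%; jumping to p=16 costs 16x the registers to get it near 0.4%, which is the cost-versus-accuracy curve the simulations explored.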

What this is not

This is not a replacement for your daily “users who triggered event” metric. Count Distinct is for unique items tied to an entity, such as unique songs per user or unique users per product.

What you can do today

Experimentation

  • Ask “Did the new playlist page increase unique artists played in the first 7 days?”

  • Compare treatment and control on unique products purchased per buyer.

  • Evaluate unique features used per account after onboarding changes.

Getting started

  1. Create a metric. In Metrics, click Create and pick Count Distinct.

  2. Choose the input. Select the event and the property you care about, for example song_id, sku, brand_id, or error_code.

  3. Add dimensions. Platform, country, or product line, if you need them.

  4. Use it everywhere. The same metric works in Experiments and Pulse.

The bottom line

Count Distinct on Statsig Cloud makes “How many unique X?” a one-click metric. It is fast, consistent, and ready for Experiments and Pulse. You get a single definition that scales, works across time and partitions, and stays true to the question you asked.


