The Statsig Pulse results page offers a snapshot of all the metric movements driven by an experiment. Sometimes, a brief scan of the color coded score card is enough to validate that all metrics behave as expected, and we quickly proceed with the launch. Other times, however, a more detailed understanding is required before deciding on next steps.
Time series charts can reveal insights otherwise hidden in fully aggregated results, such as seasonality and novelty effects. Different types of time series are available, and which one we use depends on the question we want to answer. Here we share an insider's guide to this Pulse feature: how to use it, and why.
The days-since-exposure time series shows the metric impact broken down by the number of days a user has been in the experiment. It's the best way to answer questions like:
Does my experiment have a novelty effect? Do users try out the new feature once and never again?
Is there pre-experiment bias in this metric? Was that lift there even before we launched the feature?
Day 0 is the day a user becomes part of the experiment, which is often the first time they see the new feature. Metric deltas that are significant early on and turn neutral with increasing tenure are indicative of a novelty effect: users engage with a feature because it's new and they're curious. Once they've tried it, they lose interest and the impact is not sustained in the long run.
In the example below, moving a button to a more prominent location increased the number of clicks by 2,000%, but only on Day 0. After that, the effect is neutral. If we were hoping for a sustained lift, we should think twice before shipping this change.
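To make the mechanics concrete, here is a minimal stdlib-only sketch (not Statsig's implementation) of how a days-since-exposure breakdown can be computed: each observation is keyed by the user's tenure relative to first exposure, then test and control means are compared per tenure bucket. The row schema and values are hypothetical.

```python
from statistics import mean

# Hypothetical rows: (user_id, group, days_since_exposure, metric_value).
rows = [
    ("u1", "test", 0, 5.0), ("u1", "test", 1, 1.0),
    ("u2", "control", 0, 1.0), ("u2", "control", 1, 1.0),
    ("u3", "test", 0, 4.0), ("u3", "test", 1, 1.2),
    ("u4", "control", 0, 1.2), ("u4", "control", 1, 0.9),
]

def lift_by_tenure(rows):
    """Return {days_since_exposure: relative lift of test over control}."""
    buckets = {}  # tenure -> {"test": [...], "control": [...]}
    for _, group, tenure, value in rows:
        buckets.setdefault(tenure, {"test": [], "control": []})[group].append(value)
    return {
        tenure: mean(g["test"]) / mean(g["control"]) - 1.0
        for tenure, g in sorted(buckets.items())
    }

lifts = lift_by_tenure(rows)
# A large Day 0 lift that vanishes at higher tenures suggests a novelty effect.
```

In this toy data the Day 0 lift is several hundred percent while Day 1 is roughly neutral, which is exactly the novelty-effect signature described above.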
Setting Key Metrics for an experiment unlocks an additional benefit of the days since exposure chart. For this set of metrics, we also show the impact during the 7 days prior to a user joining the experiment. This is a convenient way to check whether there was a difference between the test and control groups even before the experiment started.
Imagine we’re dealing with a metric that shows a significant regression. Naturally, we wonder whether this is truly caused by our experiment, or perhaps we got unlucky in our group allocation. The chart below shows that the difference between test and control is neutral before the experiment starts, suddenly drops on Day 0, and remains negative on subsequent days. With this, we can rule out pre-experiment bias as the root cause.
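The pre-experiment check boils down to a standard two-sample comparison on metric values collected before exposure. Below is a hedged, stdlib-only sketch using Welch's t-statistic; the data and names are illustrative, not Statsig's methodology.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(test, control):
    """Welch's t-statistic for the difference in means of two samples."""
    nt, nc = len(test), len(control)
    se = sqrt(variance(test) / nt + variance(control) / nc)
    return (mean(test) - mean(control)) / se

# Hypothetical pre-period metric values per user.
pre_test = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3]
pre_control = [1.1, 0.9, 1.0, 1.2, 1.0, 1.1]

t = welch_t(pre_test, pre_control)
# |t| well below ~2 means no detectable pre-experiment difference
# between the groups, supporting the "no pre-experiment bias" conclusion.
```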
The daily time series shows the metric impact on each calendar day without aggregating days together. It's a good one to check if we have concerns such as:
Does the feature have a different impact on weekends vs. weekdays?
Did yesterday’s server crash impact our experiment?
The daily time series also provides some insight into the variability of the effect day over day. When a metric has a statistically significant effect that we can’t explain, it reveals whether this effect is consistent, or primarily driven by one or two outlier days. In the latter scenario, we may choose to run the experiment for an additional week or investigate what happened on those days.
Below is an example of a metric that, unexpectedly, showed statistically significant lift. The daily time series shows that the metric is quite noisy and neutral on most days, but April 27 is a significant outlier. We take this lift with a grain of salt, knowing that it’s likely a false positive caused by random noise.
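One simple way to operationalize the "one outlier day" diagnosis is to flag days whose daily lift sits far from the rest. This is a rough sketch with made-up numbers, not how Pulse computes its results; the z-score cutoff is an illustrative choice.

```python
from statistics import mean, stdev

daily_lift = {  # calendar day -> observed lift (test vs. control)
    "Apr 24": 0.01, "Apr 25": -0.02, "Apr 26": 0.00,
    "Apr 27": 0.30, "Apr 28": 0.01, "Apr 29": -0.01,
}

def outlier_days(daily, z_cutoff=1.5):
    """Flag days whose lift is more than z_cutoff std devs from the mean."""
    values = list(daily.values())
    mu, sigma = mean(values), stdev(values)
    return [day for day, v in daily.items() if abs(v - mu) > z_cutoff * sigma]

flagged = outlier_days(daily_lift)
# If a single day like "Apr 27" drives the aggregate lift, treat the
# overall result with skepticism, as in the example above.
```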
Another valuable use case for the daily time series is monitoring and evaluating holdouts, which are used to measure the impact of many features typically released over the course of several months.
While the daily time series often looks noisy and can have large confidence intervals, a cumulative view reveals how the aggregated metric lift and confidence intervals evolve over time as the experiment progresses. This comes in handy when wondering:
Do we expect confidence intervals to shrink if we run the experiment longer?
The behavior of confidence intervals over time depends on several factors: Influx of new users into the experiment, variance of the metric, sensitivity to user tenure, etc. The cumulative time series helps inform whether waiting longer could help gain higher confidence in the results.
The chart below shows how the confidence intervals for this metric are reduced by half during the first week of the experiment. It's also evident that both the effect and confidence intervals have been stable for the past few weeks, and we're unlikely to gain new insights by running the experiment longer.
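The underlying reason intervals shrink, then plateau, is that the standard error of a mean scales as 1/sqrt(n). This toy calculation (not Statsig's methodology) shows the approximate 95% CI half-width at growing sample sizes, assuming a fixed metric standard deviation:

```python
from math import sqrt

def ci_half_width(std_dev, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean."""
    return z * std_dev / sqrt(n)

widths = {n: ci_half_width(1.0, n) for n in (1_000, 4_000, 16_000)}
# Quadrupling the sample size halves the interval, so early in an
# experiment intervals shrink quickly and later gains taper off.
```

This is why waiting longer helps most when new users are still flowing into the experiment, and helps little once enrollment has flattened.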
Diving into time series, we may be concerned about information overload. The metric lifts in Pulse are straightforward to interpret, but slicing and dicing by days introduces gray areas and opens the door to p-hacking. Keep in mind that this tool exists to help check your assumptions, not to scavenge for impact or to make every decision bulletproof.
In online experimentation we want to move fast without overlooking key data points that might lead us in a different direction. How deep we go in the analysis depends on the scope of the decision and how much weight we place on specific results. Pulse time series are readily available to ease the burden of these deep dives. Be sure to check them out as needed, keeping in mind some Do's and Don'ts.
Here’s how to get to the time series views in Pulse:
Explore Statsig’s smart feature gates with built-in A/B tests, or create an account instantly and start optimizing your web and mobile applications. You can also schedule a live demo or chat with us to design a custom package for your business.