You know what it means for a result to be “statistically significant,” you know how to use and interpret p-values, and you can explain the concept of an “A/B test” in your sleep.
But there’s a whole range of advanced experimentation terms and statistical treatments that are a bit harder to understand.
To help, we’ve created a short guide on these terms & how to interpret them, with straightforward definitions, a guide on when to use them, and more.
Hopefully you’ll find these helpful! If you’re interested in exploring the ways that we’ve implemented these concepts, please check out our docs.
Happy experimenting!
Definition: The Bonferroni Correction is a statistical adjustment used to counteract the problem of multiple comparisons. When conducting multiple statistical tests, the likelihood of a false positive (incorrectly finding a significant result) increases. The Bonferroni Correction compensates for this by making the criteria for statistical significance more stringent.
Effect of the treatment: Applying the Bonferroni Correction reduces the probability of incorrectly declaring a result significant due to random chance. It raises the bar for statistical significance by lowering the p-value threshold each individual test must meet, thereby decreasing the likelihood of type I errors (false positives).
When to use it: The Bonferroni Correction should be used when conducting multiple statistical tests simultaneously, especially if there's a high risk of encountering false positives. It's commonly applied in fields like biomedical research, psychology, and any area where multiple hypotheses are being tested at once.
How it's applied: To apply the Bonferroni Correction, divide the standard significance level (usually α = 0.05) by the number of tests being conducted. For example, if you are conducting 10 tests, the new threshold for each test would be 0.05/10, or 0.005. Only results with a p-value at or below this stricter threshold are deemed significant.
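The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not Statsig's implementation: the function name `bonferroni_significant` and the sample p-values are made up for the example.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the per-test threshold and a significance flag for each p-value.

    Hypothetical helper: divides alpha by the number of tests, as described above.
    """
    num_tests = len(p_values)
    if num_tests == 0:
        raise ValueError("need at least one p-value")
    threshold = alpha / num_tests
    flags = [p <= threshold for p in p_values]
    return threshold, flags


# Ten illustrative p-values, one per metric or comparison (invented data).
p_values = [0.001, 0.004, 0.03, 0.2, 0.049, 0.6, 0.0005, 0.01, 0.8, 0.07]
threshold, flags = bonferroni_significant(p_values)

for p, is_significant in zip(p_values, flags):
    print(f"p = {p:<7} significant at corrected threshold: {is_significant}")
```

Note that 0.03 and 0.049 would pass an uncorrected 0.05 threshold but fail the corrected one, which is exactly the stricter behavior the correction is meant to produce.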
📖 Related reading: The Statsig Bonferroni Correction documentation.
Definition: Winsorization is a statistical technique used to limit extreme values in a data set to reduce the effect of potentially spurious outliers. It involves replacing the values beyond chosen percentiles at both ends of the distribution with the values at those percentiles. Unlike trimming, which discards outliers entirely, Winsorization keeps every data point and only caps its value.
Effect of the treatment: The primary effect of Winsorization is the reduction of the impact of outliers on statistical analyses. By limiting extreme values, this method makes the data set more robust against anomalies, potentially leading to more reliable statistical inferences, especially in the presence of skewed data.
When to use it: Winsorization is particularly useful when you want to lessen the influence of outliers but retain all data points in your analysis. It's often employed in scenarios where extreme values are believed to be genuine but disproportionately influential, such as in economic data, environmental measurements, or test scores.
How it's applied: To apply Winsorization, first decide on the percentiles at which you want to cap the data (common choices are the 5th and 95th percentiles). Then, replace all data points below the lower percentile with the value at that percentile, and all data points above the upper percentile with the value at that percentile. For example, when Winsorizing at the 5th and 95th percentiles, all values below the 5th percentile are set to the 5th percentile value, and all values above the 95th percentile are set to the 95th percentile value.
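As a rough sketch of the capping step, the Python below Winsorizes a list at the 5th and 95th percentiles. The helper names `percentile` and `winsorize` are hypothetical, and the linearly interpolated percentile is just one of several common percentile definitions, so exact cap values may differ slightly from other tools.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("empty data")
    position = (len(sorted_values) - 1) * q / 100.0
    lower_index = int(position)
    upper_index = min(lower_index + 1, len(sorted_values) - 1)
    fraction = position - lower_index
    low_value = sorted_values[lower_index]
    high_value = sorted_values[upper_index]
    return low_value + (high_value - low_value) * fraction


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles; no data points are dropped."""
    ordered = sorted(values)
    low_cap = percentile(ordered, lower_pct)
    high_cap = percentile(ordered, upper_pct)
    return [min(max(v, low_cap), high_cap) for v in values]


# Invented data: values 1..99 plus one extreme outlier.
data = list(range(1, 100)) + [10_000]
capped = winsorize(data)

print("max before:", max(data), "max after:", max(capped))
print("points kept:", len(capped), "of", len(data))
```

The outlier is pulled down to the 95th percentile value rather than removed, so the sample size is unchanged.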
📖 Related reading: The Statsig Winsorization documentation.
Definition: CUPED, or Controlled-experiment Using Pre-Experiment Data, is a statistical technique used to improve the efficiency of experimental results by reducing variance. It involves using data collected before the experiment (pre-experiment data) to adjust the post-experiment outcomes. This method is particularly effective in controlled experiments where baseline variability can obscure the true effect of the treatment.
Effect of the treatment: The effect of using CUPED is a significant reduction in the variance of the experiment's outcome measures. This leads to more precise estimates of the treatment effect and increases the power of the experiment, making it easier to detect a true effect if one exists.
When to use it: CUPED is particularly useful in experiments where there is high variability in baseline measurements that can mask the effects of the treatment. It's widely used in online A/B testing, clinical trials, and any scenario where pre-experiment data is available and relevant to the outcome of interest.
How it's applied: To apply CUPED, you first identify a covariate from the pre-experiment data that is correlated with the outcome of interest (often the same metric measured before the experiment started). Then, you estimate how strongly the covariate predicts the outcome and subtract that predicted portion from each unit's post-experiment value. Essentially, you're removing the part of the variance that can be explained by the pre-experiment data, leading to a more precise estimate of the treatment effect.
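One common way to write this adjustment is `adjusted = outcome - theta * (covariate - mean(covariate))`, with `theta = Cov(outcome, covariate) / Var(covariate)`. The Python below is a minimal sketch of that formula on simulated data. The function name `cuped_adjust` and the simulated numbers are assumptions for illustration, not Statsig's implementation; in a real experiment, `theta` and the covariate mean would typically be computed on data pooled across all groups.

```python
import random
from statistics import mean, variance


def cuped_adjust(outcome, covariate):
    """Return CUPED-adjusted outcomes and the estimated coefficient theta."""
    if len(outcome) != len(covariate) or len(outcome) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    x_bar = mean(covariate)
    y_bar = mean(outcome)
    n = len(outcome)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(covariate, outcome)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in covariate) / (n - 1)
    if var_x == 0:
        raise ValueError("covariate has zero variance")
    theta = cov_xy / var_x
    adjusted = [y - theta * (x - x_bar) for x, y in zip(covariate, outcome)]
    return adjusted, theta


# Simulated users: the post-experiment metric tracks the pre-experiment metric.
rng = random.Random(42)
pre = [rng.gauss(100, 20) for _ in range(500)]
post = [x + rng.gauss(0, 5) for x in pre]

adjusted, theta = cuped_adjust(post, pre)
print("variance before:", round(variance(post), 1))
print("variance after: ", round(variance(adjusted), 1))
```

Because the covariate is centered on its mean, the adjustment leaves the average of the metric unchanged and only shrinks its variance.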
📖 Related reading:
Definition: The Multi-Armed Bandit (MAB) problem is a scenario in decision theory and reinforcement learning where a fixed, limited set of resources must be allocated among competing choices in a way that maximizes expected gain under uncertainty. Each choice's properties are only partially known at the time of allocation and may become better understood as time passes.
Effect of the treatment: In the context of experimentation, the MAB approach leads to more efficient resource allocation, as it dynamically adjusts the allocation based on the performance of each option. It's a balance between exploration (trying out each option to gather more information) and exploitation (using the best-performing option to maximize reward).
When to use it: MAB is useful in situations where you have to make sequential decisions with incomplete information, and where the cost of learning about the environment is high. Common applications include online recommendation systems, clinical trials for finding effective treatments, and optimizing strategies in online advertising.
How it's applied: In a Multi-Armed Bandit scenario, you start by allocating resources evenly among all choices or based on initial estimates of their effectiveness. As you gather data on their performance, you adjust the allocation to favor more promising options. Several strategies, like epsilon-greedy, Upper Confidence Bound (UCB), or Thompson Sampling, can be used to balance the exploration-exploitation tradeoff. These methods help decide which 'arm' to pull (choice to make) in each round, based on past performance and the potential for discovering even better options.
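To make the allocation loop concrete, here is a minimal Thompson Sampling sketch in Python for arms with yes/no (conversion-style) rewards. The function name `thompson_sampling`, the simulated conversion rates, and the uniform Beta(1, 1) starting belief are all assumptions chosen for illustration; this is not a description of how Autotune works internally.

```python
import random


def thompson_sampling(true_rates, rounds=5000, seed=0):
    """Simulate Thompson Sampling; return how many times each arm was pulled.

    true_rates are hidden conversion rates used only to simulate rewards.
    """
    rng = random.Random(seed)
    num_arms = len(true_rates)
    successes = [0] * num_arms
    failures = [0] * num_arms

    for _ in range(rounds):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(num_arms)]
        # Pull the arm whose sampled rate is highest.
        arm = max(range(num_arms), key=lambda i: samples[i])
        # Observe a simulated reward and update that arm's counts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return [s + f for s, f in zip(successes, failures)]


pulls = thompson_sampling([0.05, 0.10, 0.30])
print("pulls per arm:", pulls)
```

Arms with little data have wide posteriors and still get sampled occasionally (exploration), while arms with strong observed performance win most draws (exploitation).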
📖 Related reading: Statsig's Autotune feature.
Definition: Stratified sampling is a method of sampling from a population. In this technique, the population is divided into subgroups, or strata, that share similar characteristics. Samples are then drawn from each stratum, often in proportion to the size of the strata in the overall population. This approach aims to capture key population characteristics in the sample, reducing sampling bias and improving representativeness.
Effect of the treatment: Stratified sampling leads to more representative samples, which are crucial for obtaining accurate and generalizable results. By ensuring each subgroup is adequately represented, this method reduces sampling error and provides a clearer, more accurate picture of the entire population.
When to use it: This method is particularly useful when the population has diverse subgroups, and the characteristics of these subgroups are important for the research questions at hand. Stratified sampling is commonly used in surveys and studies where demographic representativeness is crucial, such as in market research, public opinion polling, and epidemiological studies.
How it's applied: To apply stratified sampling, first identify the key subgroups (strata) in the population. These could be based on demographics like age, gender, income level, etc. Then, determine the proportion of each stratum in the overall population. Finally, sample from each stratum separately, typically in proportion to their size in the population. The samples from all strata are then combined to form a complete sample representative of the entire population.
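The steps above can be sketched in Python as follows. This is an illustrative example with proportional allocation: the function name `stratified_sample`, the `(user_id, plan)` tuples, and the plan names are invented for the example. Because per-stratum sizes are rounded, the total can differ slightly from the requested size when proportions do not divide evenly.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a sample whose strata proportions mirror those of the population."""
    rng = random.Random(seed)

    # Step 1: group units by stratum.
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)

    # Steps 2-3: sample from each stratum in proportion to its population share.
    total = len(population)
    sample = []
    for members in strata.values():
        k = round(sample_size * len(members) / total)
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Invented population: 600 free, 300 pro, and 100 enterprise users.
plans = ["free"] * 600 + ["pro"] * 300 + ["enterprise"] * 100
population = [(user_id, plan) for user_id, plan in enumerate(plans)]

sample = stratified_sample(population, lambda unit: unit[1], sample_size=100)
counts = Counter(plan for _, plan in sample)
print(counts)
```

A simple random sample of 100 could easily over- or under-represent the small enterprise group; sampling within each stratum fixes its share at the population proportion.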
📖 Related reading: Statsig's stratified sampling documentation.