Correlation vs causation: How not to get duped

Fri Jan 10 2025

"Correlation doesn’t imply causation"—you’ve probably heard this before, but why (and when) does it matter?

In data analysis, mistaking correlation for causation can lead to faulty conclusions and bad decisions. At Statsig, we deal with these challenges daily while helping teams run controlled experiments.

In this article, we’ll break down the key differences between correlation and causation, highlight common pitfalls, and share practical ways to ensure your insights are backed by solid evidence.

Related: Introducing experimental meta-analysis and the knowledge base.

Understanding correlation and causation

Let's start with the basics. Correlation measures how two variables move together, in both strength and direction. But here's the catch: just because two things are correlated doesn't mean one causes the other. Causation, on the other hand, means that one variable directly influences the other.
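To make that concrete, here's a minimal sketch in Python (using NumPy and SciPy, with made-up data) of what measuring a correlation looks like. Pearson's r runs from -1 to +1; values near 0 mean no linear relationship.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two made-up variables that tend to move together.
x = rng.normal(100, 20, size=500)
y = 0.5 * x + rng.normal(0, 15, size=500)

# Pearson's r: +1 is a perfect positive linear relationship,
# -1 a perfect negative one, 0 no linear relationship at all.
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.2f}")
```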

Distinguishing between correlation and causation is super important in data analysis. If we mix them up, we might end up with faulty conclusions and take actions that don't actually solve the problem. We're naturally inclined to see patterns and assume that one thing causes another—even when it doesn't.

Observational data can show us correlations, but it can't confirm causation. For example, there might be a correlation between exercise and skin cancer—but that doesn't mean exercising causes cancer. To really nail down causation, we need well-designed research like controlled experiments.

That's why critical thinking and solid methodology are key when interpreting data correlations. By being skeptical and analyzing carefully, we can avoid spreading misleading conclusions. Understanding the limits of correlational data helps us make better, more informed decisions.

Common pitfalls: mistaking correlation for causation

It's easy to misinterpret correlations and jump to the wrong conclusions. For instance, you might think that studying abroad causes improved career prospects because you see a correlation between the two. This error often gets amplified by misleading media headlines that confuse correlation with causation.

Another common pitfall is selection bias, which creeps in when the sample isn't representative of the general population. Maybe students who choose to study abroad are already more academically prepared, so they're more likely to succeed regardless of where they study.

Then there are lurking variables—hidden factors that influence both variables in a correlation, giving a false impression of causation. An observed correlation between exercise and skin cancer might actually be due to a third factor: sunlight exposure. People who exercise outdoors get more sun, which increases the risk of skin cancer.
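A quick simulation makes the lurking-variable effect visible. In this sketch (entirely synthetic data, not a real study), sunlight drives both exercise and skin-cancer risk. The two come out correlated even though neither causes the other, and the correlation collapses once sunlight is held roughly constant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2_000

# Lurking variable: weekly hours of sunlight exposure.
sunlight = rng.normal(10, 3, size=n)

# Exercise and skin-cancer risk each depend on sunlight,
# but not on each other.
exercise = 0.8 * sunlight + rng.normal(0, 2, size=n)
cancer_risk = 0.5 * sunlight + rng.normal(0, 2, size=n)

r, _ = stats.pearsonr(exercise, cancer_risk)
print(f"exercise vs cancer risk: r = {r:.2f}")  # clearly positive

# Hold sunlight (roughly) constant and the correlation vanishes.
mid_sun = (sunlight > 9) & (sunlight < 11)
r_adj, _ = stats.pearsonr(exercise[mid_sun], cancer_risk[mid_sun])
print(f"within a narrow sunlight band: r = {r_adj:.2f}")  # near zero
```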

So, to truly establish causation, we need well-designed empirical research, like controlled experiments and randomization. Without experimental evidence, assuming causation from correlation can be really misleading.

Establishing causation through proper methods

So how do we prove causation?

Enter the randomized controlled trial (RCT). By randomly assigning participants to treatment and control groups, RCTs minimize bias and isolate the causal effect of an intervention. They're pretty much the gold standard for establishing causal relationships.

Techniques like randomization, control groups, and double-blinding help make sure our causal inferences are valid. These methods let researchers control for confounding variables and keep external factors from messing with the results. Observational data alone just isn't enough for establishing causality, because it can't rule out other explanations for the correlations we see.

To show causation, experiments need to be carefully designed to isolate the effect of one variable on another. That means controlling for potential confounding factors and ensuring that the only difference between groups is the intervention we're testing. Without proper experimental design, even strong correlations can't be taken as evidence of causality.
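Here's a minimal sketch of those mechanics, with a hypothetical metric and a simulated +2 treatment effect. Because assignment is random, the treatment and control groups are alike in expectation on everything else, so a simple difference in means recovers the causal effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1_000

# Each user has a baseline outcome driven by unobserved traits.
baseline = rng.normal(50, 10, size=n)

# Randomly assign exactly half the users to treatment.
treated = rng.permutation(n) < n // 2

# Simulated ground truth: the intervention adds +2 on average.
outcome = baseline + np.where(treated, 2.0, 0.0) + rng.normal(0, 5, size=n)

effect = outcome[treated].mean() - outcome[~treated].mean()
t_stat, p_value = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"estimated effect = {effect:.2f}, p = {p_value:.3g}")
```

Randomization is doing the heavy lifting here: any confounder (like the baseline traits above) gets split evenly across groups on average, so it can't masquerade as a treatment effect.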

While observational studies can give us valuable insights and help generate hypotheses, they have built-in limitations when it comes to establishing causality. Issues like selection bias, confounding variables, and reverse causation make it tough to pin down the true cause-and-effect relationship. That's why controlled experiments are still the most reliable way to confirm causal links and validate our hypotheses.

At Statsig, we understand the importance of running proper experiments to establish causation. Our platform helps teams design and analyze controlled experiments, so you can confidently determine what works and what doesn't.

Tips to avoid being duped by misleading correlations

So, how can you avoid being fooled by dodgy correlations? First off, trace back to the original research sources. Don't rely solely on third-party summaries or media reports—they can sometimes misinterpret the findings.

When you're evaluating research, check the sample size and how participants were selected. Studies with large, randomly drawn samples are generally more reliable, and how large is large enough depends on the size of the effect being measured (see the sketch below). Non-random samples can lead to biased conclusions.
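As a rough illustration (assuming a two-sample t-test and using statsmodels' power calculator), you can back out how many participants a study needs to detect a given effect:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a "small" standardized
# effect (Cohen's d = 0.2) at alpha = 0.05 with 80% power.
n_per_group = TTestIndPower().solve_power(
    effect_size=0.2, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_group:.0f} participants per group")  # roughly 394
```

Halve the effect size and the required sample roughly quadruples, which is why studies chasing subtle effects need large samples.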

Also, watch out for p-hacking. That's when researchers test multiple outcomes but only report the significant ones. At a 0.05 threshold, roughly one in twenty truly null tests will look significant by pure chance, so selective reporting can turn noise into "findings."
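To see the mechanics, here's a simulation in which every metric is pure noise (there are no real effects anywhere). Test enough outcomes and something will cross p < 0.05 by luck alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 20 metrics, none of which actually differs between groups.
significant = []
for metric in range(20):
    group_a = rng.normal(0, 1, size=200)
    group_b = rng.normal(0, 1, size=200)  # same distribution: null is true
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        significant.append(metric)

# At alpha = 0.05, about 1 in 20 null tests "succeeds" by chance.
print(f"'significant' metrics out of 20: {len(significant)}")
```

Report only the winners from a sweep like this and you've manufactured a result from noise, which is exactly why pre-specified metrics and multiple-testing corrections matter.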

Think about the motivation behind the research. Are there potential biases or conflicts of interest that might influence the conclusions? Studies funded by interested parties, or those that align a bit too perfectly with expected outcomes, deserve extra scrutiny.

Lastly, don't be intimidated by complex statistical jargon or flashy presentations. Sometimes, the more complicated the stats, the less clear the relationship—and a lack of statistical detail can be a red flag. By keeping these tips in mind, you can better navigate the world of correlations and avoid being duped by questionable research.

At Statsig, we're all about making data-driven decisions without getting tripped up by misleading correlations. Our tools help you conduct robust experiments and interpret results confidently.

Closing thoughts

Grasping the difference between correlation and causation is key to making sound, data-backed decisions. By recognizing common pitfalls and knowing how to establish causation properly, we can avoid being misled by spurious relationships.

If you're eager to learn more, there are plenty of resources out there on experimental design and statistical analysis. And if you want to see how Statsig can help you run better experiments, feel free to check out our platform.

Request a demo

Statsig's experts are on standby to answer any questions about experimentation at your organization.