5. Statistical Thinking and Regression to the Mean
Teach essential statistical intuitions—regression, base rates, sample size—and how neglecting them creates persistent mistakes.
Interpreting Correlations and Causation — A Practical Guide
"This is the moment where the concept finally clicks." — yes, right here.
You're coming in hot from sections on sample size, the law of large numbers, and that deliciously dangerous duo: the illusion of validity and overfitting. Good — because every one of those topics describes the noise that makes correlations look like magic. Now we ask the harder, sexier question: when does a correlation actually tell us something about cause?
Why this matters
- Policy-makers, doctors, and product managers routinely act as if correlation implies causation. Sometimes it works. Often it doesn't — and the mistakes can cost lives, money, or your credibility.
- In Thinking, Fast and Slow terms: System 1 loves pattern and story; System 2 must throttle it. Correlations excite your intuitive storyteller. Causation demands the skeptical scientist.
A quick reminder: what correlation is (and isn’t)
Correlation measures covariation — whether two variables move together. The Pearson r quantifies linear correlation; r² tells you the proportion of variance in Y explained linearly by X.
- High r means the variables move together linearly. Low r means there is little linear association — they may still be related nonlinearly.
- Correlation is not direction. Two variables can be correlated because A causes B, B causes A, or a third variable C causes both.
Micro explanation: r vs r²
- r = 0.7 → positive association.
- r² = 0.49 → only 49% of variance in Y is linearly explained by X. Often people misread r as more explanatory than it is.
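The r-versus-r² distinction is easy to verify yourself. Here's a minimal sketch using simulated data (the coefficients and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X and Y with a known linear relationship plus plenty of noise.
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(size=1000)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
r_squared = r ** 2           # share of Y's variance linearly explained by X

print(f"r  = {r:.2f}")
print(f"r² = {r_squared:.2f}")
```

Notice how a respectable-looking r shrinks once you square it — a quick way to keep System 1 honest about how much X really explains.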
Two classic traps: Regression to the Mean and Spurious Correlations
If you learned about regression to the mean earlier (and you did), you know extreme outcomes tend to be followed by more average ones. That looks like causation if you don't control for it.
Example: A teacher gives extra help to students after a terrible test; the next test scores rise. Conclusion: the help worked. Alternative explanation: scores regressed toward the mean — the worst performers were unusually unlucky the first time.
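You can watch the teacher's "improvement" appear out of thin air with a simulation. The sketch below assumes scores are stable ability plus luck, and applies no intervention at all between tests (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

ability = rng.normal(100, 10, n)         # stable "true" skill
test1 = ability + rng.normal(0, 10, n)   # score = skill + luck
test2 = ability + rng.normal(0, 10, n)   # fresh luck, no tutoring anywhere

worst = test1 < np.percentile(test1, 10)  # bottom 10% on the first test
print(f"Worst group, test 1 mean: {test1[worst].mean():.1f}")
print(f"Worst group, test 2 mean: {test2[worst].mean():.1f}")
```

The bottom group's average rises on the second test purely because their first-test luck was unusually bad — exactly the pattern a well-meaning teacher would credit to the extra help.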
Spurious correlations are everywhere: ice cream sales and drownings correlate (both rise in summer) but neither causes the other — a confounder (season/temperature) causes both.
The causal checklist: How to go from "They’re correlated" to "A causes B"
Use this like a pre-flight checklist before you decide to change a policy.
- Temporal precedence: Does A happen before B? If not, A can't cause B.
- Covariation: Is there a reliable statistical association? (You've got correlation.)
- Rule out alternatives: Are there plausible confounders (C) causing both A and B?
- Mechanism: Is there a plausible causal pathway? Stories feel good; a plausible mechanism makes them believable.
- Replication and robustness: Does the relationship hold across samples, times, and model specifications?
- Prefer randomized evidence: Randomized controlled trials (RCTs) are the gold standard, because they randomize away confounders.
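Why does randomization earn the last spot on the checklist? Because a coin flip, by construction, can't be correlated with any confounder. A tiny sketch (the "baseline health" confounder is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

confounder = rng.normal(size=n)    # e.g., baseline health, unseen by us
treated = rng.random(n) < 0.5      # coin-flip assignment, ignores everything

# Randomization balances the confounder across arms (in expectation):
print(f"treated mean confounder: {confounder[treated].mean():+.3f}")
print(f"control mean confounder: {confounder[~treated].mean():+.3f}")
```

Both arms end up with essentially the same confounder distribution — including confounders nobody thought to measure, which is the part no amount of statistical adjustment can guarantee.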
If you can’t randomize, stronger quasi-experimental tools
- Natural experiments (policy changes, sudden shocks)
- Instrumental variables (IV): find a variable that shifts A but influences B only through A (and is otherwise unrelated to B)
- Difference-in-differences (DiD): compare changes over time between treated and control groups
- Regression discontinuity: exploit arbitrary cutoffs that assign treatment
These are the clever ways economists and epidemiologists mimic randomization when real RCTs aren’t possible.
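Difference-in-differences is the easiest of these to see in miniature. The sketch below uses hypothetical before/after averages for a treated city and a comparable control city; subtracting the control's change strips out the shared trend:

```python
# Hypothetical average outcomes (e.g., a crime rate) around a policy change.
treated = {"before": 50.0, "after": 38.0}   # policy city
control = {"before": 48.0, "after": 44.0}   # comparable city, no policy

change_treated = treated["after"] - treated["before"]  # policy effect + trend
change_control = control["after"] - control["before"]  # trend alone

did = change_treated - change_control  # the difference-in-differences estimate
print(f"DiD estimate of the policy effect: {did:.1f}")
```

The key assumption doing all the work is "parallel trends": absent the policy, the treated city would have drifted the same way the control city did. If that's shaky, so is the estimate.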
Confounding, Selection Bias, and Simpson’s Paradox (the drama queen of stats)
- Confounding: A confounder C influences both A and B. Example: Education (A) correlates with income (B), but innate ability and family background (C) affect both.
- Selection bias: Your sample is not representative. If you only look at successful startups, founders' traits that correlate with success may be misleading.
- Simpson’s paradox: Aggregated data shows one trend; sliced data reverses it. Famous example: a treatment seems effective overall but harmful within every subgroup — because groups differed in baseline risks.
Always ask: how was the sample chosen? What groups are being lumped together?
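Simpson's paradox is worth seeing with real numbers. This sketch uses the recovery counts from the classic kidney-stone study often used to illustrate the paradox: treatment A wins inside every subgroup, yet B looks better overall, because A was given mostly to the hard (large-stone) cases.

```python
# (recovered, total) by treatment and stone size — classic kidney-stone data.
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(recovered, total):
    return recovered / total

# Within each subgroup, A has the higher recovery rate...
for size in ("small", "large"):
    a, b = rate(*data[("A", size)]), rate(*data[("B", size)])
    print(f"{size} stones: A={a:.0%}, B={b:.0%}")

# ...yet aggregated, B appears to win.
a_overall = rate(81 + 192, 87 + 263)
b_overall = rate(234 + 55, 270 + 80)
print(f"overall: A={a_overall:.0%}, B={b_overall:.0%}")
```

The lurking variable (case severity) determined both the treatment assignment and the outcome — exactly the lumping-together the question above warns about.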
Cognitive biases that make us infer causation wrongly
Let's tie this back to Prospect Theory and what we learned about value and probability weighting.
- System 1 loves narratives and is loss-averse: it will latch onto correlations that support a compelling loss/gain story (prospect theory).
- Probability weighting means we overweight rare but vivid events — a single dramatic correlation gets more cognitive weight than dozens of null findings.
- Illusion of validity & overfitting: with enough variables and small samples, you’ll find patterns that look causal but are just noise.
Translation: your brain will overfit a causal story to a small dataset and call it Truth. Slow down.
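You can manufacture this illusion on demand: take a small sample, scan many candidate variables, and report the "winner". In the sketch below everything is pure noise, yet the best-looking correlation is impressively large (sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 20, 100                   # small sample, many candidate predictors

y = rng.normal(size=n)           # outcome: pure noise
X = rng.normal(size=(k, n))      # 100 predictors: also pure noise

# Correlate each predictor with y and keep the most flattering one.
rs = np.array([np.corrcoef(x, y)[0, 1] for x in X])
best = np.abs(rs).max()
print(f"Strongest 'pattern' found in pure noise: |r| = {best:.2f}")
```

This is the illusion of validity in one screenful: search enough variables in a small sample and a "strong" correlation is nearly guaranteed, causation or not.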
Practical heuristics: A quick decision rule when you see a correlation
- Ask for timing: which came first?
- Hunt for third variables: what else could explain both?
- Check sample size and variability (remember law of large numbers).
- Look for replication in other contexts.
- Prefer experiments; if not available, seek credible quasi-experiments.
Short mental script: "Could this be regression to the mean, confounding, selection bias, or reverse causality?" If any answer is yes, be cautious.
A tiny worked example (no math terror)
Scenario: A city introduces a new policing policy in January. Crime falls 20% by June. Mayor celebrates.
Questions to ask:
- Was there seasonal crime decline anyway? (confounder: season)
- Did crime fall everywhere—or only in neighborhoods where resources were already changing? (selection)
- Did reporting practices change? (measurement)
- Were similar declines observed in comparable cities without the policy? (replication/natural experiment)
If you can't rule these out, claiming causation is premature.
Key takeaways (the ones you’ll actually remember)
- Correlation ≠ causation. It’s not a motto; it’s a survival skill.
- Always consider temporal order, confounders, and mechanism.
- Use experiments or quasi-experiments when possible; otherwise be skeptical and look for replication.
- Your mind (System 1) will love a neat causal story. Use System 2 to check the checklist.
Memorable insight: A strong causal claim requires both a strong association and a strong reason not to be fooled.
Go forth and interrogate correlations like a polite but relentless detective.
Further reading and quick next steps
- Revisit the sections on Sample Size and Illusion of Validity — small samples + belief in patterns = spurious causation.
- If you liked the detective work, next dive into causal diagrams (DAGs) and simple instrumental variables — they’re the magnifying glasses of causal inference.