5. Statistical Thinking and Regression to the Mean
Teach essential statistical intuitions—regression, base rates, sample size—and how neglecting them creates persistent mistakes.
Interpreting Correlations and Causation — A Practical Guide
"This is the moment where the concept finally clicks." — yes, right here.
You're coming in hot from sections on sample size, the law of large numbers, and that deliciously dangerous duo: the illusion of validity and overfitting. Good — because every one of those topics describes the noise that makes correlations look like magic. Now we ask the harder, sexier question: when does a correlation actually tell us something about cause?
Why this matters
- Policy-makers, doctors, and product managers routinely act as if correlation implies causation. Sometimes it works. Often it doesn't — and the mistakes can cost lives, money, or your credibility.
- In Thinking, Fast and Slow terms: System 1 loves pattern and story; System 2 must throttle it. Correlations excite your intuitive storyteller. Causation demands the skeptical scientist.
A quick reminder: what correlation is (and isn’t)
Correlation measures covariation — whether two variables move together. The Pearson r quantifies linear correlation; r² tells you the proportion of variance in Y explained linearly by X.
- High r means the variables move together linearly. Low r means there is little linear association — they may still be related nonlinearly.
- Correlation is not direction. Two variables can be correlated because A causes B, B causes A, or a third variable C causes both.
Micro explanation: r vs r²
- r = 0.7 → positive association.
- r² = 0.49 → only 49% of variance in Y is linearly explained by X. Often people misread r as more explanatory than it is.
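The r-versus-r² distinction is easy to verify yourself. Here's a minimal sketch using simulated data (the coefficients and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X and Y with a known linear relationship plus plenty of noise.
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(size=1000)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
r_squared = r ** 2           # share of Y's variance linearly explained by X

print(f"r  = {r:.2f}")
print(f"r² = {r_squared:.2f}")
```

Notice how a respectable-looking r shrinks once you square it — a quick way to keep System 1 honest about how much X really explains.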
Two classic traps: Regression to the Mean and Spurious Correlations
If you learned about regression to the mean earlier (and you did), you know extreme outcomes tend to be followed by more average ones. That looks like causation if you don't control for it.
Example: A teacher gives extra help to students after a terrible test; the next test scores rise. Conclusion: the help worked. Alternative explanation: scores regressed toward the mean — the worst performers were unusually unlucky the first time.
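You can watch the teacher's "improvement" appear out of thin air with a simulation. The sketch below assumes scores are stable ability plus luck, and applies no intervention at all between tests (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

ability = rng.normal(100, 10, n)         # stable "true" skill
test1 = ability + rng.normal(0, 10, n)   # score = skill + luck
test2 = ability + rng.normal(0, 10, n)   # fresh luck, no tutoring anywhere

worst = test1 < np.percentile(test1, 10)  # bottom 10% on the first test
print(f"Worst group, test 1 mean: {test1[worst].mean():.1f}")
print(f"Worst group, test 2 mean: {test2[worst].mean():.1f}")
```

The bottom group's average rises on the second test purely because their first-test luck was unusually bad — exactly the pattern a well-meaning teacher would credit to the extra help.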
Spurious correlations are everywhere: ice cream sales and drownings correlate (both rise in summer) but neither causes the other — a confounder (season/temperature) causes both.
The causal checklist: How to go from "They’re correlated" to "A causes B"
Use this like a pre-flight checklist before you decide to change a policy.
- Temporal precedence: Does A happen before B? If not, A can't cause B.
- Covariation: Is there a reliable statistical association? (You've got correlation.)
- Rule out alternatives: Are there plausible confounders (C) causing both A and B?
- Mechanism: Is there a plausible causal pathway? Stories feel good; a plausible mechanism makes them believable.
- Replication and robustness: Does the relationship hold across samples, times, and model specifications?
- Prefer randomized evidence: Randomized controlled trials (RCTs) are the gold standard, because they randomize away confounders.
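Why does randomization earn the last spot on the checklist? Because a coin flip, by construction, can't be correlated with any confounder. A tiny sketch (the "baseline health" confounder is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

confounder = rng.normal(size=n)    # e.g., baseline health, unseen by us
treated = rng.random(n) < 0.5      # coin-flip assignment, ignores everything

# Randomization balances the confounder across arms (in expectation):
print(f"treated mean confounder: {confounder[treated].mean():+.3f}")
print(f"control mean confounder: {confounder[~treated].mean():+.3f}")
```

Both arms end up with essentially the same confounder distribution — including confounders nobody thought to measure, which is the part no amount of statistical adjustment can guarantee.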
If you can’t randomize, stronger quasi-experimental tools
- Natural experiments (policy changes, sudden shocks)
- Instrumental variables (IV): find a variable that shifts A but influences B only through A (and is otherwise unrelated to B)
- Difference-in-differences (DiD): compare changes over time between treated and control groups
- Regression discontinuity: exploit arbitrary cutoffs that assign treatment
These are the clever ways economists and epidemiologists mimic randomization when real RCTs aren’t possible.
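Difference-in-differences is the easiest of these to see in miniature. The sketch below uses hypothetical before/after averages for a treated city and a comparable control city; subtracting the control's change strips out the shared trend:

```python
# Hypothetical average outcomes (e.g., a crime rate) around a policy change.
treated = {"before": 50.0, "after": 38.0}   # policy city
control = {"before": 48.0, "after": 44.0}   # comparable city, no policy

change_treated = treated["after"] - treated["before"]  # policy effect + trend
change_control = control["after"] - control["before"]  # trend alone

did = change_treated - change_control  # the difference-in-differences estimate
print(f"DiD estimate of the policy effect: {did:.1f}")
```

The key assumption doing all the work is "parallel trends": absent the policy, the treated city would have drifted the same way the control city did. If that's shaky, so is the estimate.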
Confounding, Selection Bias, and Simpson’s Paradox (the drama queen of stats)
- Confounding: A confounder C influences both A and B. Example: Education (A) correlates with income (B), but innate ability and family background (C) affect both.
- Selection bias: Your sample is not representative. If you only look at successful startups, founders' traits that correlate with success may be misleading.
- Simpson’s paradox: Aggregated data shows one trend; sliced data reverses it. Famous example: a treatment seems effective overall but harmful within every subgroup — because groups differed in baseline risks.
Always ask: how was the sample chosen? What groups are being lumped together?
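Simpson's paradox is worth seeing with real numbers. This sketch uses the recovery counts from the classic kidney-stone study often used to illustrate the paradox: treatment A wins inside every subgroup, yet B looks better overall, because A was given mostly to the hard (large-stone) cases.

```python
# (recovered, total) by treatment and stone size — classic kidney-stone data.
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(recovered, total):
    return recovered / total

# Within each subgroup, A has the higher recovery rate...
for size in ("small", "large"):
    a, b = rate(*data[("A", size)]), rate(*data[("B", size)])
    print(f"{size} stones: A={a:.0%}, B={b:.0%}")

# ...yet aggregated, B appears to win.
a_overall = rate(81 + 192, 87 + 263)
b_overall = rate(234 + 55, 270 + 80)
print(f"overall: A={a_overall:.0%}, B={b_overall:.0%}")
```

The lurking variable (case severity) determined both the treatment assignment and the outcome — exactly the lumping-together the question above warns about.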
Cognitive biases that make us infer causation wrongly
Let's tie this back to Prospect Theory and what we learned about value and probability weighting.
- System 1 loves narratives and is loss-averse: it will latch onto correlations that support a compelling loss/gain story (prospect theory).
- Probability weighting means we overweight rare but vivid events — a single dramatic correlation gets more cognitive weight than dozens of null findings.
- Illusion of validity & overfitting: with enough variables and small samples, you’ll find patterns that look causal but are just noise.
Translation: your brain will overfit a causal story to a small dataset and call it Truth. Slow down.
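You can manufacture this illusion on demand: take a small sample, scan many candidate variables, and report the "winner". In the sketch below everything is pure noise, yet the best-looking correlation is impressively large (sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 20, 100                   # small sample, many candidate predictors

y = rng.normal(size=n)           # outcome: pure noise
X = rng.normal(size=(k, n))      # 100 predictors: also pure noise

# Correlate each predictor with y and keep the most flattering one.
rs = np.array([np.corrcoef(x, y)[0, 1] for x in X])
best = np.abs(rs).max()
print(f"Strongest 'pattern' found in pure noise: |r| = {best:.2f}")
```

This is the illusion of validity in one screenful: search enough variables in a small sample and a "strong" correlation is nearly guaranteed, causation or not.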
Practical heuristics: A quick decision rule when you see a correlation
- Ask for timing: which came first?
- Hunt for third variables: what else could explain both?
- Check sample size and variability (remember law of large numbers).
- Look for replication in other contexts.
- Prefer experiments; if not available, seek credible quasi-experiments.
Short mental script: "Could this be regression to the mean, confounding, selection bias, or reverse causality?" If any answer is yes, be cautious.
A tiny worked example (no math terror)
Scenario: A city introduces a new policing policy in January. Crime falls 20% by June. Mayor celebrates.
Questions to ask:
- Was there seasonal crime decline anyway? (confounder: season)
- Did crime fall everywhere—or only in neighborhoods where resources were already changing? (selection)
- Did reporting practices change? (measurement)
- Were similar declines observed in comparable cities without the policy? (replication/natural experiment)
If you can't rule these out, claiming causation is premature.
Key takeaways (the ones you’ll actually remember)
- Correlation ≠ causation. It’s not a motto; it’s a survival skill.
- Always consider temporal order, confounders, and mechanism.
- Use experiments or quasi-experiments when possible; otherwise be skeptical and look for replication.
- Your mind (System 1) will love a neat causal story. Use System 2 to check the checklist.
Memorable insight: A strong causal claim requires both a strong association and a strong reason not to be fooled.
Go forth and interrogate correlations like a polite but relentless detective.
Further reading and quick next steps
- Revisit the sections on Sample Size and Illusion of Validity — small samples + belief in patterns = spurious causation.
- If you liked the detective work, next dive into causal diagrams (DAGs) and simple instrumental variables — they’re the magnifying glasses of causal inference.