Correlation and Regression — The Sexy Side of Numbers (Actually Useful for CFA L1)
"Correlation is the gossip of data; regression is the confession booth." — Probably me, right now.
You just finished Probability Concepts and Statistical Inference — so you know about distributions, sampling variability, and hypothesis testing. Now we move to the duo that lets you quantify relationships between variables: correlation (how tight the gossip circle is) and regression (who influences whom — or at least who looks like they do). This is crucial for finance: think factor models, forecasting returns, or just making your Excel look impressively academic. But remember Ethics 101: correlation ≠ causation — misuse here is a fast route to misleading clients (and failing ethics questions).
1) Correlation: The Short Summary
What it is: A standardized measure of linear association between two variables. The most common is the Pearson correlation coefficient (r).
- Range: -1 to +1.
- r = +1 perfect positive linear relationship
- r = -1 perfect negative linear relationship
- r ≈ 0 little-to-no linear relationship
- Formula (conceptual):
r = cov(X, Y) / (σ_X * σ_Y)
- Interpretation: If r = 0.8, X and Y move together strongly in a linear sense. If r = 0.2, weak linear association — but there might still be a non-linear relationship.
Quick heuristics (context matters!):
- |r| < 0.3 — weak
- 0.3 ≤ |r| < 0.6 — moderate
- |r| ≥ 0.6 — strong
Ask yourself: Is the correlation economically meaningful, or just statistically significant because my sample is huge? Large N can make tiny r significant. That's where your Statistical Inference lessons kick in.
Nonparametric alternative
- Spearman rank correlation: measures monotonic relationships (good when data aren't linear or are ordinal).
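Both measures are quick to compute. A minimal sketch with made-up data (y = x³, a relationship that is monotonic but not linear), using NumPy and SciPy, shows why the two can disagree:

```python
import numpy as np
from scipy import stats

# Made-up data: monotonic but non-linear relationship (y = x^3)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson r: cov(X, Y) / (sigma_X * sigma_Y) -- measures LINEAR association
pearson_r = np.corrcoef(x, y)[0, 1]

# Spearman rho: correlation of the ranks -- measures MONOTONIC association
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")   # strong, but below 1: not linear
print(f"Spearman rho: {spearman_rho:.3f}")  # exactly 1: perfectly monotonic
```

Pearson is dragged below 1 by the curvature; Spearman, which only cares about ordering, is a perfect +1.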
2) Simple Linear Regression: The Basics
Model:
Y = β0 + β1 X + ε
- β1 (slope): expected change in Y for a one-unit change in X (ceteris paribus).
- β0 (intercept): predicted value of Y when X = 0 (may be meaningless if X = 0 is outside data range).
Estimation (OLS): choose β-hats to minimize sum of squared residuals.
Formulas (for simple regression):
β1_hat = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)^2
β0_hat = ȳ - β1_hat * x̄
Good to know: under the Gauss–Markov assumptions, OLS is BLUE — the best (minimum-variance) linear unbiased estimator (we’ll summarize those assumptions next).
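These estimator formulas are worth verifying by hand once. A sketch with hypothetical return data (the numbers are illustrative, not real), cross-checked against NumPy's built-in least-squares fit:

```python
import numpy as np

# Hypothetical data: X = market excess return (%), Y = stock excess return (%)
x = np.array([2.0, -1.0, 1.5, 0.0, 3.0, -2.0])
y = np.array([3.0, -1.5, 1.0, 0.2, 4.1, -2.8])

x_bar, y_bar = x.mean(), y.mean()

# beta1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# beta0_hat = y_bar - beta1_hat * x_bar
beta0_hat = y_bar - beta1_hat * x_bar

# Cross-check against NumPy's degree-1 polynomial (least-squares) fit
slope, intercept = np.polyfit(x, y, deg=1)
print(beta1_hat, beta0_hat)  # should match (slope, intercept)
```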
Partitioning variance: SST = SSR + SSE
- SST (total) = Σ(y_i - ȳ)^2
- SSR (explained by model) = Σ(ŷ_i - ȳ)^2
- SSE (residual) = Σ(y_i - ŷ_i)^2
R-squared: SSR / SST — proportion of variance in Y explained by X.
Note: A high R² isn't an automatic green light. Check residuals, think economics/logic, and watch for overfitting.
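The variance partition is easy to verify numerically. A small sketch (same hypothetical return data as above) that fits the line, splits the variation, and confirms SST = SSR + SSE:

```python
import numpy as np

# Hypothetical data: X = market excess return (%), Y = stock excess return (%)
x = np.array([2.0, -1.0, 1.5, 0.0, 3.0, -2.0])
y = np.array([3.0, -1.5, 1.0, 0.2, 4.1, -2.8])

# Fit the simple OLS line
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained)

r_squared = ssr / sst
# Identity check: SST = SSR + SSE (up to floating-point error)
print(sst, ssr + sse, r_squared)
```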
3) Hypothesis testing in regression
Test slope = 0 (no linear relationship):
t = β1_hat / SE(β1_hat)
Compare t to t-critical or compute a p-value. This ties directly to your Statistical Inference knowledge: sampling distributions, t-statistics, and confidence intervals.
Confidence interval for β1: β1_hat ± t_(α/2, n-2) * SE(β1_hat).
Prediction vs. Estimation:
- Confidence interval: for the mean E[Y|X=x0]
- Prediction interval: for an individual Y at X = x0 (wider, because it also includes the residual variance)
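Putting the inference pieces together: a sketch (hypothetical data again) that computes the slope t-statistic, its 95% confidence interval, and the two standard errors at a point x0 — confirming the prediction interval is the wider one:

```python
import numpy as np
from scipy import stats

# Hypothetical data: X = market excess return (%), Y = stock excess return (%)
x = np.array([2.0, -1.0, 1.5, 0.0, 3.0, -2.0])
y = np.array([3.0, -1.5, 1.0, 0.2, 4.1, -2.8])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s2 = np.sum(resid ** 2) / (n - 2)  # residual variance estimate
se_b1 = np.sqrt(s2 / sxx)          # standard error of the slope

t_stat = b1 / se_b1                               # H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
t_crit = stats.t.ppf(0.975, df=n - 2)             # 95% two-sided critical t
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# At x0: SE for the MEAN response (confidence) vs an INDIVIDUAL Y (prediction)
x0 = 1.0
se_mean = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / sxx))
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))
# se_pred > se_mean always: the extra "1" is the residual variance term
print(t_stat, p_value, ci_b1)
```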
4) Assumptions (the LINE checklist) and what breaks
- Linearity: relationship is linear in parameters
- Independence of errors: no autocorrelation
- Normality of errors (for small-sample inference)
- Equal variance (homoskedasticity)
If assumptions are violated: biased or inefficient estimates, wrong SEs, and misleading inference.
Common problems and quick remedies:
- Heteroskedasticity → use robust (White) standard errors
- Autocorrelation (time series) → use Durbin–Watson test; consider AR models or Newey–West SEs
- Multicollinearity (in multiple regression) → large SEs, unstable β-hats; check VIFs (>10 is suspicious)
- Omitted variable bias → estimate may be biased; think carefully about causal structure
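The heteroskedasticity remedy can be shown in a few lines. A sketch with simulated data whose noise variance grows with |x|, comparing the classical slope SE to a White (HC0) robust SE computed by hand (no particular stats library assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated heteroskedastic data: noise spread grows with |x|
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Classical SE: assumes constant error variance
se_classic = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)

# White (HC0) robust SE: weights each squared x-deviation by its own
# squared residual, so high-variance observations count for more
se_white = np.sqrt(np.sum(((x - x.mean()) ** 2) * resid ** 2) / sxx ** 2)

print(se_classic, se_white)  # robust SE is larger under this DGP
```

Here the classical SE understates the uncertainty; reporting it would make the slope look more precisely estimated than it really is.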
Omitted variable bias formula (simple intuition):
Bias(β1_hat) = β2 * [Cov(X1, X2) / Var(X1)]
Meaning: if an omitted variable affects Y and correlates with X, your β1 is biased.
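You can watch the bias formula work in a simulation. A sketch with a made-up true model Y = 1·X1 + 2·X2 + noise, where X2 is built so that Cov(X1, X2)/Var(X1) ≈ 0.5 — so omitting X2 should shift the slope by roughly β2 × 0.5 = 1.0:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# True model: Y = 1.0*X1 + 2.0*X2 + noise, with X2 correlated with X1
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # Cov(X1, X2)/Var(X1) ~ 0.5
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Short (mis-specified) regression: omit X2
b1_short = np.sum((x1 - x1.mean()) * (y - y.mean())) / np.sum((x1 - x1.mean()) ** 2)

# Predicted bias: beta2 * Cov(X1, X2)/Var(X1) = 2.0 * 0.5 = 1.0,
# so b1_short should land near 1.0 + 1.0 = 2.0, not the true 1.0
print(b1_short)
```

The estimate absorbs X2's effect through its correlation with X1 — exactly what the bias formula predicts.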
5) Practical finance example (mini)
Imagine regressing a stock's excess return (Y) on market excess return (X) — the CAPM spirit.
- β1_hat is the stock's beta (systematic risk).
- Test H0: β1 = 1 (is the stock as risky as market?) with t-test — this is a hypothesis test you've seen in Statistical Inference.
- A low R² doesn't mean beta is useless — beta may still be a key parameter for risk.
Table (toy data):
| Month | Market Excess (%) | Stock Excess (%) |
|---|---|---|
| 1 | 2.0 | 3.0 |
| 2 | -1.0 | -1.5 |
| 3 | 1.5 | 1.0 |
| 4 | 0.0 | 0.2 |
(You'd compute β1_hat using the formulas above — practice this in Excel or your calculator.)
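For concreteness, here's that computation on the four table rows (a sketch of what you'd do in Excel or on your calculator):

```python
import numpy as np

# The four months from the toy table above
market = np.array([2.0, -1.0, 1.5, 0.0])  # X: market excess return (%)
stock = np.array([3.0, -1.5, 1.0, 0.2])   # Y: stock excess return (%)

sxx = np.sum((market - market.mean()) ** 2)
beta = np.sum((market - market.mean()) * (stock - stock.mean())) / sxx
alpha = stock.mean() - beta * market.mean()

print(f"beta  = {beta:.3f}")   # 1.286: the stock amplifies market moves
print(f"alpha = {alpha:.3f}")  # -0.129
```

A beta above 1 says this stock moved more than one-for-one with the market over these (very few!) months — four observations is far too small a sample for real inference, which is the point of the t-tests above.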
6) Ethics: Don’t be that analyst who lies with statistics
- Never imply causation from correlation without a defensible causal model.
- Don’t cherry-pick variables or time periods to produce a headline-grabbing R².
- Disclose model limitations: sample period, data snooping, and assumption checks.
If your regression magically predicts everything with R² = 0.99, either you’ve discovered a financial miracle or you accidentally leaked future information into your predictor. Probable guilty party: look-ahead bias or data leakage.
7) Quick checklist before you report regression results
- Plot data and residuals (visualize before you worship a number).
- Check linearity and influential points (Cook’s distance).
- Test for heteroskedasticity and autocorrelation if time series.
- Consider multicollinearity in multivariate models (VIFs).
- Report β-hats, SEs, t-stats, p-values, R² (and adj. R²), and prediction vs confidence intervals.
- Be upfront about potential omitted variables and causality limits.
Closing: TL;DR (with Flair)
- Correlation tells you about co-movement, not cause.
- Regression estimates marginal effects and lets you test hypotheses (bring your t-tests!).
- Assumptions matter — violate them and your inference is a house of cards.
- Ethics matters — statistical glamour without transparency = investor harm and exam failure.
Final pep talk: Run your regressions, but don’t worship coefficients. Combine math with economic sense, check assumptions, and always ask: Does this story make sense outside the sample? If not, don’t publish it; fix it.
Version note: This builds on the probability and inference foundations you’ve already learned — now you get to apply those tests to relationships between variables and ask the ethical questions that separate decent analysts from dinner-table anecdote-sellers.