Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Regression I: Linear Models — Assumptions and Diagnostics
"You can fit a model and get an R-squared that looks like it went to charm school — but if assumptions are broken, you're throwing a very fancy paper airplane into a hurricane."
You're already familiar with the geometry of simple linear regression and the algebraic formulation of multiple linear regression. You also know how to evaluate models using train/validation/test splits and cross-validation to avoid leakage and overfitting. Great. Now we get to the slightly less glamorous — but infinitely more responsible — part of regression: checking assumptions and diagnosing problems so your conclusions don't evaporate under scrutiny.
Opening: Why assumptions matter (and why reviewers will ask about them)
A linear regression model is not just a curve that hugs your points; it's a package of assumptions that let you interpret coefficients, make predictions, and compute standard errors. If assumptions fail, your point estimates might still be OK, but your confidence intervals, p-values, and causal claims can become junk. Think of assumptions as the scaffolding of a house: the paint may look fine, but without good beams, it collapses.
Quick reminder tying to previous material:
- From the geometry of simple linear regression, you know the fitted line minimizes squared errors (projection in Euclidean space). That geometric intuition helps you read residual plots.
- From multiple linear regression formulation, you remember beta hat = (X'X)^{-1} X'y. Problems in X (multicollinearity) or in the residuals propagate directly through that inverse.
- From train/validation/test and CV, you learned to evaluate predictive performance. Diagnostics add statistical credibility: good predictive performance doesn't exempt you from checking assumptions if you want interpretable coefficients or valid inference.
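The point about problems in X propagating through the inverse is easy to see numerically. Here is a minimal sketch with synthetic data: when one column is nearly a copy of another, the condition number of X'X explodes, which is exactly what inflates coefficient standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)  # nearly an exact copy of x1
X = np.column_stack([np.ones(n), x1, x2])

# near-collinear columns make X'X close to singular:
# its condition number blows up, and so does the variance of beta hat
cond = np.linalg.cond(X.T @ X)
print(cond)
```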
Main content: The assumptions, how to detect violations, and what to do about them
We'll go assumption by assumption like a kindly but brutally honest TA.
1) Linearity: the model is linear in parameters
- What it means: The expected value of y given X is a linear combination of the predictors: E[y|X] = Xβ.
- Symptoms of violation: Systematic patterns in residuals vs fitted values (e.g., curves, U-shape).
- Diagnostics: Residuals vs fitted plot; component-plus-residual (partial residual) plots; generalized additive model (GAM) check.
- Fixes: Add polynomial terms, interaction terms, basis expansions (splines), or switch to a nonlinear model.
Useful mental image: if residuals vs fitted looks like a smiley face, your linear model is trying to wear clown shoes.
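To make the smiley-face image concrete, here is a small sketch on synthetic data: fit a straight line to a genuinely quadratic relationship and check that the residuals carry systematic curvature (they correlate strongly with x squared instead of looking like noise).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.3, size=x.size)  # truly quadratic relationship

# fit a straight line anyway
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# systematic curvature: residuals correlate strongly with x^2,
# the "smiley face" in a residuals-vs-fitted plot
curvature = np.corrcoef(resid, x**2)[0, 1]
print(curvature)
```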
2) Independence of errors (no autocorrelation)
- What it means: Residuals are uncorrelated across observations. Critical for time series or clustered data.
- Symptoms: Residuals show runs, trends, or correlation structure (especially in time series).
- Diagnostics: Durbin-Watson test; plot residuals against time or group index; autocorrelation function (ACF) plot.
- Fixes: Use lagged predictors, generalized least squares (GLS), mixed-effects models, or cluster-robust standard errors.
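A quick sketch of the Durbin-Watson idea, on a simulated residual series: DW is near 2 for uncorrelated errors and drops toward 0 as positive autocorrelation grows (roughly DW ≈ 2(1 − ρ) for AR(1) errors).

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n = 500
# AR(1) errors with rho = 0.9: strongly positively autocorrelated
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.9 * e[t - 1] + rng.normal()

dw = durbin_watson(e)
print(dw)  # roughly 2*(1 - 0.9), far below the "no autocorrelation" value of 2
```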
3) Homoscedasticity (constant variance)
- What it means: Var(epsilon | X) = sigma^2 — the spread of residuals is constant across levels of fitted values.
- Symptoms: Funnel-shaped residuals (variance increases with fitted); non-constant spread.
- Diagnostics: Residuals vs fitted plot; Breusch-Pagan test; White test.
- Fixes: Transform the outcome (log, Box-Cox), use weighted least squares (WLS), or robust (heteroskedasticity-consistent) standard errors.
4) Normality of errors (for inference)
- What it means: Errors are normally distributed. Note: normality is not required for unbiased betas, but matters for t-tests/CIs in small samples.
- Symptoms: Heavy tails, skewed residuals, outliers.
- Diagnostics: Q-Q plot of residuals; Shapiro-Wilk test (caution: sensitive to large sample sizes).
- Fixes: Transform y (log, Box-Cox), use bootstrap CIs, or use robust inference methods.
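The heavy-tails case can be sketched with Shapiro-Wilk on simulated residuals: a t-distribution with 2 degrees of freedom has much fatter tails than a normal, and the test flags it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
heavy_resid = rng.standard_t(df=2, size=200)  # heavy-tailed "residuals"

# small p-value -> reject normality (remember: with huge n, even tiny
# departures get flagged, which is why the Q-Q plot matters too)
p_heavy = stats.shapiro(heavy_resid).pvalue
print(p_heavy)
```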
5) No perfect multicollinearity
- What it means: Predictors are not exact linear combinations of each other.
- Symptoms: Large standard errors, wild coefficient swings when adding/removing predictors.
- Diagnostics: Variance inflation factor (VIF); condition number of X; near-singular X'X warnings.
- Fixes: Remove or combine collinear variables, use PCA/regression on principal components, or regularized methods (ridge, lasso).
6) Correct model specification
- What it means: The model includes the relevant predictors in the correct functional form; no important variables are omitted and no wrong transformations are imposed.
- Symptoms: Biased estimates, strange residual patterns, improved performance when adding omitted variable.
- Diagnostics: Ramsey RESET test; subject-matter checks; residual plots; added-variable plots.
- Fixes: Include missing confounders if available, add nonlinear terms, re-think causal model.
Diagnostics cheat-sheet (quick table)
| Assumption | Diagnostic plot/test | Typical remedy |
|---|---|---|
| Linearity | Residuals vs fitted, partial-residual plots | Add polynomials/splines |
| Independence | Durbin-Watson, ACF | GLS, mixed models, robust SEs |
| Homoscedasticity | Residuals vs fitted, BP test | Transform y, WLS, robust SEs |
| Normality | Q-Q plot, Shapiro-Wilk | Transform y, bootstrap |
| Multicollinearity | VIF, condition number | Drop/combine features, regularize |
| Specification | Ramsey RESET, residual patterns | Add terms, re-specify model |
Influence, leverage, and outliers — who's pulling the cart?
- Leverage measures how far an observation's X is from the mean X. High leverage = potential to influence the fit.
- Residual shows how much the model misses y for that point. A big residual + high leverage = dangerous.
- Cook's distance combines leverage and residual to quantify influence. Points with large Cook's D deserve scrutiny.
Practical rule of thumb: investigate points with Cook's D > 4/n, standardized residuals above 3 in magnitude, or leverage > 2p/n (where p is the number of model parameters and n the number of observations).
Hands-on mini recipe (Python with statsmodels; assumes X is a DataFrame of predictors and y the outcome)
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor

# fit OLS with statsmodels for rich diagnostics (add_constant supplies the intercept)
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
print(model.summary())  # coefficients, SEs, R^2, F-statistic

# diagnostic plots
resid = model.resid
fitted = model.fittedvalues
sm.qqplot(resid, line="s")  # visually inspect normality
plt.figure()
plt.scatter(fitted, resid)  # residuals vs fitted
plt.axhline(0, color="grey")
plt.show()

# heteroskedasticity: Breusch-Pagan returns (LM stat, LM p-value, F stat, F p-value)
bp_results = sms.het_breuschpagan(resid, model.model.exog)

# multicollinearity: one VIF per predictor (skip the constant at column 0)
vifs = [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])]

# influence: Cook's distance per observation (second element is p-values)
influence = model.get_influence()
cooks_d, _ = influence.cooks_distance

# if problems: try transformations / robust SEs / regularization
model_robust = model.get_robustcov_results(cov_type="HC3")
(If you prefer tidy sklearn pipelines, use statsmodels for inference diagnostics and sklearn for cross-validated predictive tuning.)
Closing: Checklist and next steps
- Always plot residuals vs fitted and a Q-Q plot. If you don't do anything else, do those two.
- Compute VIFs to check for multicollinearity.
- Use tests (Breusch-Pagan, Durbin-Watson) to confirm visual impressions, not replace them.
- If you find issues, ask: am I trying to make better predictions or make robust inference? Your remedy differs.
Key takeaways:
- Regression assumptions are not optional decorations — they are the rules that give your estimates meaning.
- Diagnostics are mostly visual but backed by tests; both matter.
- Remedies include transformations, weighting, robust SEs, model re-specification, or switching model families.
Next logical move (spoiler from the course roadmap): after you can diagnose and fix linear-model issues, we'll explore robust regression techniques and generalized linear models — tools for when assumptions are politely, persistently violated.
Final TA-level parting shot: "A model that fits the data but fails diagnostics is like a student who memorized the homework answers — maybe they can pass the test, but they didn't actually learn anything."
Suggested exercises: pick a real dataset (or your project data) and run the full diagnostics above. Document each violation you find, how you tested it, and the steps you took to fix or account for it. That write-up will be the difference between a cute chart and a credible analysis.