Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Regression I: Linear Models — Assumptions and Diagnostics
"You can fit a model and get an R-squared that looks like it went to charm school — but if assumptions are broken, you're throwing a very fancy paper airplane into a hurricane."
You're already familiar with the geometry of simple linear regression and the algebraic formulation of multiple linear regression. You also know how to evaluate models using train/validation/test splits and cross-validation to avoid leakage and overfitting. Great. Now we get to the slightly less glamorous — but infinitely more responsible — part of regression: checking assumptions and diagnosing problems so your conclusions don't evaporate under scrutiny.
Opening: Why assumptions matter (and why reviewers will ask about them)
A linear regression model is not just a curve that hugs your points; it's a package of assumptions that let you interpret coefficients, make predictions, and compute standard errors. If assumptions fail, your point estimates might still be OK, but your confidence intervals, p-values, and causal claims can become junk. Think of assumptions as the scaffolding of a house: the paint may look fine, but without good beams, it collapses.
Quick reminder tying to previous material:
- From the geometry of simple linear regression, you know the fitted line minimizes squared errors (projection in Euclidean space). That geometric intuition helps you read residual plots.
- From multiple linear regression formulation, you remember beta hat = (X'X)^{-1} X'y. Problems in X (multicollinearity) or in the residuals propagate directly through that inverse.
- From train/validation/test and CV, you learned to evaluate predictive performance. Diagnostics add statistical credibility: good predictive performance doesn't exempt you from checking assumptions if you want interpretable coefficients or valid inference.
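The point about problems in X propagating through the inverse is easy to see numerically. Here is a minimal sketch with synthetic data: when one column is nearly a copy of another, the condition number of X'X explodes, which is exactly what inflates coefficient standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)  # nearly an exact copy of x1
X = np.column_stack([np.ones(n), x1, x2])

# near-collinear columns make X'X close to singular:
# its condition number blows up, and so does the variance of beta hat
cond = np.linalg.cond(X.T @ X)
print(cond)
```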
Main content: The assumptions, how to detect violations, and what to do about them
We'll go assumption by assumption like a kindly but brutally honest TA.
1) Linearity: the model is linear in parameters
- What it means: The expected value of y given X is a linear combination of the predictors: E[y|X] = Xβ.
- Symptoms of violation: Systematic patterns in residuals vs fitted values (e.g., curves, U-shape).
- Diagnostics: Residuals vs fitted plot; component-plus-residual (partial residual) plots; generalized additive model (GAM) check.
- Fixes: Add polynomial terms, interaction terms, basis expansions (splines), or switch to a nonlinear model.
Useful mental image: if residuals vs fitted looks like a smiley face, your linear model is trying to wear clown shoes.
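To make the smiley-face image concrete, here is a small sketch on synthetic data: fit a straight line to a genuinely quadratic relationship and check that the residuals carry systematic curvature (they correlate strongly with x squared instead of looking like noise).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.3, size=x.size)  # truly quadratic relationship

# fit a straight line anyway
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# systematic curvature: residuals correlate strongly with x^2,
# the "smiley face" in a residuals-vs-fitted plot
curvature = np.corrcoef(resid, x**2)[0, 1]
print(curvature)
```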
2) Independence of errors (no autocorrelation)
- What it means: Residuals are uncorrelated across observations. Critical for time series or clustered data.
- Symptoms: Residuals show runs, trends, or correlation structure (especially in time series).
- Diagnostics: Durbin-Watson test; plot residuals against time or group index; autocorrelation function (ACF) plot.
- Fixes: Use lagged predictors, generalized least squares (GLS), mixed-effects models, or cluster-robust standard errors.
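A quick sketch of the Durbin-Watson idea, on a simulated residual series: DW is near 2 for uncorrelated errors and drops toward 0 as positive autocorrelation grows (roughly DW ≈ 2(1 − ρ) for AR(1) errors).

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n = 500
# AR(1) errors with rho = 0.9: strongly positively autocorrelated
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.9 * e[t - 1] + rng.normal()

dw = durbin_watson(e)
print(dw)  # roughly 2*(1 - 0.9), far below the "no autocorrelation" value of 2
```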
3) Homoscedasticity (constant variance)
- What it means: Var(epsilon | X) = sigma^2 — the spread of residuals is constant across levels of fitted values.
- Symptoms: Funnel-shaped residuals (variance increases with fitted); non-constant spread.
- Diagnostics: Residuals vs fitted plot; Breusch-Pagan test; White test.
- Fixes: Transform the outcome (log, Box-Cox), use weighted least squares (WLS), or robust (heteroskedasticity-consistent) standard errors.
4) Normality of errors (for inference)
- What it means: Errors are normally distributed. Note: normality is not required for unbiased betas, but matters for t-tests/CIs in small samples.
- Symptoms: Heavy tails, skewed residuals, outliers.
- Diagnostics: Q-Q plot of residuals; Shapiro-Wilk test (caution: sensitive to large sample sizes).
- Fixes: Transform y (log, Box-Cox), use bootstrap CIs, or use robust inference methods.
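The heavy-tails case can be sketched with Shapiro-Wilk on simulated residuals: a t-distribution with 2 degrees of freedom has much fatter tails than a normal, and the test flags it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
heavy_resid = rng.standard_t(df=2, size=200)  # heavy-tailed "residuals"

# small p-value -> reject normality (remember: with huge n, even tiny
# departures get flagged, which is why the Q-Q plot matters too)
p_heavy = stats.shapiro(heavy_resid).pvalue
print(p_heavy)
```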
5) No perfect multicollinearity
- What it means: Predictors are not exact linear combinations of each other.
- Symptoms: Large standard errors, wild coefficient swings when adding/removing predictors.
- Diagnostics: Variance inflation factor (VIF); condition number of X; near-singular X'X warnings.
- Fixes: Remove or combine collinear variables, use PCA/regression on principal components, or regularized methods (ridge, lasso).
6) Correct model specification
- What it means: The model includes the relevant predictors in the correct functional form; no important variables are omitted and no wrong transformations are imposed.
- Symptoms: Biased estimates, strange residual patterns, improved performance when adding omitted variable.
- Diagnostics: Ramsey RESET test; subject-matter checks; residual plots; added-variable plots.
- Fixes: Include missing confounders if available, add nonlinear terms, re-think causal model.
Diagnostics cheat-sheet (quick table)
| Assumption | Diagnostic plot/test | Typical remedy |
|---|---|---|
| Linearity | Residuals vs fitted, partial-residual plots | Add polynomials/splines |
| Independence | Durbin-Watson, ACF | GLS, mixed models, robust SEs |
| Homoscedasticity | Residuals vs fitted, BP test | Transform y, WLS, robust SEs |
| Normality | Q-Q plot, Shapiro-Wilk | Transform y, bootstrap |
| Multicollinearity | VIF, condition number | Drop/combine features, regularize |
| Specification | Ramsey RESET, residual patterns | Add terms, re-specify model |
Influence, leverage, and outliers — who's pulling the cart?
- Leverage measures how far an observation's X is from the mean X. High leverage = potential to influence the fit.
- Residual shows how much the model misses y for that point. A big residual + high leverage = dangerous.
- Cook's distance combines leverage and residual to quantify influence. Points with large Cook's D deserve scrutiny.
Practical rule of thumb: investigate points with Cook's D > 4/n, standardized residuals above 3 in magnitude, or leverage > 2p/n (where p is the number of model parameters and n the number of observations).
Hands-on mini recipe (Python with statsmodels; assumes X is a DataFrame of predictors and y the outcome)
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor

# fit OLS with statsmodels for rich diagnostics (add_constant supplies the intercept)
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
print(model.summary())  # coefficients, SEs, R^2, F-statistic

# diagnostic plots
resid = model.resid
fitted = model.fittedvalues
sm.qqplot(resid, line="s")  # visually inspect normality
plt.figure()
plt.scatter(fitted, resid)  # residuals vs fitted
plt.axhline(0, color="grey")
plt.show()

# heteroskedasticity: Breusch-Pagan returns (LM stat, LM p-value, F stat, F p-value)
bp_results = sms.het_breuschpagan(resid, model.model.exog)

# multicollinearity: one VIF per predictor (skip the constant at column 0)
vifs = [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])]

# influence: Cook's distance per observation (second element is p-values)
influence = model.get_influence()
cooks_d, _ = influence.cooks_distance

# if problems: try transformations / robust SEs / regularization
model_robust = model.get_robustcov_results(cov_type="HC3")
(If you prefer tidy sklearn pipelines, use statsmodels for inference diagnostics and sklearn for cross-validated predictive tuning.)
Closing: Checklist and next steps
- Always plot residuals vs fitted and a Q-Q plot. If you don't do anything else, do those two.
- Compute VIFs to check for multicollinearity.
- Use tests (Breusch-Pagan, Durbin-Watson) to confirm visual impressions, not replace them.
- If you find issues, ask: am I trying to make better predictions or make robust inference? Your remedy differs.
Key takeaways:
- Regression assumptions are not optional decorations — they are the rules that give your estimates meaning.
- Diagnostics are mostly visual but backed by tests; both matter.
- Remedies include transformations, weighting, robust SEs, model re-specification, or switching model families.
Next logical move (spoiler from the course roadmap): after you can diagnose and fix linear-model issues, we'll explore robust regression techniques and generalized linear models — tools for when assumptions are politely, persistently violated.
Final TA-level parting shot: "A model that fits the data but fails diagnostics is like a student who memorized the homework answers — maybe they can pass the test, but they didn't actually learn anything."
Suggested exercises: pick a real dataset (or your project data) and run the full diagnostics above. Document each violation you find, how you tested it, and the steps you took to fix or account for it. That write-up will be the difference between a cute chart and a credible analysis.