
Supervised Machine Learning: Regression and Classification
Chapters

1. Foundations of Supervised Learning

2. Data Wrangling and Feature Engineering

3. Exploratory Data Analysis for Predictive Modeling

4. Train/Validation/Test and Cross-Validation Strategies

5. Regression I: Linear Models

  • Simple Linear Regression Geometry
  • Multiple Linear Regression Formulation
  • Assumptions and Diagnostics
  • Ordinary Least Squares Solution
  • Gradient Descent for OLS
  • Heteroscedasticity and Robust Losses
  • Transformations of Targets and Features
  • Categorical Variables in Regression
  • Interaction Terms in Linear Models
  • Multicollinearity and VIF
  • Prediction Intervals vs Confidence Intervals
  • Feature Scaling Effects in OLS
  • Handling Outliers with Huber and Quantile Loss
  • Model Interpretation with Coefficients
  • Baseline and Dummy Regressors

6. Regression II: Regularization and Advanced Techniques

7. Classification I: Logistic Regression and Probabilistic View

8. Classification II: Thresholding, Calibration, and Metrics

9. Distance- and Kernel-Based Methods

10. Tree-Based Models and Ensembles

11. Handling Real-World Data Issues

12. Dimensionality Reduction and Feature Selection

13. Model Tuning, Pipelines, and Experiment Tracking

14. Model Interpretability and Responsible AI

15. Deployment, Monitoring, and Capstone Project


Regression I: Linear Models


Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.


Assumptions and Diagnostics

Diagnostics with Sass and Rigor


Regression I: Linear Models — Assumptions and Diagnostics

"You can fit a model and get an R-squared that looks like it went to charm school — but if assumptions are broken, you're throwing a very fancy paper airplane into a hurricane."

You're already familiar with the geometry of simple linear regression and the algebraic formulation of multiple linear regression. You also know how to evaluate models using train/validation/test splits and cross-validation to avoid leakage and overfitting. Great. Now we get to the slightly less glamorous — but infinitely more responsible — part of regression: checking assumptions and diagnosing problems so your conclusions don't evaporate under scrutiny.


Opening: Why assumptions matter (and why reviewers will ask about them)

A linear regression model is not just a curve that hugs your points; it's a package of assumptions that let you interpret coefficients, make predictions, and compute standard errors. If assumptions fail, your point estimates might still be OK, but your confidence intervals, p-values, and causal claims can become junk. Think of assumptions as the scaffolding of a house: the paint may look fine, but without good beams, it collapses.

Quick reminder tying to previous material:

  • From the geometry of simple linear regression, you know the fitted line minimizes squared errors (projection in Euclidean space). That geometric intuition helps you read residual plots.
  • From multiple linear regression formulation, you remember beta hat = (X'X)^{-1} X'y. Problems in X (multicollinearity) or in the residuals propagate directly through that inverse.
  • From train/validation/test and CV, you learned to evaluate predictive performance. Diagnostics add statistical credibility: good predictive performance doesn't exempt you from checking assumptions if you want interpretable coefficients or valid inference.
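The normal-equations formula above is easy to verify numerically. Here is a minimal numpy sketch on synthetic data (the coefficients 1.5, 2.0, and -0.5 are made up for illustration); note it solves the system rather than explicitly inverting X'X, which is both faster and numerically safer:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])  # intercept column first
y = 1.5 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'y, solved without forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [1.5, 2.0, -0.5]
```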

Main content: The assumptions, how to detect violations, and what to do about them

We'll go assumption by assumption like a kindly but brutally honest TA.

1) Linearity: the model is linear in parameters

  • What it means: The expected value of y given X is a linear combination of the predictors: E[y|X] = Xβ.
  • Symptoms of violation: Systematic patterns in residuals vs fitted values (e.g., curves, U-shape).
  • Diagnostics: Residuals vs fitted plot; component-plus-residual (partial residual) plots; generalized additive model (GAM) check.
  • Fixes: Add polynomial terms, interaction terms, basis expansions (splines), or switch to a nonlinear model.

Useful mental image: if residuals vs fitted looks like a smiley face, your linear model is trying to wear clown shoes.
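To see what a linearity violation looks like numerically, here is a small sketch (synthetic quadratic data with made-up coefficients): a straight-line fit leaves the curvature sitting in the residuals, which then correlate strongly with the omitted squared term.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 300)
# truth is quadratic, but we will fit a straight line anyway
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.3, size=x.size)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# the linear fit cannot absorb the curvature, so the residuals
# correlate strongly with x**2 (the omitted term)
curvature_corr = np.corrcoef(resid, x**2)[0, 1]
```

In a real analysis you rarely know which term is missing, which is why the residuals-vs-fitted plot (where this shows up as a U-shape) is the first thing to draw.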

2) Independence of errors (no autocorrelation)

  • What it means: Residuals are uncorrelated across observations. Critical for time series or clustered data.
  • Symptoms: Residuals show runs, trends, or correlation structure (especially in time series).
  • Diagnostics: Durbin-Watson test; plot residuals against time or group index; autocorrelation function (ACF) plot.
  • Fixes: Use lagged predictors, generalized least squares (GLS), mixed-effects models, or cluster-robust standard errors.
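The Durbin-Watson statistic is simple enough to compute by hand: DW = sum of squared successive residual differences over the sum of squared residuals, roughly 2(1 - rho) for first-order autocorrelation rho. A sketch with simulated errors (the AR(1) coefficient 0.8 is arbitrary):

```python
import numpy as np

def durbin_watson(resid):
    """DW ~ 2 means no first-order autocorrelation; well below 2
    signals positive autocorrelation (roughly DW = 2 * (1 - rho))."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(2)
iid = rng.normal(size=1000)          # independent errors

# AR(1) errors: each error drags along 80% of the previous one
ar = np.empty(1000)
ar[0] = rng.normal()
for t in range(1, 1000):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

dw_iid = durbin_watson(iid)  # close to 2
dw_ar = durbin_watson(ar)    # far below 2, near 2 * (1 - 0.8) = 0.4
```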

3) Homoscedasticity (constant variance)

  • What it means: Var(epsilon | X) = sigma^2 — the spread of residuals is constant across levels of fitted values.
  • Symptoms: Funnel-shaped residuals (variance increases with fitted); non-constant spread.
  • Diagnostics: Residuals vs fitted plot; Breusch-Pagan test; White test.
  • Fixes: Transform the outcome (log, Box-Cox), use weighted least squares (WLS), or robust (heteroskedasticity-consistent) standard errors.
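A crude numerical version of the funnel check — comparing residual spread in the low and high halves of the fitted values — can be sketched on simulated heteroscedastic data (the error scale growing as 0.5·x is an assumption of this toy example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(1, 10, size=n)
# error spread grows with x: the classic funnel shape
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
resid = y - fitted

# residual std in the top half of fitted values vs the bottom half;
# a ratio well above 1 flags non-constant variance
lo = resid[fitted < np.median(fitted)]
hi = resid[fitted >= np.median(fitted)]
spread_ratio = hi.std() / lo.std()
```

The Breusch-Pagan test formalizes the same idea by regressing squared residuals on the predictors; the split-halves ratio here is just the eyeball version.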

4) Normality of errors (for inference)

  • What it means: Errors are normally distributed. Note: normality is not required for unbiased betas, but matters for t-tests/CIs in small samples.
  • Symptoms: Heavy tails, skewed residuals, outliers.
  • Diagnostics: Q-Q plot of residuals; Shapiro-Wilk test (caution: sensitive to large sample sizes).
  • Fixes: Transform y (log, Box-Cox), use bootstrap CIs, or use robust inference methods.
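As an illustration of the bootstrap fix, here is a percentile-bootstrap confidence interval for a slope fitted under heavy-tailed (Student-t) errors; the data, the true slope of 2.0, and the 2000 resamples are all made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
# heavy-tailed t(3) errors: small-sample t-based CIs are questionable here
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)

def slope(xs, ys):
    X = np.column_stack([np.ones(xs.size), xs])
    return np.linalg.lstsq(X, ys, rcond=None)[0][1]

# percentile bootstrap: resample (x, y) pairs with replacement and refit
boot = np.array([
    slope(x[idx], y[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(2000))
])
ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])
```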

5) No perfect multicollinearity

  • What it means: Predictors are not exact linear combinations of each other.
  • Symptoms: Large standard errors, wild coefficient swings when adding/removing predictors.
  • Diagnostics: Variance inflation factor (VIF); condition number of X; near-singular X'X warnings.
  • Fixes: Remove or combine collinear variables, use PCA/regression on principal components, or regularized methods (ridge, lasso).
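VIF is just 1/(1 - R^2) from regressing one predictor on all the others, so it can be sketched without any library support. In this toy example x2 is deliberately constructed as a near-copy of x1, while x3 is independent:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                  # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the rest."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(X.shape[0]), others])
    coef = np.linalg.lstsq(A, X[:, j], rcond=None)[0]
    resid = X[:, j] - A @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(3)]
# x1 and x2 get huge VIFs (~100); x3 stays near 1
```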

6) Correct model specification

  • What it means: No important predictors are omitted, and the functional form is correct.
  • Symptoms: Biased estimates, strange residual patterns, improved performance when adding an omitted variable.
  • Diagnostics: Ramsey RESET test; subject-matter checks; residual plots; added-variable plots.
  • Fixes: Include missing confounders if available, add nonlinear terms, re-think causal model.
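Omitted-variable bias — the classic specification failure — is easy to simulate. In this sketch the true effect of x is 1.0, but dropping the confounder z nearly doubles the estimated coefficient (all numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # predictor depends on z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is 1.0

def ols(X, y):
    """OLS coefficients with an intercept column prepended."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

b_short = ols(x[:, None], y)[1]              # z omitted: biased upward
b_full = ols(np.column_stack([x, z]), y)[1]  # z included: close to 1.0
```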

Diagnostics cheat-sheet (quick table)

Assumption         Diagnostic plot/test                          Typical remedy
Linearity          Residuals vs fitted, partial-residual plots   Add polynomials/splines
Independence       Durbin-Watson, ACF                            GLS, mixed models, robust SEs
Homoscedasticity   Residuals vs fitted, Breusch-Pagan test       Transform y, WLS, robust SEs
Normality          Q-Q plot, Shapiro-Wilk                        Transform y, bootstrap
Multicollinearity  VIF, condition number                         Drop/combine features, regularize
Specification      Ramsey RESET, residual patterns               Add terms, re-specify model

Influence, leverage, and outliers — who's pulling the cart?

  • Leverage measures how far an observation's X is from the mean X. High leverage = potential to influence the fit.
  • Residual shows how much the model misses y for that point. A big residual + high leverage = dangerous.
  • Cook's distance combines leverage and residual to quantify influence. Points with large Cook's D deserve scrutiny.

Practical rule-of-thumb: investigate points with Cook's D > 4/n, or large standardized residuals (>3), or leverage > 2p/n.
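Leverage, studentized residuals, and Cook's distance can all be computed from the hat matrix H = X(X'X)^{-1}X'. A numpy sketch that plants one high-leverage, badly-missed point in otherwise well-behaved data and checks that it trips the 4/n rule (the planted values 8.0 and +10 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 2
x = rng.normal(size=n)
x[0] = 8.0                    # one far-out x: high leverage
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[0] += 10.0                  # and a big miss: high influence

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
h = np.diag(H)                         # leverage of each point
resid = y - H @ y
s2 = resid @ resid / (n - p)

# internally studentized residuals and Cook's distance
r = resid / np.sqrt(s2 * (1 - h))
cooks = r ** 2 * h / (p * (1 - h))

flagged = np.where(cooks > 4 / n)[0]   # rule-of-thumb threshold
# point 0 dominates: it has both high leverage and a large residual
```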


Hands-on mini recipe (pseudo-Python)

# real imports instead of the "assume imported" hand-wave
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor

# fit OLS with statsmodels for rich diagnostics
# (remember the intercept: X = sm.add_constant(X))
model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, SEs, R^2, F-stat, and more

# diagnostic plots
resid = model.resid
fitted = model.fittedvalues
sm.qqplot(resid, line='45')          # visually inspect normality
plt.scatter(fitted, resid, s=8)      # residuals vs fitted
plt.axhline(0, color='gray')

# heteroskedasticity: returns (LM stat, LM p-value, F stat, F p-value)
bp_stat, bp_pvalue, _, _ = sms.het_breuschpagan(resid, model.model.exog)

# multicollinearity: one VIF per column of X
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# influence
influence = model.get_influence()
cooks_d, cooks_p = influence.cooks_distance

# if problems: try transformations / robust SEs / regularization
model_robust = model.get_robustcov_results(cov_type='HC3')

(If you prefer tidy sklearn pipelines, use statsmodels for inference diagnostics and sklearn for cross-validated predictive tuning.)


Closing: Checklist and next steps

  • Always plot residuals vs fitted and a Q-Q plot. If you don't do anything else, do those two.
  • Compute VIFs to check for multicollinearity.
  • Use tests (Breusch-Pagan, Durbin-Watson) to confirm visual impressions, not replace them.
  • If you find issues, ask: am I trying to make better predictions or make robust inference? Your remedy differs.

Key takeaways:

  • Regression assumptions are not optional decorations — they are the rules that give your estimates meaning.
  • Diagnostics are mostly visual but backed by tests; both matter.
  • Remedies include transformations, weighting, robust SEs, model re-specification, or switching model families.

Next logical move (spoiler from the course roadmap): after you can diagnose and fix linear-model issues, we'll explore robust regression techniques and generalized linear models — tools for when assumptions are politely, persistently violated.

Final TA-level parting shot: "A model that fits the data but fails diagnostics is like a student who memorized the homework answers — maybe they can pass the test, but they didn't actually learn anything."


Suggested exercises: pick a real dataset (or your project data) and run the full diagnostics above. Document each violation you find, how you tested it, and the steps you took to fix or account for it. That write-up will be the difference between a cute chart and a credible analysis.
