© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification
Regression I: Linear Models


Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.

Ordinary Least Squares Solution

Ordinary Least Squares (OLS) — The No-Chill Breakdown

"If fitting a line were a relationship, OLS would be that brutally honest friend who minimizes regret (and the sum of squared errors)."

You're coming in hot from Multiple Linear Regression — where we set up X, y, and β — and Assumptions & Diagnostics — where we learned which broken assumptions make OLS throw a tantrum. You also just wrestled with train/validation/test and cross-validation strategies, so you know how to evaluate models without leaking future info. Good. We're doing the math now — but with personality.


Quick refresher (whispered): the setup

We assume the familiar linear model from earlier:

y = X β + ε
  • y is n×1 (responses)
  • X is n×(p+1) (design matrix, usually with a column of ones for intercept)
  • β is (p+1)×1 (coefficients)
  • ε is n×1 (noise)
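To make the setup concrete, here is a minimal NumPy sketch (with made-up toy numbers) of building the design matrix X with its column of ones:

```python
import numpy as np

# toy data: n=5 observations, p=2 raw features (hypothetical values)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([3.1, 4.0, 6.2, 8.9, 10.1])

# prepend a column of ones so beta[0] plays the role of the intercept
X = np.column_stack([np.ones(len(X_raw)), X_raw])
print(X.shape)  # (5, 3): n x (p+1)
```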

You already saw why the assumptions about ε (zero mean, homoscedasticity, no autocorrelation, exogeneity) matter in Diagnostics. Those assumptions will show up here as conditions that make OLS behave like a saint.


The objective: minimize squared mistakes

OLS picks β to minimize the sum of squared residuals (RSS):

RSS(β) = (y - Xβ)ᵀ (y - Xβ)

Why squared? Because squares punish large errors extra-hard and give a smooth differentiable objective we can optimize analytically. Also because quadratic = solvable = joy.
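In code, RSS for a candidate β is just the residual vector dotted with itself (a sketch with made-up values; the chosen β happens to fit this tiny dataset exactly):

```python
import numpy as np

def rss(X, y, beta):
    """Sum of squared residuals: (y - X beta)^T (y - X beta)."""
    r = y - X @ beta
    return r @ r

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
print(rss(X, y, np.array([1.0, 1.0])))  # 0.0 -- this beta fits exactly
```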

Normal equations — the classic closed form

Differentiate RSS wrt β and set gradient to zero:

∂RSS/∂β = -2 Xᵀ (y - Xβ) = 0

Solve for β:

Xᵀ X β̂ = Xᵀ y
β̂ = (Xᵀ X)^{-1} Xᵀ y     <-- the OLS estimator (normal equation)

This is the standard formula. When XᵀX is invertible, life is bliss: closed form, deterministic, and fast for small p.
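A minimal sketch of the normal equations in NumPy (simulated data, assumed coefficients): note that we solve the linear system XᵀXβ = Xᵀy directly rather than forming the inverse, which is the numerically preferred way to apply the same formula.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# solve X^T X beta = X^T y  (avoids computing (X^T X)^{-1} explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))  # close to beta_true, up to noise
```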


What makes this estimator good? (Gauss–Markov cameo)

Under the linear model with: (1) E[ε]=0, (2) Var(ε)=σ²I (homoscedasticity & no correlation), and (3) X fixed (or independent of ε), the Gauss–Markov theorem says:

β̂ is the Best Linear Unbiased Estimator (BLUE).

Best = minimum variance among linear unbiased estimators. Unbiased = E[β̂]=β. Linear = estimator is a linear function of y.

Note: BLUE does not mean it's the best possible estimator overall (non-linear or biased estimators like ridge might beat it in MSE when multicollinearity or overfitting are problems).


Computational realities & the drama of XᵀX

When is (XᵀX) invertible? When X has full column rank (no exact multicollinearity). Real-world problems:

  • p close to n (or p>n): XᵀX singular or ill-conditioned
  • Highly correlated predictors → large condition number → huge variance in β̂

Strategies:

  • Use the pseudo-inverse via SVD: β̂ = X⁺ y (stable)
  • Regularization (ridge, LASSO) — links to model selection and diagnostics from earlier
  • Use gradient-based solvers (useful when n and p are huge and a closed-form solve is infeasible)
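Here is a sketch of the pseudo-inverse route on a deliberately rank-deficient design (one column is an exact copy of another, so XᵀX is singular and the plain normal equations fail; SVD-based `lstsq` still returns the minimum-norm least-squares solution):

```python
import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0],
              [1.0, 5.0, 5.0]])   # columns 2 and 3 identical -> rank deficient
y = np.array([1.0, 2.0, 3.0, 4.0])

# lstsq uses SVD under the hood; rcond controls trimming of tiny singular values
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(X @ beta_hat, 6))  # fitted values still reproduce y
```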

Table: solution methods at a glance

Method                  | Formula / Idea                        | When to use
Normal equations        | β̂ = (XᵀX)^{-1} Xᵀ y                  | Small p, XᵀX invertible
SVD / pseudo-inverse    | β̂ = V Σ⁺ Uᵀ y (Σ⁺ inverts only the nonzero singular values) | Numerical stability, rank-deficient X
Gradient descent        | Iteratively minimize RSS              | Very large n or p, streaming data
Regularized OLS (ridge) | β̂ = (XᵀX + λI)^{-1} Xᵀ y             | Multicollinearity, lower variance
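The ridge estimator is a one-line change to the normal equations. A sketch (simulated data; the λ values are arbitrary, and for simplicity no intercept column is included, so nothing needs to be excluded from the penalty):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam I)^{-1} X^T y, via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=30)

print(np.round(ridge(X, y, lam=0.0), 2))   # lam = 0 recovers plain OLS
print(np.round(ridge(X, y, lam=10.0), 2))  # larger lam shrinks coefficients toward zero
```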

Gradient descent quick recipe (runnable NumPy sketch)

import numpy as np

def ols_gradient_descent(X, y, eta=1e-3, T=10000):
    beta = np.zeros(X.shape[1])               # zeros are a fine start: RSS is convex
    for _ in range(T):
        gradient = -2 * X.T @ (y - X @ beta)  # gradient of RSS at current beta
        beta = beta - eta * gradient          # step against the gradient
    return beta

Pros: memory-friendly, works with minibatches, easy to add regularization.
Cons: needs tuning (η), slower to converge to machine precision.


Diagnostics & evaluation — don't be sloppy

You already know to evaluate models using proper resampling (train/validation/test) and cross-validation. Important notes for OLS:

  • Fit β̂ on the training set only. Always. Otherwise you leak information and your CV / test MSE is lying.
  • Use k-fold CV to compare plain OLS to ridge/LASSO when worried about variance.
  • Learning curves (from the cross-validation topic): plot training vs validation error as n grows. If validation error drops and stabilizes, OLS is being useful; if a large gap persists, consider regularization or richer features.

Quick checklist before trusting β̂:

  1. Check multicollinearity (VIFs) — high VIFs → inflated variances
  2. Inspect residuals (from Diagnostics) — non-constant variance or autocorrelation invalidates Gauss–Markov
  3. Identify outliers & high-leverage points — they disproportionately influence β̂
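Item 1 of the checklist is easy to do by hand: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the other predictors. A self-contained NumPy sketch (no statsmodels required; the toy data deliberately makes the first two columns nearly collinear):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the rest (plus intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=100)
X = np.column_stack([a, a + 0.05 * rng.normal(size=100), rng.normal(size=100)])
print(np.round(vif(X), 1))  # first two columns nearly collinear -> huge VIFs
```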

Remember: OLS tells you what fits the data under its assumptions. It doesn't magically grant causality or robustness.


A few juicy insights (so you actually remember this):

  • OLS is a balancing act. It minimizes squared vertical distances. If you care about symmetric errors or quantiles, use other losses.
  • In high dimensions, unbiased ≠ low error. Biased regularized estimators (ridge) often win in MSE. That's why we fold cross-validation into model selection.
  • Stability beats exactness. Numerically stable methods (SVD) are often preferable to a literal matrix inverse — especially with real data's mess.

Concrete mini-workflow (where math meets hygiene)

  1. Standardize predictors (if interpretability or regularization matters).
  2. Split data correctly (train/validation/test). No peeking. Use CV for λ tuning.
  3. Compute OLS via SVD or normal equations (if safe).
  4. Run diagnostics: residual plots, VIF, Cook's distance.
  5. If problems appear → try ridge/LASSO, a transformation, or remove or combine predictors.
  6. Report test MSE and confidence intervals for β̂ (if assumptions hold).
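Step 4's outlier check can also be done by hand. A sketch of Cook's distance, D_i = e_i² h_i / (p s² (1 − h_i)²), using QR to get the leverages (simulated data with one planted gross outlier):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation in an OLS fit."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)                # X = QR; hat matrix H = Q Q^T
    h = (Q ** 2).sum(axis=1)              # leverages: diagonal of H
    s2 = resid @ resid / (n - p)          # residual variance estimate
    return resid ** 2 / (p * s2) * h / (1 - h) ** 2

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=40)
y[0] += 5.0                               # plant one gross outlier
d = cooks_distance(X, y)
print(int(np.argmax(d)))  # 0 -- the planted outlier dominates
```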

Code snippet for the k-fold CV loop (a runnable sketch using scikit-learn; train_X, train_y, test_X, test_y are assumed to come from your split, and small λ values in the grid approximate plain OLS):

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

grid = np.logspace(-3, 3, 13)        # candidate λ values
model = RidgeCV(alphas=grid, cv=5)   # k-fold CV picks λ* = argmin of CV error
model.fit(train_X, train_y)          # then refits on the full training set with λ*
test_mse = mean_squared_error(test_y, model.predict(test_X))

Final mic drop — TL;DR

  • OLS = closed-form β̂ = (XᵀX)^{-1}Xᵀy when XᵀX invertible.
  • Gauss–Markov: under standard assumptions, OLS is the best linear unbiased estimator.
  • In practice watch out for multicollinearity, outliers, heteroscedasticity, and numerical instability.
  • Use SVD / pseudo-inverse or gradient methods when needed, and fold model selection into proper CV to avoid leakage.

If you remember one thing: OLS is elegant and powerful, but it’s not invincible. Treat it like a noble but fragile friend — give it good data, protect it from leakage, and don’t let it near highly collinear variables without supervision.

Go run some fits, make some plots, and let the residuals talk back to you. Need help implementing OLS with SVD or writing a tidy CV loop? Say the word and I’ll memeify the code for you.
