© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification
Regression I: Linear Models


Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.

Ordinary Least Squares Solution

Ordinary Least Squares (OLS) — The No-Chill Breakdown

"If fitting a line were a relationship, OLS would be that brutally honest friend who minimizes regret (and the sum of squared errors)."

You're coming in hot from Multiple Linear Regression — where we set up X, y, and β — and Assumptions & Diagnostics — where we learned which broken assumptions make OLS throw a tantrum. You also just wrestled with train/validation/test and cross-validation strategies, so you know how to evaluate models without leaking future info. Good. We're doing the math now — but with personality.


Quick refresher (whispered): the setup

We assume the familiar linear model from earlier:

y = X β + ε
  • y is n×1 (responses)
  • X is n×(p+1) (design matrix, usually with a column of ones for intercept)
  • β is (p+1)×1 (coefficients)
  • ε is n×1 (noise)
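To make the setup concrete, here is a minimal NumPy sketch (with made-up toy numbers) of building the design matrix X with its column of ones:

```python
import numpy as np

# toy data: n=5 observations, p=2 raw features (hypothetical values)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([3.1, 4.0, 6.2, 8.9, 10.1])

# prepend a column of ones so beta[0] plays the role of the intercept
X = np.column_stack([np.ones(len(X_raw)), X_raw])
print(X.shape)  # (5, 3): n x (p+1)
```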

You already saw why the assumptions about ε (zero mean, homoscedasticity, no autocorrelation, exogeneity) matter in Diagnostics. Those assumptions will show up here as conditions that make OLS behave like a saint.


The objective: minimize squared mistakes

OLS picks β to minimize the sum of squared residuals (RSS):

RSS(β) = (y - Xβ)ᵀ (y - Xβ)

Why squared? Because squares punish large errors extra-hard and give a smooth differentiable objective we can optimize analytically. Also because quadratic = solvable = joy.
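In code, RSS for a candidate β is just the residual vector dotted with itself (a sketch with made-up values; the chosen β happens to fit this tiny dataset exactly):

```python
import numpy as np

def rss(X, y, beta):
    """Sum of squared residuals: (y - X beta)^T (y - X beta)."""
    r = y - X @ beta
    return r @ r

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
print(rss(X, y, np.array([1.0, 1.0])))  # 0.0 -- this beta fits exactly
```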

Normal equations — the classic closed form

Differentiate RSS wrt β and set gradient to zero:

∂RSS/∂β = -2 Xᵀ (y - Xβ) = 0

Solve for β:

Xᵀ X β̂ = Xᵀ y
β̂ = (Xᵀ X)^{-1} Xᵀ y     <-- the OLS estimator (normal equation)

This is the standard formula. When XᵀX is invertible, life is bliss: closed form, deterministic, and fast for small p.
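A minimal sketch of the normal equations in NumPy (simulated data, assumed coefficients): note that we solve the linear system XᵀXβ = Xᵀy directly rather than forming the inverse, which is the numerically preferred way to apply the same formula.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# solve X^T X beta = X^T y  (avoids computing (X^T X)^{-1} explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))  # close to beta_true, up to noise
```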


What makes this estimator good? (Gauss–Markov cameo)

Under the linear model with: (1) E[ε]=0, (2) Var(ε)=σ²I (homoscedasticity & no correlation), and (3) X fixed (or independent of ε), the Gauss–Markov theorem says:

β̂ is the Best Linear Unbiased Estimator (BLUE).

Best = minimum variance among linear unbiased estimators. Unbiased = E[β̂]=β. Linear = estimator is a linear function of y.

Note: BLUE does not mean it's the best possible estimator overall (non-linear or biased estimators like ridge might beat it in MSE when multicollinearity or overfitting are problems).


Computational realities & the drama of XᵀX

When is (XᵀX) invertible? When X has full column rank (no exact multicollinearity). Real-world problems:

  • p close to n (or p>n): XᵀX singular or ill-conditioned
  • Highly correlated predictors → large condition number → huge variance in β̂

Strategies:

  • Use the pseudo-inverse via SVD: β̂ = X⁺ y (stable)
  • Regularization (ridge, LASSO) — links to model selection and diagnostics from earlier
  • Use gradient-based solvers (useful when n and p are huge and a closed-form solve is infeasible)
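Here is a sketch of the pseudo-inverse route on a deliberately rank-deficient design (one column is an exact copy of another, so XᵀX is singular and the plain normal equations fail; SVD-based `lstsq` still returns the minimum-norm least-squares solution):

```python
import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0],
              [1.0, 5.0, 5.0]])   # columns 2 and 3 identical -> rank deficient
y = np.array([1.0, 2.0, 3.0, 4.0])

# lstsq uses SVD under the hood; rcond controls trimming of tiny singular values
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(X @ beta_hat, 6))  # fitted values still reproduce y
```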

Table: solution methods at a glance

Method                  | Formula / Idea                        | When to use
Normal equations        | β̂ = (XᵀX)^{-1} Xᵀ y                  | Small p, XᵀX invertible
SVD / pseudo-inverse    | β̂ = V Σ⁺ Uᵀ y (Σ⁺ inverts only the nonzero singular values) | Numerical stability, rank-deficient X
Gradient descent        | Iteratively minimize RSS              | Very large n or p, streaming data
Regularized OLS (ridge) | β̂ = (XᵀX + λI)^{-1} Xᵀ y             | Multicollinearity, lower variance
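The ridge estimator is a one-line change to the normal equations. A sketch (simulated data; the λ values are arbitrary, and for simplicity no intercept column is included, so nothing needs to be excluded from the penalty):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam I)^{-1} X^T y, via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=30)

print(np.round(ridge(X, y, lam=0.0), 2))   # lam = 0 recovers plain OLS
print(np.round(ridge(X, y, lam=10.0), 2))  # larger lam shrinks coefficients toward zero
```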

Gradient descent quick recipe (runnable NumPy sketch)

import numpy as np

def ols_gradient_descent(X, y, eta=1e-3, T=10000):
    beta = np.zeros(X.shape[1])               # zeros are a fine start: RSS is convex
    for _ in range(T):
        gradient = -2 * X.T @ (y - X @ beta)  # gradient of RSS at current beta
        beta = beta - eta * gradient          # step against the gradient
    return beta

Pros: memory-friendly, works with minibatches, easy to add regularization.
Cons: needs tuning (η), slower to converge to machine precision.


Diagnostics & evaluation — don't be sloppy

You already know to evaluate models using proper resampling (train/validation/test) and cross-validation. Important notes for OLS:

  • Fit β̂ on the training set only. Always. Otherwise you leak information and your CV / test MSE is lying.
  • Use k-fold CV to compare plain OLS to ridge/LASSO when worried about variance.
  • Learning curves (from the cross-validation topic): plot training vs validation error as n grows. If validation error drops and stabilizes, OLS is being useful; if a large gap persists, consider regularization or richer features.

Quick checklist before trusting β̂:

  1. Check multicollinearity (VIFs) — high VIFs → inflated variances
  2. Inspect residuals (from Diagnostics) — non-constant variance or autocorrelation invalidates Gauss–Markov
  3. Identify outliers & high-leverage points — they disproportionately influence β̂
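Item 1 of the checklist is easy to do by hand: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the other predictors. A self-contained NumPy sketch (no statsmodels required; the toy data deliberately makes the first two columns nearly collinear):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the rest (plus intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=100)
X = np.column_stack([a, a + 0.05 * rng.normal(size=100), rng.normal(size=100)])
print(np.round(vif(X), 1))  # first two columns nearly collinear -> huge VIFs
```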

Remember: OLS tells you what fits the data under its assumptions. It doesn't magically grant causality or robustness.


A few juicy insights (so you actually remember this):

  • OLS is a balancing act. It minimizes squared vertical distances. If you care about symmetric errors or quantiles, use other losses.
  • In high dimensions, unbiased ≠ low error. Biased regularized estimators (ridge) often win in MSE. That's why we fold cross-validation into model selection.
  • Stability beats exactness. Numerically stable methods (SVD) are often preferable to a literal matrix inverse — especially with real data's mess.

Concrete mini-workflow (where math meets hygiene)

  1. Standardize predictors (if interpretability or regularization matters).
  2. Split data correctly (train/validation/test). No peeking. Use CV for λ tuning.
  3. Compute OLS via SVD or normal equations (if safe).
  4. Run diagnostics: residual plots, VIF, Cook's distance.
  5. If problems appear → try ridge/LASSO, a transformation, or remove or combine predictors.
  6. Report test MSE and confidence intervals for β̂ (if assumptions hold).
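Step 4's outlier check can also be done by hand. A sketch of Cook's distance, D_i = e_i² h_i / (p s² (1 − h_i)²), using QR to get the leverages (simulated data with one planted gross outlier):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation in an OLS fit."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)                # X = QR; hat matrix H = Q Q^T
    h = (Q ** 2).sum(axis=1)              # leverages: diagonal of H
    s2 = resid @ resid / (n - p)          # residual variance estimate
    return resid ** 2 / (p * s2) * h / (1 - h) ** 2

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=40)
y[0] += 5.0                               # plant one gross outlier
d = cooks_distance(X, y)
print(int(np.argmax(d)))  # 0 -- the planted outlier dominates
```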

Code snippet for the k-fold CV loop (a runnable sketch using scikit-learn; train_X, train_y, test_X, test_y are assumed to come from your split, and small λ values in the grid approximate plain OLS):

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

grid = np.logspace(-3, 3, 13)        # candidate λ values
model = RidgeCV(alphas=grid, cv=5)   # k-fold CV picks λ* = argmin of CV error
model.fit(train_X, train_y)          # then refits on the full training set with λ*
test_mse = mean_squared_error(test_y, model.predict(test_X))

Final mic drop — TL;DR

  • OLS = closed-form β̂ = (XᵀX)^{-1}Xᵀy when XᵀX invertible.
  • Gauss–Markov: under standard assumptions, OLS is the best linear unbiased estimator.
  • In practice watch out for multicollinearity, outliers, heteroscedasticity, and numerical instability.
  • Use SVD / pseudo-inverse or gradient methods when needed, and fold model selection into proper CV to avoid leakage.

If you remember one thing: OLS is elegant and powerful, but it’s not invincible. Treat it like a noble but fragile friend — give it good data, protect it from leakage, and don’t let it near highly collinear variables without supervision.

Go run some fits, make some plots, and let the residuals talk back to you. Need help implementing OLS with SVD or writing a tidy CV loop? Say the word and I’ll memeify the code for you.
