Supervised Machine Learning: Regression and Classification

Regression I: Linear Models

Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.

Gradient Descent: Chill but Precise

intermediate

humorous

machine learning

gpt-5-mini

Gradient Descent for OLS — The Slow-Cooked Exactness (but Faster)

"You already saw the exact recipe in class: the OLS closed-form. Now meet the sous-chef who actually cooks when the kitchen is enormous."

You learned the Ordinary Least Squares closed-form solution earlier (β = (XᵀX)⁻¹Xᵀy). You also learned the assumptions and diagnostics that tell you whether that solution is trustworthy. Now we’ll pick up where that left off and ask: when does the closed-form solution become impractical, and how do we use Gradient Descent (GD) to minimize the same Ordinary Least Squares (OLS) loss? We’ll cover the math, the variants (batch / stochastic / mini-batch), practical tips (scaling, learning rates, early stopping), interactions with cross-validation and leakage, and a few battle-tested heuristics.

Quick reminder: objective we minimize

We want to minimize the Residual Sum of Squares (RSS). A convenient normalized version (half the mean squared error) makes gradients prettier, because the 1/2 factor cancels the 2 that appears when you differentiate the square:

J(β) = (1 / (2n)) ∥y − Xβ∥²

n = number of samples
X is n×p, β is p×1

The closed-form β* solves ∇J(β)=0 → β* = (XᵀX)⁻¹Xᵀy (assuming invertibility). But for large n or p, or streaming data, that matrix inverse is expensive or unstable. Enter Gradient Descent.
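To make the objective and the closed-form solution concrete, here is a minimal NumPy sketch. The synthetic data, the helper name `ols_loss`, and the variable names are assumptions made for illustration; they are not part of the lesson. The sketch solves the normal equations with `np.linalg.solve` instead of forming an explicit inverse, then checks that the gradient is (numerically) zero at the solution.

```python
import numpy as np

def ols_loss(X, y, beta):
    """J(beta) = (1 / (2n)) * ||y - X beta||^2."""
    residual = y - X @ beta
    return residual @ residual / (2 * len(y))

# Synthetic data (an assumption for illustration): y = X beta_true + small noise.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form via the normal equations (X^T X) beta = X^T y.
# np.linalg.solve avoids computing the explicit inverse.
beta_star = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer, the gradient -(1/n) X^T (y - X beta) should vanish.
grad_at_star = -(X.T @ (y - X @ beta_star)) / n
print("loss at beta_star:", ols_loss(X, y, beta_star))
print("max |gradient| at beta_star:", np.abs(grad_at_star).max())
```

For a small, well-conditioned problem like this one the closed form is hard to beat; the rest of the lesson is about what to do when it is not an option.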

Derive the gradient (fast napkin math)

Compute the gradient with respect to β:

∇J(β) = −(1/n) Xᵀ(y − Xβ)

So the classic batch gradient descent update is:

β ← β − α ∇J(β) = β + (α / n) Xᵀ(y − Xβ)

Here α is the learning rate. The 1/2 factor in J(β) cancels the 2 produced by differentiating the squared norm, which is why no stray constant shows up in the gradient or the update.
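If you distrust napkin math, a finite-difference check is a cheap way to confirm the gradient formula. The sketch below is an illustration under assumptions: random data and the hypothetical helpers `loss`, `grad`, and `numerical_grad`. It compares the analytic gradient with a central-difference estimate.

```python
import numpy as np

def loss(X, y, beta):
    r = y - X @ beta
    return r @ r / (2 * len(y))

def grad(X, y, beta):
    # Analytic gradient from the derivation: -(1/n) X^T (y - X beta)
    return -(X.T @ (y - X @ beta)) / len(y)

def numerical_grad(X, y, beta, eps=1e-6):
    # Central differences, one coordinate at a time.
    g = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = eps
        g[j] = (loss(X, y, beta + step) - loss(X, y, beta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
beta = rng.normal(size=4)

max_gap = np.max(np.abs(grad(X, y, beta) - numerical_grad(X, y, beta)))
print("largest disagreement:", max_gap)
```

Because the loss is quadratic, the central difference is exact up to rounding, so any sizeable gap points to a sign or scaling mistake in the analytic formula.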

Why use Gradient Descent for OLS? (When closed-form chokes)

Large-scale data: forming XᵀX costs O(np²) time and O(p²) memory, and solving the resulting p×p system costs O(p³). Not great when p ~ 100k.
Streaming / online learning: samples arrive over time; you can update incrementally with SGD.
Memory constraints: avoiding storing large matrices.
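Numerical conditioning: explicitly inverting an ill-conditioned XᵀX amplifies rounding error. Iterative updates never form the inverse, although they also converge slowly along poorly conditioned directions unless you scale or regularize.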

But remember: GD doesn't magically validate OLS assumptions. You still run diagnostics (residuals, heteroskedasticity tests, influential points) as before.
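The streaming case from the list above can be sketched as a single-sample update, β ← β + α (yᵢ − xᵢᵀβ) xᵢ. This is a minimal illustration, assuming a simulated stream of samples and a hypothetical helper `sgd_step`; a real online system would read from its own data source.

```python
import numpy as np

def sgd_step(beta, x_i, y_i, lr):
    """One SGD update using a single sample (x_i, y_i)."""
    error = y_i - x_i @ beta
    return beta + lr * error * x_i

rng = np.random.default_rng(2)
beta_true = np.array([1.5, -2.0])   # assumed ground truth for the simulation
beta = np.zeros(2)

for t in range(5000):
    x_i = rng.normal(size=2)                      # a new sample "arrives"
    y_i = x_i @ beta_true + 0.05 * rng.normal()   # with a little noise
    beta = sgd_step(beta, x_i, y_i, lr=0.01)      # update, then forget the sample

print("estimate after streaming:", beta)
```

Nothing is stored except β itself, which is the whole appeal when the data will not fit in memory or never stops arriving.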

Variants: Batch vs Stochastic vs Mini-batch (the family reunion)

Method	Update uses	Pros	Cons
Batch GD	All n samples	Smooth descent, stable gradients	Expensive per step, slow for big n
Stochastic GD (SGD)	One sample at a time	Cheap updates, quick progress, online	Noisy, needs lower learning rates, bumpy loss curve
Mini-batch GD	b samples (e.g., 32–1024)	Best practical tradeoff — vectorized and parallelizable	Need to tune batch size

Practical default: mini-batch with batch sizes tuned to hardware (e.g., 256) unless you're in a streaming/online setting (then use SGD).

Pseudocode — mini-batch GD for OLS

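# Mini-batch gradient descent for OLS (NumPy)
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=256, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)  # initialize beta = zeros(p)
    for epoch in range(max_epochs):
        order = rng.permutation(n)  # shuffle training data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            pred = X_b @ beta
            grad = -(1 / len(idx)) * X_b.T @ (y_b - pred)
            beta = beta - lr * grad
        # optionally compute val_loss here and early-stop
    return beta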

Add momentum or Adam if you like faster convergence, but keep in mind those optimizers can hide learning dynamics and make diagnostics trickier.
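For reference, classical (heavy-ball) momentum only adds a velocity vector to the plain update. The sketch below is one common formulation, not the only one; the function name `gd_momentum`, the synthetic data, and the hyperparameter values are assumptions for illustration. It checks that the iterates land on the same minimizer that a least-squares solver finds.

```python
import numpy as np

def gd_momentum(X, y, lr=0.05, momentum=0.9, steps=500):
    """Full-batch gradient descent with heavy-ball momentum for OLS."""
    n, p = X.shape
    beta = np.zeros(p)
    velocity = np.zeros(p)
    for _ in range(steps):
        grad = -(X.T @ (y - X @ beta)) / n
        velocity = momentum * velocity - lr * grad  # decaying sum of past steps
        beta = beta + velocity
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, 1.0, -1.5]) + 0.1 * rng.normal(size=200)

beta_momentum = gd_momentum(X, y)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print("max difference from lstsq:", np.abs(beta_momentum - beta_lstsq).max())
```

Momentum changes the path, not the destination: the OLS loss is convex, so any optimizer that converges ends up at the same β.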

Hyperparameters & heuristics (the stuff your textbook forgot to dramatize)

Learning rate (α): The single most important knob. Too big → divergence. Too small → glacial progress. Start with 0.01 or 0.001 and search over a log-spaced grid.
Feature scaling: Effectively required for GD. Features on very different scales make the loss surface badly conditioned, which forces a tiny learning rate and slow convergence. Standardize each feature (zero mean, unit variance) using statistics from the training fold only. If you standardize on the whole dataset before cross-validation, you leak info. (Yes, we remembered the last lecture on leakage.)
Batch size: Small batches add noise (good regularizer). Big batches are more stable and hardware-friendly.
Initialization: Zero or small random values are both fine for linear models, because the OLS loss is convex and has no bad local minima.
Stopping criteria: max epochs, tolerance on loss change, or early stopping based on validation loss.
Momentum / Adam: Speeds up convergence. For linear regression, classical SGD with a tuned learning rate and maybe momentum often suffices.
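"Too big → divergence" can be made precise for batch GD on OLS. The Hessian of J is XᵀX / n, and the iteration converges only when α < 2 / λ_max, where λ_max is its largest eigenvalue. The sketch below illustrates this on synthetic, deliberately unscaled data; the helper names `run_batch_gd` and `mse` and all numbers are assumptions for illustration. It also shows how standardizing the features raises the permissible learning rate.

```python
import numpy as np

def run_batch_gd(X, y, lr, steps=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        beta = beta + (lr / n) * X.T @ (y - X @ beta)
    return beta

rng = np.random.default_rng(3)
# Two features on very different scales (left unscaled on purpose).
X = rng.normal(size=(300, 2)) * np.array([1.0, 10.0])
y = X @ np.array([1.0, 0.3]) + 0.1 * rng.normal(size=300)
n = len(y)

def mse(beta):
    return np.mean((y - X @ beta) ** 2)

# Largest eigenvalue of the Hessian X^T X / n sets the stability limit.
lam_max = np.linalg.eigvalsh(X.T @ X / n).max()
lr_limit = 2.0 / lam_max

beta_ok = run_batch_gd(X, y, lr=0.9 * lr_limit)   # just under the limit
beta_bad = run_batch_gd(X, y, lr=1.1 * lr_limit)  # just over the limit

# Standardizing the features shrinks lam_max, so larger steps become safe.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
lr_limit_scaled = 2.0 / np.linalg.eigvalsh(X_std.T @ X_std / n).max()

print("limit (raw features):   ", lr_limit)
print("limit (standardized):   ", lr_limit_scaled)
print("MSE just under limit:   ", mse(beta_ok))
print("MSE just over limit:    ", mse(beta_bad))
```

In practice nobody computes eigenvalues for a large problem, which is why the log-spaced grid search remains the working heuristic; the bound mainly explains why scaling and learning rate are so tightly linked.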

Regularization & gradients — Ridge example

If you add L2 regularization (Ridge), the objective becomes:

J(β) = (1 / (2n)) ∥y − Xβ∥² + (λ / 2) ∥β∥²

Gradient: ∇J(β) = −(1/n) Xᵀ(y − Xβ) + λβ

Update: β ← β − α (∇J(β))

Note: Ridge also has a closed-form (β = (XᵀX + nλI)⁻¹Xᵀy). If you want to regularize to deal with collinearity, ridge is often a better solution than relying on GD noise.
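A quick numerical check that the Ridge gradient update and the closed form agree. This is a sketch on synthetic data; the values of λ, the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 200, 3, 0.1
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Closed form: beta = (X^T X + n * lam * I)^(-1) X^T y
beta_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Gradient descent on J(beta) = (1/(2n)) ||y - X beta||^2 + (lam/2) ||beta||^2
beta_gd = np.zeros(p)
lr = 0.1
for _ in range(500):
    grad = -(X.T @ (y - X @ beta_gd)) / n + lam * beta_gd
    beta_gd = beta_gd - lr * grad

print("max difference:", np.abs(beta_gd - beta_closed).max())
```

The factor n in nλI comes from the 1/n in front of the data term: multiply the zero-gradient condition by n and it appears on the penalty.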

Cross-validation, leakage, and early stopping — the correct choreography

When you put GD inside cross-validation or training/validation loops, pay attention to these traps:

Scaling per fold: Fit scalers (mean/std) on the training fold only, then apply them to the validation/test fold. Otherwise statistics from the held-out data leak into training.
Reset optimization state per fold: Reinitialize β and optimizer state for each CV fold. Don't carry momentum or moments across folds, because that lets earlier folds' data influence the current fit: data leakage in disguise.
Early stopping uses validation loss: Monitor loss on a validation split carved out of the training fold, not on the fold used to report the CV score. Early stopping is a form of regularization, so the data that picks the stopping point should not also produce the final score.

These points are natural continuations of your training/validation/test lecture: GD gives new ways to overfit (or regularize), so evaluation discipline matters.
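The choreography above can be sketched as a hand-rolled k-fold loop. This is a minimal illustration in plain NumPy, assuming hypothetical helpers `fit_gd` and `cv_mse` and synthetic data with badly scaled features; centering y on the training fold stands in for an intercept term.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, steps=300):
    n, p = X.shape
    beta = np.zeros(p)  # fresh parameters on every call: no state carried over
    for _ in range(steps):
        beta = beta + (lr / n) * X.T @ (y - X @ beta)
    return beta

def cv_mse(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Scaling statistics come from the training fold only.
        mu = X[train_idx].mean(axis=0)
        sigma = X[train_idx].std(axis=0)
        X_tr = (X[train_idx] - mu) / sigma
        X_va = (X[val_idx] - mu) / sigma   # reuse training statistics
        y_mu = y[train_idx].mean()         # centering y replaces an intercept
        beta = fit_gd(X_tr, y[train_idx] - y_mu)
        pred = X_va @ beta + y_mu
        scores.append(np.mean((y[val_idx] - pred) ** 2))
    return np.array(scores)

rng = np.random.default_rng(5)
# Features with wildly different scales and a nonzero mean.
X = rng.normal(size=(250, 3)) * np.array([1.0, 50.0, 0.01]) + 5.0
y = X @ np.array([2.0, 0.1, 30.0]) + 0.5 * rng.normal(size=250)

scores = cv_mse(X, y)
print("per-fold validation MSE:", scores)
```

The key design choice is that everything learned from data (mean, standard deviation, β) is computed inside the loop from the training indices alone.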

Practical troubleshooting — what to do when training misbehaves

Loss diverges quickly → lower α by 10×, re-scale features.
Loss decreases but very slowly → increase α, add momentum, or use Adam.
Weird coefficients → check for unscaled features, outliers, or multicollinearity (variance inflation).
Validation loss rises while training loss keeps falling → you've overfit. Try early stopping, more data, or stronger regularization.
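Early stopping with a patience counter is one common way to act on a rising validation curve. The sketch below is an illustration under assumptions: the function name `gd_early_stopping`, the `patience` and `tol` parameters, and the synthetic train/validation split are all made up for this example.

```python
import numpy as np

def gd_early_stopping(X_tr, y_tr, X_val, y_val, lr=0.05, max_epochs=1000,
                      patience=10, tol=1e-8):
    """Batch GD that keeps the beta with the best validation loss."""
    n, p = X_tr.shape
    beta = np.zeros(p)
    best_beta, best_val, bad_epochs = beta.copy(), np.inf, 0
    history = []
    for epoch in range(max_epochs):
        beta = beta + (lr / n) * X_tr.T @ (y_tr - X_tr @ beta)
        val_loss = np.mean((y_val - X_val @ beta) ** 2) / 2
        history.append(val_loss)
        if val_loss < best_val - tol:
            # Meaningful improvement: remember this beta, reset the counter.
            best_beta, best_val, bad_epochs = beta.copy(), val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_beta, best_val, history

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 0.5]) + 0.3 * rng.normal(size=120)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

best_beta, best_val, history = gd_early_stopping(X_tr, y_tr, X_val, y_val)
print("epochs run:", len(history), "best validation loss:", best_val)
```

Returning the best β seen so far, instead of the last one, matters: by the time patience runs out, the current β is already past the sweet spot.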

Final takeaway (bite-sized and slightly melodramatic)

Gradient Descent is an iterative alternative to the OLS closed-form. Use it when data is huge, streaming, or when you want incremental updates.
Always scale features and avoid leakage: fit scalers on training folds only, and reset optimizer state for each fold when cross-validating.
Use early stopping as a validation-driven regularizer; monitor validation loss as in our previous lesson on train/validation/test splits.

"Matrix inverses look nice on a whiteboard. In the real world, patience, a good learning rate, and proper scaling win the race."

Go forth: implement a mini-batch GD for a real dataset (house prices, ad clicks, or synthetic data), instrument training curves, and watch how early stopping and learning rates change your life. If your loss curve looks like a cliff or a skateboard ramp, you did something interesting — probably not the good kind.
