© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification
Regression I: Linear Models

Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.

Gradient Descent for OLS

Gradient Descent: Chill but Precise
intermediate
humorous
machine learning
gpt-5-mini
Gradient Descent for OLS — The Slow-Cooked Exactness (but Faster)

"You already saw the exact recipe in class: the OLS closed-form. Now meet the sous-chef who actually cooks when the kitchen is enormous."

You learned the Ordinary Least Squares closed-form solution earlier (β = (XᵀX)⁻¹Xᵀy). You also learned the assumptions and diagnostics that tell you whether that solution is trustworthy. Now we’ll pick up where that left off and ask: when does the closed-form solution become impractical, and how do we use Gradient Descent (GD) to minimize the same Ordinary Least Squares (OLS) loss? We’ll cover the math, the variants (batch / stochastic / mini-batch), practical tips (scaling, learning rates, early stopping), interactions with cross-validation and leakage, and a few battle-tested heuristics.


Quick reminder: objective we minimize

We want to minimize the Residual Sum of Squares (RSS). Dividing by n (which gives the mean squared error) and adding a factor of 1/2 leaves the minimizer unchanged and makes the gradient cleaner:

J(β) = (1 / (2n)) ∥y − Xβ∥²

  • n = number of samples
  • X is n×p, β is p×1

The closed-form β* solves ∇J(β)=0 → β* = (XᵀX)⁻¹Xᵀy (assuming invertibility). But for large n or p, or streaming data, that matrix inverse is expensive or unstable. Enter Gradient Descent.
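To make the objective and the closed-form baseline concrete, here is a minimal NumPy sketch. The function names (`ols_loss`, `ols_closed_form`) and the synthetic data are assumptions made for illustration; they are not part of the lesson. It solves the normal equations with `np.linalg.solve` rather than forming an explicit inverse.

```python
import numpy as np

def ols_loss(X, y, beta):
    """J(beta) = (1 / (2n)) * ||y - X beta||^2."""
    residual = y - X @ beta
    return residual @ residual / (2 * len(y))

def ols_closed_form(X, y):
    """Solve the normal equations X^T X beta = X^T y (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical synthetic data: 200 samples, 3 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=200)

beta_star = ols_closed_form(X, y)
print(beta_star)                  # should land close to true_beta
print(ols_loss(X, y, beta_star))  # the smallest loss any beta can reach on this data
```

This `beta_star` is the target every gradient descent variant in this lesson is trying to approach.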


Derive the gradient (fast napkin math)

Compute the gradient with respect to β:

∇J(β) = −(1/n) Xᵀ(y − Xβ)

So the classic batch gradient descent update is:

β ← β − α ∇J(β) = β + (α / n) Xᵀ(y − Xβ)

Where α is the learning rate. The 1/2 factor in J(β) cancels the 2 produced by differentiating the squared norm, which is why no stray constant appears in the gradient or the update.
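A quick way to trust napkin math is to compare the analytic gradient with a finite-difference estimate. The sketch below does that on random data and then applies a single batch update. The helper names, the data, and the value α = 0.1 are assumptions chosen for the demo.

```python
import numpy as np

def loss(X, y, beta):
    r = y - X @ beta
    return r @ r / (2 * len(y))

def gradient(X, y, beta):
    """Analytic gradient: -(1/n) X^T (y - X beta)."""
    return -(X.T @ (y - X @ beta)) / len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
beta = rng.normal(size=4)

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (loss(X, y, beta + eps * e) - loss(X, y, beta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
analytic = gradient(X, y, beta)
max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)  # should be tiny if the formula is right

# One batch gradient descent step with an assumed learning rate.
alpha = 0.1
beta_next = beta - alpha * analytic
print(loss(X, y, beta), loss(X, y, beta_next))
```

If the two gradients disagree by more than rounding noise, the bug is usually a sign error or a missing 1/n.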


Why use Gradient Descent for OLS? (When closed-form chokes)

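  • Large-scale data: forming XᵀX costs O(np²) and solving the normal equations costs O(p³), which is prohibitive when p ~ 100k. A batch GD step costs only O(np).
  • Streaming / online learning: samples arrive over time; you can update incrementally with SGD.
  • Memory constraints: GD needs only matrix-vector products, so you never store the p×p matrix XᵀX, and mini-batches let you keep only part of X in memory.
  • Numerical stability: explicitly inverting XᵀX is fragile when features are nearly collinear, because XᵀX squares the condition number of X. Iterative methods avoid the explicit inverse, although GD itself converges slowly on badly conditioned problems, so scaling and regularization still matter.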

But remember: GD doesn't magically validate OLS assumptions. You still run diagnostics (residuals, heteroskedasticity tests, influential points) as before.
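The streaming case from the list above can be sketched as a loop that sees one sample at a time and keeps only β in memory. Everything here is illustrative: `synthetic_stream` is a hypothetical data source, and the decaying step-size schedule (`lr0`, `decay`) is one common choice rather than the lesson's prescription.

```python
import numpy as np

def sgd_stream(stream, p, lr0=0.05, decay=0.001):
    """One SGD update per (x, y) pair; only beta is kept in memory."""
    beta = np.zeros(p)
    for t, (x, y) in enumerate(stream):
        lr = lr0 / (1.0 + decay * t)     # slowly decaying step size (assumed schedule)
        beta += lr * (y - x @ beta) * x  # single-sample version of the batch update
    return beta

def synthetic_stream(n, true_beta, noise=0.1, seed=0):
    """Hypothetical data source that yields one sample at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        x = rng.normal(size=len(true_beta))
        yield x, x @ true_beta + noise * rng.normal()

true_beta = np.array([2.0, -1.0, 0.5])
beta_hat = sgd_stream(synthetic_stream(5000, true_beta), p=3)
print(beta_hat)  # should drift toward true_beta as samples arrive
```

Note that the full design matrix never exists here, which is exactly the situation where the closed-form solution is unavailable.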


Variants: Batch vs Stochastic vs Mini-batch (the family reunion)

Method | Update uses | Pros | Cons
Batch GD | All n samples | Smooth descent, stable gradients | Expensive per step, slow for big n
Stochastic GD (SGD) | One sample at a time | Cheap updates, quick progress, online | Noisy, needs lower learning rates, bumpy loss curve
Mini-batch GD | b samples (e.g., 32–1024) | Best practical tradeoff: vectorized and parallelizable | Need to tune batch size

Practical default: mini-batch with batch sizes tuned to hardware (e.g., 256) unless you're in a streaming/online setting (then use SGD).
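The tradeoff between the three variants comes down to gradient noise. The sketch below, on assumed synthetic data, checks two things: averaging the gradients of equal-sized mini-batches that partition the data recovers the full-batch gradient, and the scatter of mini-batch gradients around the full gradient shrinks as the batch grows. The helper names are hypothetical.

```python
import numpy as np

def gradient(X, y, beta):
    return -(X.T @ (y - X @ beta)) / len(y)

rng = np.random.default_rng(2)
n, p = 1000, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
beta = np.zeros(p)
full = gradient(X, y, beta)  # batch GD direction

def batch_gradients(batch_size):
    """Gradients of all mini-batches in one shuffled pass over the data."""
    order = rng.permutation(n)
    return np.array([
        gradient(X[order[i:i + batch_size]], y[order[i:i + batch_size]], beta)
        for i in range(0, n, batch_size)
    ])

def noise(batch_size):
    """Mean squared distance between mini-batch gradients and the full gradient."""
    g = batch_gradients(batch_size)
    return np.mean(np.sum((g - full) ** 2, axis=1))

mean_of_batches = batch_gradients(100).mean(axis=0)
noise_1, noise_10, noise_100 = noise(1), noise(10), noise(100)
print(noise_1, noise_10, noise_100)  # batch size 1 is plain SGD
```

Batch size 1 corresponds to the SGD row of the table and batch size n to the batch GD row, where the noise is zero by construction.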


Pseudocode — mini-batch GD for OLS

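# mini-batch GD for OLS in NumPy (X: n×p array, y: length-n array)
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=256, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for epoch in range(max_epochs):
        order = rng.permutation(n)  # reshuffle the training data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            grad = -(X_b.T @ (y_b - X_b @ beta)) / len(idx)
            beta -= lr * grad
        # optionally compute val_loss here and early-stop
    return beta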

Add momentum or Adam if you want faster convergence, but keep in mind that adaptive optimizers rescale the step for each coordinate, so the loss curve no longer reflects a single learning rate and diagnosing training problems gets trickier.


Hyperparameters & heuristics (the stuff your textbook forgot to dramatize)

  • Learning rate (α): The single most important knob. Too big → divergence. Too small → glacial progress. Start with 0.01 or 0.001 and search over a log-spaced grid.
  • Feature scaling: Essential in practice. GD still converges on unscaled features in principle, but features on very different scales make the loss surface badly conditioned and force a tiny α. Standardize each feature (zero mean, unit variance) within the training fold. If you standardize on the whole dataset before cross-validation, you leak info. (Yes, we remembered the last lecture on leakage.)
  • Batch size: Small batches add gradient noise, which can act as a mild regularizer. Big batches are more stable and hardware-friendly.
  • Initialization: Zero or small random values both work for linear models, because the OLS loss is convex.
  • Stopping criteria: max epochs, tolerance on loss change, or early stopping based on validation loss.
  • Momentum / Adam: Speeds up convergence. For linear regression, classical SGD with a tuned learning rate and maybe momentum often suffices.
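For the OLS loss, "too big" has a precise meaning: batch GD diverges once α exceeds 2 divided by the largest eigenvalue of XᵀX/n. The sketch below, on assumed synthetic data with two features on very different scales, shows how standardizing shrinks the condition number and how a step just past the threshold blows up. All names and constants are illustrative.

```python
import numpy as np

def loss(X, y, beta):
    r = y - X @ beta
    return r @ r / (2 * len(y))

def batch_gd(X, y, lr, steps):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta -= lr * (-(X.T @ (y - X @ beta)) / len(y))
    return beta

def curvature(X):
    """Largest eigenvalue and condition number of X^T X / n."""
    eig = np.linalg.eigvalsh(X.T @ X / len(X))  # ascending order
    return eig[-1], eig[-1] / eig[0]

rng = np.random.default_rng(3)
n = 500
X_raw = rng.normal(size=(n, 2)) * np.array([1.0, 100.0])  # very different scales
y = X_raw @ np.array([3.0, 0.02]) + 0.1 * rng.normal(size=n)

# Standardize features and center the target (no intercept term in this sketch).
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_c = y - y.mean()

L_raw, cond_raw = curvature(X_raw)
L_std, cond_std = curvature(X_std)
print(cond_raw, cond_std)

loss_start = loss(X_std, y_c, np.zeros(2))
loss_safe = loss(X_std, y_c, batch_gd(X_std, y_c, lr=1.0 / L_std, steps=100))
loss_unstable = loss(X_std, y_c, batch_gd(X_std, y_c, lr=2.1 / L_std, steps=100))
print(loss_start, loss_safe, loss_unstable)
```

On raw features the same threshold is dominated by the large-scale feature, so any stable α is far too small for the other coordinate. That is the practical reason scaling comes before learning-rate tuning.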

Regularization & gradients — Ridge example

If you add L2 regularization (Ridge), the objective becomes:

J(β) = (1 / (2n)) ∥y − Xβ∥² + (λ / 2) ∥β∥²

Gradient: ∇J(β) = −(1/n) Xᵀ(y − Xβ) + λβ

Update: β ← β − α (∇J(β))

Note: Ridge also has a closed-form (β = (XᵀX + nλI)⁻¹Xᵀy). If you want to regularize to deal with collinearity, ridge is often a better solution than relying on GD noise.
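As a sanity check on the ridge gradient and the nλ factor in the closed form, the sketch below runs batch GD on the regularized objective and compares the result with the closed-form solution. The data (including a nearly collinear pair of features), λ = 0.1, the learning rate, and the step count are assumptions for the demo.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """beta = (X^T X + n * lam * I)^(-1) X^T y, solved without an explicit inverse."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def ridge_gd(X, y, lam, lr=0.1, steps=2000):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        grad = -(X.T @ (y - X @ beta)) / n + lam * beta  # OLS gradient plus lam * beta
        beta -= lr * grad
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=300)  # nearly collinear pair
y = X @ np.array([1.0, -2.0, 0.5, 1.0, 1.0]) + 0.1 * rng.normal(size=300)

lam = 0.1
beta_gd = ridge_gd(X, y, lam)
beta_cf = ridge_closed_form(X, y, lam)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.max(np.abs(beta_gd - beta_cf)))                 # GD vs closed form
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_cf)) # ridge shrinks the coefficients
```

The λβ term also raises the smallest curvature of the loss from nearly zero to about λ, which is why GD converges in a reasonable number of steps here despite the collinear pair.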


Cross-validation, leakage, and early stopping — the correct choreography

When you put GD inside cross-validation or training/validation loops, pay attention to these traps:

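  1. Scaling per fold: Fit scalers (mean/std) on the training fold only, then apply them to the validation/test fold. Otherwise statistics of the held-out data leak into training and the validation score becomes optimistic.
  2. Reset optimization state per fold: Reinitialize β and optimizer state for each CV fold. Don't carry weights, momentum, or moment estimates across folds: they were fitted on data that includes the current held-out fold, which is data leakage in disguise.
  3. Early stopping uses validation loss: Carve a validation split out of the training fold and monitor the loss there. Early stopping is a form of regularization, so the data that picks the stopping epoch should not also be the data that reports the final score.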

These points are natural continuations of your training/validation/test lecture: GD gives new ways to overfit (or regularize), so evaluation discipline matters.
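The three rules above can be sketched in plain NumPy as follows. This is one possible arrangement under stated assumptions, not the course's reference implementation: `fit_gd_early_stop` and `cross_validate` are hypothetical helpers, the 80/20 inner split and the `patience` value are arbitrary, and the intercept is handled by using the training mean of y (valid because the training features are standardized to zero mean).

```python
import numpy as np

def fit_gd_early_stop(X_tr, y_tr, X_val, y_val, lr=0.05, batch_size=32,
                      max_epochs=200, patience=10, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X_tr.shape[1])  # fresh start: nothing carried over from other folds
    b0 = y_tr.mean()                # intercept for zero-mean training features
    best_val, best_beta, bad_epochs = np.inf, beta.copy(), 0
    for _ in range(max_epochs):
        order = rng.permutation(len(y_tr))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            resid = y_tr[idx] - (X_tr[idx] @ beta + b0)
            beta += lr * (X_tr[idx].T @ resid) / len(idx)
        val = np.mean((y_val - (X_val @ beta + b0)) ** 2)
        if val < best_val - 1e-8:
            best_val, best_beta, bad_epochs = val, beta.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation loss stopped improving
                break
    return best_beta, b0

def cross_validate(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Inner split of the training fold, used only to pick the stopping epoch.
        cut = int(0.8 * len(train_idx))
        fit_idx, stop_idx = train_idx[:cut], train_idx[cut:]
        # Scaler statistics come from the fitting portion only.
        mean = X[fit_idx].mean(axis=0)
        std = X[fit_idx].std(axis=0)
        std[std == 0] = 1.0
        beta, b0 = fit_gd_early_stop((X[fit_idx] - mean) / std, y[fit_idx],
                                     (X[stop_idx] - mean) / std, y[stop_idx],
                                     seed=seed + i)
        pred = ((X[test_idx] - mean) / std) @ beta + b0
        scores.append(np.mean((y[test_idx] - pred) ** 2))  # held-out fold scores only
    return np.array(scores)

# Hypothetical data with features on different scales and offsets.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([5.0, -3.0, 50.0])
y = X @ np.array([2.0, 0.3, -0.01]) + 4.0 + 0.5 * rng.normal(size=400)

scores = cross_validate(X, y)
print(scores, scores.mean())
```

The held-out fold is touched exactly once per iteration, for scoring. Scaling statistics, the stopping epoch, and β all come from the other folds.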


Practical troubleshooting — what to do when training misbehaves

  • Loss diverges quickly → lower α by 10×, re-scale features.
  • Loss decreases but very slowly → increase α, add momentum, or use Adam.
  • Weird coefficients → check for unscaled features, outliers, or multicollinearity (variance inflation).
  • Validation loss rises while training loss keeps falling → you are overfitting; try early stopping, more data, or stronger regularization.
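The first rule of thumb above (divergence → cut α by 10×) is easy to automate for batch GD, where a rising training loss is a reliable sign that the step is too large. The helper below is a hypothetical sketch; the retry limit and the small tolerance that guards against rounding noise are assumptions.

```python
import numpy as np

def loss(X, y, beta):
    r = y - X @ beta
    return r @ r / (2 * len(y))

def gd_with_backoff(X, y, lr, steps=200, max_retries=6):
    """Batch GD that restarts with lr / 10 whenever the loss rises or stops being finite."""
    for attempt in range(max_retries + 1):
        beta = np.zeros(X.shape[1])
        prev = loss(X, y, beta)
        diverged = False
        for step in range(steps):
            beta = beta + (lr / len(y)) * (X.T @ (y - X @ beta))
            cur = loss(X, y, beta)
            # Tolerance keeps rounding noise near convergence from counting as divergence.
            if not np.isfinite(cur) or cur > prev * (1 + 1e-6) + 1e-12:
                diverged = True
                break
            prev = cur
        if not diverged:
            return beta, lr
        lr /= 10.0  # the "lower alpha by 10x" rule
    raise RuntimeError("no stable learning rate found; check feature scaling")

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.normal(size=300)

beta_hat, final_lr = gd_with_backoff(X, y, lr=10.0)  # deliberately too large
print(final_lr, beta_hat)
```

This check is only appropriate for full-batch updates. With mini-batches the loss is noisy from step to step, so a smoothed or per-epoch loss would be a safer trigger.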

Final takeaway (bite-sized and slightly melodramatic)

  • Gradient Descent is an iterative alternative to the OLS closed-form. Use it when data is huge, streaming, or when you want incremental updates.
  • Always scale features and avoid leakage: when cross-validating, fit scalers on the training folds only and reset β and optimizer state for every fold.
  • Use early stopping as a validation-driven regularizer; monitor validation loss as in our previous lesson on train/validation/test splits.

"Matrix inverses look nice on a whiteboard. In the real world, patience, a good learning rate, and proper scaling win the race."

Go forth: implement a mini-batch GD for a real dataset (house prices, ad clicks, or synthetic data), instrument training curves, and watch how early stopping and learning rates change your life. If your loss curve looks like a cliff or a skateboard ramp, you did something interesting — probably not the good kind.
