Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Gradient Descent for OLS — The Slow-Cooked Exactness (but Faster)
"You already saw the exact recipe in class: the OLS closed-form. Now meet the sous-chef who actually cooks when the kitchen is enormous."
You learned the Ordinary Least Squares closed-form solution earlier (β = (XᵀX)⁻¹Xᵀy). You also learned the assumptions and diagnostics that tell you whether that solution is trustworthy. Now we’ll pick up where that left off and ask: when does the closed-form solution become impractical, and how do we use Gradient Descent (GD) to minimize the same Ordinary Least Squares (OLS) loss? We’ll cover the math, the variants (batch / stochastic / mini-batch), practical tips (scaling, learning rates, early stopping), interactions with cross-validation and leakage, and a few battle-tested heuristics.
Quick reminder: objective we minimize
We want to minimize the Residual Sum of Squares (RSS). Dividing it by 2n gives half the mean squared error, which has the same minimizer and makes the gradient prettier:
J(β) = (1 / (2n)) ∥y − Xβ∥²
- n = number of samples
- X is n×p, β is p×1
The closed-form β* solves ∇J(β)=0 → β* = (XᵀX)⁻¹Xᵀy (assuming invertibility). But for large n or p, or streaming data, that matrix inverse is expensive or unstable. Enter Gradient Descent.
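To make the objective and its closed-form minimizer concrete, here is a minimal NumPy sketch. The synthetic data, the coefficient values, and the helper name `ols_loss` are illustrative assumptions, not part of the lesson's dataset.

```python
import numpy as np

def ols_loss(X, y, beta):
    """J(beta) = (1 / (2n)) * ||y - X beta||^2."""
    residual = y - X @ beta
    return residual @ residual / (2 * len(y))

# Synthetic data (hypothetical): 200 samples, 3 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=200)

# Solve the normal equations X^T X beta = X^T y without forming an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq solves the same least-squares problem with an SVD-based routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(ols_loss(X, y, beta_normal))
```

Both routes should agree on a small, well-conditioned problem like this one; the differences only start to matter when X is large or nearly collinear.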
Derive the gradient (fast napkin math)
Compute the gradient with respect to β:
∇J(β) = −(1/n) Xᵀ(y − Xβ)
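For anyone who wants the napkin unfolded, the intermediate steps look like this, using only the definitions of J, X, y, and β given above:

```latex
\begin{aligned}
J(\beta) &= \frac{1}{2n}\,(y - X\beta)^\top (y - X\beta)
          = \frac{1}{2n}\left( y^\top y - 2\,\beta^\top X^\top y + \beta^\top X^\top X \beta \right) \\
\nabla J(\beta) &= \frac{1}{2n}\left( -2\,X^\top y + 2\,X^\top X \beta \right)
                 = -\frac{1}{n}\, X^\top (y - X\beta)
\end{aligned}
```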
So the classic batch gradient descent update is:
β ← β − α ∇J(β) = β + (α / n) Xᵀ(y − Xβ)
Here α is the learning rate. The 1/2 factor in J cancels the 2 produced by differentiating the square, which is why no stray constant shows up in the update.
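As a sanity check on the update rule, the sketch below runs plain batch gradient descent and compares the result with a least-squares solver. The function names `gradient` and `batch_gd`, the learning rate, the step count, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient(X, y, beta):
    """Gradient of J(beta) = (1 / (2n)) * ||y - X beta||^2."""
    return -(X.T @ (y - X @ beta)) / len(y)

def batch_gd(X, y, lr=0.1, n_steps=500):
    """Plain batch gradient descent starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        beta = beta - lr * gradient(X, y, beta)
    return beta

# Synthetic, well-conditioned data (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -2.0, 0.0, 3.0]) + 0.1 * rng.normal(size=300)

beta_gd = batch_gd(X, y)
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

# Largest coefficient-wise gap between the iterative and direct solutions.
print(np.max(np.abs(beta_gd - beta_exact)))
```

On data like this the two answers should match to many decimal places. On badly scaled or collinear features the same number of steps would leave a visible gap.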
Why use Gradient Descent for OLS? (When closed-form chokes)
- Large-scale data: forming XᵀX costs O(np²) and solving the resulting p×p system costs O(p³), which is not great when p ~ 100k.
- Streaming / online learning: samples arrive over time; you can update incrementally with SGD.
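- Memory constraints: XᵀX needs O(p²) memory, whereas a mini-batch update only needs one batch and the p-vector β in memory at a time.
- Numerical stability: forming XᵀX squares the condition number of X, so an explicit inverse can be unreliable on nearly collinear data. An iterative method with regularization avoids the explicit inverse, although it also converges slowly when the problem is ill-conditioned.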
But remember: GD doesn't magically validate OLS assumptions. You still run diagnostics (residuals, heteroskedasticity tests, influential points) as before.
Variants: Batch vs Stochastic vs Mini-batch (the family reunion)
| Method | Update uses | Pros | Cons |
|---|---|---|---|
| Batch GD | All n samples | Smooth descent, stable gradients | Expensive per step, slow for big n |
| Stochastic GD (SGD) | One sample at a time | Cheap updates, quick progress, online | Noisy, needs lower learning rates, bumpy loss curve |
| Mini-batch GD | b samples (e.g., 32–1024) | Best practical tradeoff — vectorized and parallelizable | Need to tune batch size |
Practical default: mini-batch with batch sizes tuned to hardware (e.g., 256) unless you're in a streaming/online setting (then use SGD).
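For the streaming case, a one-sample-at-a-time update might look like the sketch below. The simulated stream, the helper name `sgd_update`, and the decaying step-size schedule are assumptions for illustration; a real stream would replace the random draws.

```python
import numpy as np

def sgd_update(beta, x_i, y_i, lr):
    """One SGD step on a single sample (x_i, y_i)."""
    error = y_i - x_i @ beta
    return beta + lr * error * x_i

rng = np.random.default_rng(2)
true_beta = np.array([1.0, -0.5])
beta = np.zeros(2)

# Simulated stream (hypothetical): each sample is seen once and then discarded.
for t in range(20000):
    x_t = rng.normal(size=2)
    y_t = x_t @ true_beta + 0.1 * rng.normal()
    lr_t = 0.1 / (1 + 0.001 * t)  # decaying step size to damp the gradient noise
    beta = sgd_update(beta, x_t, y_t, lr_t)

print(beta)
```

Nothing here stores the dataset: memory use stays at one sample plus β, which is the whole appeal in the online setting.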
Pseudocode — mini-batch GD for OLS
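    # mini-batch GD for OLS with NumPy
    import numpy as np

    def minibatch_gd(X, y, lr=0.01, batch_size=256, max_epochs=100, seed=0):
        """Mini-batch gradient descent for OLS; X is (n, p), y is (n,)."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        beta = np.zeros(p)                      # initialize coefficients at zero
        for epoch in range(max_epochs):
            order = rng.permutation(n)          # reshuffle the training data each epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                X_b, y_b = X[idx], y[idx]
                pred = X_b @ beta
                grad = -(1 / len(X_b)) * X_b.T @ (y_b - pred)
                beta = beta - lr * grad
            # optionally compute val_loss here and stop early if it stops improving
        return beta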
Add momentum or Adam if you want faster convergence, but keep in mind that their accumulated or adaptive step sizes make the loss curve harder to read when you are diagnosing a learning-rate problem.
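If you do want momentum, the classical (heavy-ball) variant is a small change to the update. This is a sketch under assumed settings: the function name `gd_momentum`, the momentum coefficient 0.9, the learning rate, and the synthetic data are illustrative choices, not tuned recommendations.

```python
import numpy as np

def gd_momentum(X, y, lr=0.05, momentum=0.9, n_steps=500):
    """Batch gradient descent with heavy-ball (classical) momentum."""
    n, p = X.shape
    beta = np.zeros(p)
    velocity = np.zeros(p)
    for _ in range(n_steps):
        grad = -(X.T @ (y - X @ beta)) / n
        # The velocity is a decaying running sum of past gradient steps.
        velocity = momentum * velocity - lr * grad
        beta = beta + velocity
    return beta

# Synthetic, well-conditioned data (hypothetical).
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=300)

beta_mom = gd_momentum(X, y)
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(beta_mom - beta_exact)))
```

Note that `velocity` is optimizer state: it has to be reset whenever you start training on a new split.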
Hyperparameters & heuristics (the stuff your textbook forgot to dramatize)
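- Learning rate (α): The single most important knob. Too big → divergence. Too small → glacial progress. For batch GD on this quadratic loss, convergence requires α < 2/λ_max, where λ_max is the largest eigenvalue of XᵀX/n. In practice, start with 0.01 or 0.001 on standardized features and search a log-spaced grid (e.g., 1e-4 to 1e-1).
- Feature scaling: Not required for correctness, but practically essential. Features on very different scales make the loss surface ill-conditioned, which forces a tiny learning rate and slow convergence. Standardize each feature (zero mean, unit variance) using statistics from the training fold only. If you standardize on the whole dataset before cross-validation, you leak info. (Yes, we remembered the last lecture on leakage.)
- Batch size: Small batches add gradient noise, which can act as a mild regularizer. Big batches give more stable gradients and are hardware-friendly.
- Initialization: Zero or small random values are both fine for linear models, because the OLS loss is convex and has no bad local minima to get stuck in.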
- Stopping criteria: max epochs, tolerance on loss change, or early stopping based on validation loss.
- Momentum / Adam: Speeds up convergence. For linear regression, classical SGD with a tuned learning rate and maybe momentum often suffices.
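The link between feature scaling and the usable learning rate can be checked numerically. The sketch below assumes two made-up features on very different scales and uses the standard result that batch GD on a quadratic loss with Hessian H = XᵀX/n converges only for α < 2/λ_max(H). The helper names `max_stable_lr` and `condition_number` are hypothetical.

```python
import numpy as np

def max_stable_lr(X):
    """Upper bound 2 / lambda_max for batch GD, with H = X^T X / n."""
    H = X.T @ X / X.shape[0]
    return 2.0 / np.linalg.eigvalsh(H).max()

def condition_number(X):
    """Ratio of the largest to the smallest eigenvalue of H = X^T X / n."""
    eig = np.linalg.eigvalsh(X.T @ X / X.shape[0])
    return eig.max() / eig.min()

rng = np.random.default_rng(3)
# Two hypothetical features on very different scales (say, floor area and room count).
X_raw = np.column_stack([
    rng.normal(150, 50, size=500),
    rng.normal(3, 1, size=500),
])

# Standardize with statistics computed on this (training) data only.
mean, std = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mean) / std

print(condition_number(X_raw), condition_number(X_std))
print(max_stable_lr(X_raw), max_stable_lr(X_std))
```

On the raw features the largest stable step is tiny, so a default like 0.01 would diverge; after standardizing, the same default sits comfortably inside the stable range.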
Regularization & gradients — Ridge example
If you add L2 regularization (Ridge), the objective becomes:
J(β) = (1 / (2n)) ∥y − Xβ∥² + (λ / 2) ∥β∥²
Gradient: ∇J(β) = −(1/n) Xᵀ(y − Xβ) + λβ
Update: β ← β − α (∇J(β))
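Note: Ridge also has a closed-form solution (β = (XᵀX + nλI)⁻¹Xᵀy, where the factor n comes from the 1/n in the loss term). If your goal is to tame collinearity, an explicit ridge penalty is usually a more controlled choice than relying on the implicit regularization of SGD noise.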
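A quick way to confirm that the ridge gradient and the closed form describe the same minimizer is to run both on collinear data. This is a sketch: the function names, the value λ = 0.1, the learning rate, and the nearly duplicated feature are assumptions made for the example.

```python
import numpy as np

def ridge_gd(X, y, lam, lr=0.3, n_steps=2000):
    """Batch GD on J(beta) = (1/(2n)) ||y - X beta||^2 + (lam/2) ||beta||^2."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        grad = -(X.T @ (y - X @ beta)) / n + lam * beta
        beta = beta - lr * grad
    return beta

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + n * lam * I) beta = X^T y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Synthetic data (hypothetical) where the second feature nearly duplicates the first.
rng = np.random.default_rng(5)
x1 = rng.normal(size=400)
x2 = x1 + 0.05 * rng.normal(size=400)
x3 = rng.normal(size=400)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + 0.1 * rng.normal(size=400)

lam = 0.1
beta_gd = ridge_gd(X, y, lam)
beta_cf = ridge_closed_form(X, y, lam)
print(beta_gd, beta_cf)
```

The penalty adds λ to every eigenvalue of XᵀX/n, which is what keeps gradient descent from crawling along the nearly flat collinear direction.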
Cross-validation, leakage, and early stopping — the correct choreography
When you put GD inside cross-validation or training/validation loops, pay attention to these traps:
- Scaling per fold: Fit scalers (mean/std) on the training fold only, then apply them to the validation/test fold. Otherwise statistics of the held-out data leak into training and your CV estimate becomes optimistic.
- Reset optimization state per fold: Reinitialize β and optimizer state for each CV fold. Don't carry momentum or moments across folds — that's data leakage in disguise.
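- Early stopping needs its own validation data: carve a validation split out of the training fold and monitor the loss on that split. Do not use the fold you score on to decide when to stop, or the score is no longer an honest estimate. Early stopping is a form of regularization, so track it on held-out data only.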
These points are natural continuations of your training/validation/test lecture: GD gives new ways to overfit (or regularize), so evaluation discipline matters.
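Put together, the choreography above might look like the following sketch. Everything named here is hypothetical: the helper functions, the 80/20 inner split for early stopping, the patience of 10 epochs, the choice to center y instead of fitting an intercept column, and the synthetic data.

```python
import numpy as np

def standardize(A, mean, std):
    """Apply previously fitted scaler statistics."""
    return (A - mean) / std

def fit_with_early_stopping(X_tr, y_tr, X_val, y_val, lr=0.05, max_epochs=500, patience=10):
    """Batch GD from a fresh beta; keep the beta with the best validation loss."""
    beta = np.zeros(X_tr.shape[1])  # fresh optimizer state on every call
    best_beta, best_loss, bad_epochs = beta.copy(), np.inf, 0
    for _ in range(max_epochs):
        beta = beta + lr * X_tr.T @ (y_tr - X_tr @ beta) / len(y_tr)
        val_loss = np.mean((y_val - X_val @ beta) ** 2) / 2
        if val_loss < best_loss - 1e-10:
            best_beta, best_loss, bad_epochs = beta.copy(), val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_beta

def cross_validate(X, y, k=5, seed=0):
    """K-fold CV returning the test MSE of each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Inner split: the first 20% of the training fold drives early stopping.
        n_val = len(train_idx) // 5
        val_idx, fit_idx = train_idx[:n_val], train_idx[n_val:]
        # Scaler statistics come from the fitting data only.
        mean, std = X[fit_idx].mean(axis=0), X[fit_idx].std(axis=0)
        y_mean = y[fit_idx].mean()  # centering y stands in for an intercept
        beta = fit_with_early_stopping(
            standardize(X[fit_idx], mean, std), y[fit_idx] - y_mean,
            standardize(X[val_idx], mean, std), y[val_idx] - y_mean,
        )
        pred = standardize(X[test_idx], mean, std) @ beta + y_mean
        scores.append(np.mean((y[test_idx] - pred) ** 2))
    return np.array(scores)

# Synthetic data (hypothetical) with features on very different scales.
rng = np.random.default_rng(6)
X = rng.normal(size=(400, 3)) * np.array([100.0, 1.0, 0.01]) + np.array([50.0, 0.0, 1.0])
y = X @ np.array([0.02, 1.5, -80.0]) + 5.0 + 0.5 * rng.normal(size=400)

scores = cross_validate(X, y)
print(scores.mean())
```

The test fold is touched exactly once per fold, for scoring. Scaling, stopping, and initialization are all decided without it.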
Practical troubleshooting — what to do when training misbehaves
- Loss diverges quickly → lower α by 10×, re-scale features.
- Loss decreases but very slowly → increase α, add momentum, or use Adam.
- Weird coefficients → check for unscaled features, outliers, or multicollinearity (variance inflation).
- Validation loss rises while training loss keeps falling → you're overfitting; try early stopping, more data, or stronger regularization.
Final takeaway (bite-sized and slightly melodramatic)
- Gradient Descent is an iterative alternative to the OLS closed-form. Use it when data is huge, streaming, or when you want incremental updates.
- Always scale features and avoid leakage: when cross-validating, fit scalers on the training fold only and reset β and optimizer state for every fold.
- Use early stopping as a validation-driven regularizer; monitor validation loss as in our previous lesson on train/validation/test splits.
"Matrix inverses look nice on a whiteboard. In the real world, patience, a good learning rate, and proper scaling win the race."
Go forth: implement a mini-batch GD for a real dataset (house prices, ad clicks, or synthetic data), instrument training curves, and watch how early stopping and learning rates change your life. If your loss curve looks like a cliff or a skateboard ramp, you did something interesting — probably not the good kind.