Regression II: Regularization and Advanced Techniques
Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.
Ridge Regression Fundamentals — Shrink Those Coefficients (Gently)
"Remember when we trusted ordinary least squares like it was our childhood blanket? Cute. Ridge is the grown-up version: same blanket, but with duct tape and a spreadsheet."
You already know how to fit a linear model, interpret coefficients, and wrestle with outliers. You've seen how ordinary least squares (OLS) gives us unbiased estimates when assumptions hold, but also how coefficients explode when features are correlated or when we overfit. Welcome to Ridge Regression: the polite way of telling large coefficients to calm down.
What Ridge Regression Actually Does (Quick, Beautiful Intuition)
At its core, Ridge regression adds a penalty to the OLS loss that punishes large coefficients. Instead of minimizing just the residual sum of squares (RSS), Ridge minimizes:
Loss = RSS + alpha * sum(beta_j^2)
More formally:
argmin_beta ||y - X beta||^2_2 + alpha ||beta||^2_2
- alpha (sometimes called lambda) controls the strength of the penalty.
- The penalty is the L2 norm of the coefficient vector: it shrinks coefficients toward zero but does not set them exactly to zero.
Geometric image: OLS sits at the center of the elliptical RSS contours in coefficient space. Ridge adds the constraint "stay inside a ball around the origin whose radius shrinks as alpha grows," so the solution lands where the smallest reachable RSS contour touches that ball: a more conservative point with smaller coefficients.
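To make the penalty concrete, here is a small sketch on synthetic data (the alpha values and coefficient vector are illustrative assumptions) showing that the penalized quantity, sum(beta_j^2), shrinks as alpha grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

norms = []
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.sum(model.coef_ ** 2))  # the penalized quantity: sum(beta_j^2)

# larger alpha -> smaller squared L2 norm of the coefficients
print(norms)
```

The coefficients never hit exactly zero; they just keep getting quieter as the penalty turns up.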
Why we need Ridge — a reminder without repeating the intro
You have seen problems in earlier lessons:
- Coefficients exploding when features are correlated (multicollinearity).
- High variance when p is large relative to n, or when features are noisy.
Ridge directly targets those issues by shrinking the coefficient vector toward the origin, trading a bit of bias for lower variance. This is a textbook bias–variance tradeoff win: better predictive performance out-of-sample.
Two quick lenses: Algebra and Bayesian
Algebraic neatness
OLS closed form is beta_hat = (X^T X)^{-1} X^T y. But when X^T X is nearly singular (multicollinearity), that inverse is unstable.
Ridge fixes it:
beta_ridge = (X^T X + alpha I)^{-1} X^T y
Adding alpha I ensures the matrix is invertible and well-conditioned: no wild swings when tiny changes occur in the data.
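A sketch of that closed form in numpy, assuming centered data and no intercept (the data here is synthetic and the variable names are mine), checked against sklearn's `Ridge(fit_intercept=False)`:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
alpha = 2.0

# closed form: beta_ridge = (X^T X + alpha I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# sklearn agrees when no intercept is fit
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))
```

Note the use of `np.linalg.solve` rather than explicitly inverting the matrix; solving the linear system is the numerically saner choice.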
Bayesian interpretation (deliciously short)
If you place a zero-mean Gaussian prior on coefficients with variance proportional to 1/alpha, then the MAP estimate under a Gaussian noise model is exactly the Ridge solution. So Ridge = OLS + a prior saying "I believe coefficients are small unless data screams otherwise." Subtle, classy skepticism.
Practical things you must do (or suffer)
- Standardize features first. Ridge is sensitive to scale. Without standardization, features with bigger magnitudes get punished unfairly.
- Alpha selection via cross-validation. Use k-fold CV to pick alpha that minimizes validation error. No guessing games.
- Ridge does not do variable selection. Unlike Lasso (L1), Ridge shrinks coefficients but keeps them nonzero. So for interpretability and selection, combine Ridge with other methods.
- Interpret coefficients carefully. Shrinkage changes magnitude; you cannot read coefficients the same as unbiased OLS coefficients.
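The checklist above can be wired together in one sketch: a `Pipeline` so that scaling is learned only on training folds (no leakage), plus `RidgeCV` for the alpha search. The data and the alpha grid here are assumptions for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8)) * rng.uniform(0.1, 100.0, size=8)  # wildly different scales
y = X[:, 0] * 0.01 + X[:, 1] * 5.0 + rng.normal(size=200)

# StandardScaler inside the pipeline is refit on each CV split
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5),
)
model.fit(X, y)
print(model.named_steps['ridgecv'].alpha_)
```

Scaling outside the pipeline and then cross-validating would leak test-fold statistics into training; the pipeline makes that mistake impossible.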
How Ridge behaves as alpha shifts
- alpha -> 0: Ridge -> OLS. No shrinkage.
- alpha -> infinity: coefficients -> 0 (the model predicts the mean of y if it has an intercept).
Think of alpha as a thermostat: too low and the room runs wild; too high and everything freezes to zero.
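The thermostat in action: a sketch tracing coefficient magnitudes across a range of alphas (synthetic data; the grid values are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
y = X @ np.array([4.0, -3.0, 2.0, 1.0]) + rng.normal(size=80)

alphas = [1e-4, 1e-2, 1.0, 1e2, 1e4]
paths = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]

# near alpha -> 0 the coefficients approach OLS; near alpha -> infinity they approach 0
max_abs = [np.max(np.abs(c)) for c in paths]
print(max_abs)
```

Plotting `paths` against `alphas` on a log scale gives the classic ridge coefficient-path picture.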
A tiny example in words (multicollinearity drama)
Imagine two features, x1 and x2, that are 99% correlated. OLS will produce two large, opposite-signed coefficients that mostly cancel: in-sample predictions look fine, but the individual estimates are wildly unstable. Ridge says: "Nope, both of you shrink." The coefficients become smaller and more balanced, and predictions are much less noisy when new data arrives.
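That drama, sketched in code: two nearly identical features, many resampled datasets, and a comparison of how much each method's coefficients wobble across resamples (all numbers illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(4)

def coef_spread(model, n_reps=200):
    coefs = []
    for _ in range(n_reps):
        x1 = rng.normal(size=100)
        x2 = x1 + rng.normal(scale=0.05, size=100)  # ~99% correlated with x1
        X = np.column_stack([x1, x2])
        y = x1 + x2 + rng.normal(size=100)          # true effect is shared
        coefs.append(model.fit(X, y).coef_)
    return np.array(coefs).std(axis=0)              # per-coefficient std across resamples

ols_sd = coef_spread(LinearRegression())
ridge_sd = coef_spread(Ridge(alpha=10.0))

# ridge coefficients vary far less from one resample to the next
print(ols_sd, ridge_sd)
```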
SVD perspective (for the brave and curious)
If X = U Sigma V^T (SVD), the Ridge solution scales each singular direction by sigma_i / (sigma_i^2 + alpha), versus 1 / sigma_i for OLS. Small singular values (directions carrying little information, mostly noise) get crushed hardest. Ridge is a soft filter that damps noisy directions while preserving signal.
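A sketch verifying the SVD view numerically (synthetic data, no intercept): rebuild the Ridge solution from the filter factors sigma_i / (sigma_i^2 + alpha) and compare with sklearn.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
alpha = 3.0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# each singular direction of X is scaled by sigma_i / (sigma_i^2 + alpha)
beta_svd = Vt.T @ ((s / (s**2 + alpha)) * (U.T @ y))

beta_ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_svd, beta_ridge))
```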
Quick comparison table
| Method | Penalty | Variable selection | Use when... |
|---|---|---|---|
| OLS | none | no | features few and clean, no multicollinearity |
| Ridge | L2 | no | multicollinearity, lots of small noisy predictors |
| Lasso | L1 | yes | you want sparsity/selection |
Pseudocode / sklearn snippet
# assume X is standardized and y centered (or use StandardScaler)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
preds = model.predict(X_test)
To tune alpha:
import numpy as np
from sklearn.model_selection import GridSearchCV
alphas = np.logspace(-4, 4, 50)  # span several orders of magnitude
grid = GridSearchCV(Ridge(), {'alpha': alphas}, cv=5)
grid.fit(X, y)
best_alpha = grid.best_params_['alpha']
Common questions you should ask (and answer)
Why not always use Ridge? Because if you need interpretability via zeros, Ridge won't give it; if features are truly few and assumptions hold, OLS is unbiased and fine. Also, if sparsity is real, Lasso might be better.
Do we always standardize? Yes, unless your features already live on the same scale and you have a very specific reason not to.
Can Ridge help with outliers? Not really; Ridge deals with coefficient stability. For outliers, you already learned Huber and Quantile methods.
Closing: Key takeaways (memorize these like a ritual)
- Ridge = OLS + L2 penalty. Shrinks coefficients, reduces variance, combats multicollinearity.
- Scale your features. Always. Please. Do it.
- Tune alpha with CV. There is no universal alpha that works for everything.
- No sparsity. Ridge keeps variables in play; use Lasso or elastic net if you want zeros.
- SVD & Bayesian views help intuition. Ridge filters noisy directions; it assumes small coefficients are more likely.
Final thought: When your model looks like a nervous overfit mess, Ridge is a calm, rational friend who hands your coefficients a latte and tells them to breathe.
Go try it on your last project: standardize, grid-search alpha, compare validation curves, and watch variance shrink. Next lesson: Elastic Net — the diplomatic compromise between Ridge's moderation and Lasso's ruthlessness.