Regression II: Regularization and Advanced Techniques
Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.
Cross-Validated Regularization Paths
Cross‑Validated Regularization Paths — How to pick lambdas like a pro (and still have fun)
"You already learned linear regression and how to diagnose it. Now we regularize, we path-trace, and we cross-validate like we're solving a mystery." — Your slightly dramatic TA
Hook: Why a path and why cross-validation?
Remember when we fit OLS and then whispered, "Hmm, multicollinearity…"? Regularization fixes that. But regularization isn't a single magic number — it's a whole path of solutions indexed by the regularization strength (lambda). A cross‑validated regularization path is simply: compute model solutions across many lambdas and evaluate each via cross‑validation to pick the lambda that actually performs best out of sample.
This builds directly on what you already saw in Coordinate Descent Algorithms (we use warm starts and coordinate-wise updates) and Choosing Regularization Strength (we discussed criteria like min MSE and the 1‑SE rule). Here we glue those pieces together into practical, efficient workflows that scale.
Quick recap (two lines)
- From Regression I: make sure predictors are standardized and the intercept handled properly — otherwise the path is meaningless.
- From Coordinate Descent: warm starts (use previous lambda's coefficients as initialization) make computing the whole path cheap.
The anatomy of a cross‑validated path
- Lambda grid — choose a sequence: lambda_max -> lambda_min on a log scale. Typically lambda_max is the smallest value that zeroes out all coefficients for Lasso; lambda_min is a fraction (e.g., 1e-4) of lambda_max. Use ~50–100 values spaced logarithmically.
- Compute path — for each lambda, compute coefficients using coordinate descent (warm starts). Save the model.
- Cross‑validate — for each lambda, compute K‑fold CV error (e.g., RMSE for regression). Aggregate errors across folds.
- Select lambda — pick lambda_min (min CV error) or lambda_1se (largest lambda within 1 standard error of the min). The 1‑SE rule prefers simpler models.
- Refit final model — refit chosen lambda on the full training set (or use the CV‑averaged coefficients if you prefer).
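Step 1 (the lambda grid) is short enough to sketch directly. This is a minimal numpy version, assuming the glmnet-style lasso objective (1/(2n))||y - Xb||^2 + lambda*||b||_1 with standardized X and centered y, under which lambda_max = max_j |x_j^T y| / n is the smallest lambda that zeroes every coefficient. The helper name is ours, not a library function.

```python
import numpy as np

def lasso_lambda_grid(X, y, n_lambdas=100, min_ratio=1e-4):
    """Log-spaced lambda grid from lambda_max down to lambda_min.

    Assumes columns of X are standardized, y is centered, and the
    objective (1/(2n))||y - Xb||^2 + lambda * ||b||_1, so that
    lambda_max = max_j |x_j^T y| / n kills all coefficients.
    """
    n = X.shape[0]
    lambda_max = np.max(np.abs(X.T @ y)) / n
    # Decreasing grid: start where everything is zero, relax toward OLS.
    return np.logspace(np.log10(lambda_max),
                       np.log10(min_ratio * lambda_max), n_lambdas)
```

The decreasing order matters: warm starts work because each lambda's solution is a good initializer for the next, slightly smaller one.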
Pseudocode: Efficient CV regularization path (with warm starts and parallel folds)
```
Input: X, y, model = 'lasso' or 'elasticnet', alpha (elastic net mixing),
       K (folds), L (num lambdas)

Standardize X; center y if intercept included.
Compute lambda_max from the data and model type.
lambdas = logspace(log(lambda_max), log(lambda_min = ratio * lambda_max), L)   # decreasing

For each fold k (in parallel):
    Split into X_k_train, y_k_train, X_k_val, y_k_val
    beta_prev = 0                                  # warm start for this fold
    For lambda in lambdas:
        beta = coordinate_descent(X_k_train, y_k_train, lambda, alpha, init=beta_prev)
        pred = X_k_val @ beta + intercept
        fold_errors[k, lambda] = loss(pred, y_k_val)
        beta_prev = beta                           # warm start for next lambda

Aggregate across folds: mean_error[lambda], se_error[lambda]
lambda_min = argmin over lambda of mean_error[lambda]
lambda_1se = largest lambda with mean_error[lambda] <= mean_error[lambda_min] + se_error[lambda_min]
Refit the final model on the full training set with the chosen lambda.
```
Notes: warm starts per fold are key — they make each subsequent lambda converge in far fewer iterations.
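The pseudocode above translates almost line for line into scikit-learn; here is a sketch using `Lasso(warm_start=True)`, which reuses the previous fit's coefficients as the initializer for the next lambda, exactly the warm-start trick described. The function name and defaults are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_lasso_path(X, y, lambdas, n_folds=5, seed=0):
    """K-fold CV over a decreasing lambda grid, warm-started per fold."""
    errors = np.zeros((n_folds, len(lambdas)))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for k, (tr, va) in enumerate(kf.split(X)):
        model = Lasso(warm_start=True, max_iter=10_000)
        for j, lam in enumerate(lambdas):  # largest lambda first
            model.set_params(alpha=lam)
            model.fit(X[tr], y[tr])       # starts from previous coefficients
            pred = model.predict(X[va])
            errors[k, j] = np.mean((pred - y[va]) ** 2)
    mean_err = errors.mean(axis=0)
    se_err = errors.std(axis=0, ddof=1) / np.sqrt(n_folds)
    j_min = int(np.argmin(mean_err))
    # 1-SE rule: largest lambda within one SE of the minimum mean error.
    within_1se = mean_err <= mean_err[j_min] + se_err[j_min]
    j_1se = int(np.argmax(within_1se))  # grid decreasing, so first True = largest lambda
    return lambdas[j_min], lambdas[j_1se], mean_err, se_err
```

Note that `lambda_1se >= lambda_min` by construction: the 1-SE model is always at least as regularized as the CV-optimal one.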
Visual diagnostics: the CV curve (a must‑see)
Plot mean CV error ± 1 SE vs log(lambda). Add vertical lines at lambda_min and lambda_1se.
Why this plot matters:
- You can visually inspect stability: large error swings mean unstable CV — maybe your folds are unbalanced or the dataset is tiny.
- If the error curve is very flat near the minimum, choose the larger lambda for simplicity (the 1‑SE rule).
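Drawing that curve takes only a few matplotlib calls. A minimal sketch, assuming you already have the per-lambda mean errors and standard errors from CV (the helper name is ours):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

def plot_cv_curve(lambdas, mean_err, se_err, lam_min, lam_1se):
    """Mean CV error with a +/- 1 SE band vs log10(lambda),
    plus vertical lines at the two selection rules."""
    fig, ax = plt.subplots()
    logl = np.log10(lambdas)
    ax.errorbar(logl, mean_err, yerr=se_err, fmt="o-", capsize=2)
    ax.axvline(np.log10(lam_min), linestyle="--", label="lambda_min")
    ax.axvline(np.log10(lam_1se), linestyle=":", label="lambda_1se")
    ax.set_xlabel("log10(lambda)")
    ax.set_ylabel("mean CV error")
    ax.legend()
    return fig, ax
```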
Practical knobs and heuristics (the stuff that saves your compute budget)
- Standardize features: always. Regularization penalties are scale‑dependent. If you skipped this in Regression I, go fix it now.
- Lambda grid: use log spacing; 50–100 lambdas is typical. glmnet's defaults work well (lambda_min_ratio ≈ 1e-2 when p > n, or 1e-4 otherwise).
- Warm starts: crucial. Coordinate descent + warm starts = path computation in near-linear time.
- Parallelize folds: run each fold's path in parallel, not each lambda. Parallel over folds minimizes communication and uses warm starts per fold.
- Use a validation curve, not just one point: check stability (repeated CV or nested CV) if model selection is critical.
Selection rules and overfitting guards
- Lambda_min: lowest mean CV error — might overfit slightly.
- Lambda_1se: pick the most regularized model within one standard error — great for parsimony.
- Nested CV: if you report final test performance after hyperparameter tuning, use nested CV to avoid optimistic bias. Outer folds measure performance; inner folds run the lambda path and choose lambda.
- Stability selection: for highly sparse models (Lasso), combine subsampling with selection frequencies to get robust variable selection.
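Nested CV is less code than it sounds in scikit-learn: wrap the inner lambda search in `GridSearchCV` and hand that whole object to `cross_val_score` for the outer loop. A sketch on synthetic data (the grid values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)

# Inner loop: 5-fold CV over the lambda grid picks alpha per outer split.
inner = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": np.logspace(-3, 0, 20)},
    cv=5,
    scoring="neg_mean_squared_error",
)
# Outer loop: 5 folds that the lambda search never touches.
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_mean_squared_error")
honest_mse = -outer_scores.mean()  # not biased by the inner tuning
```

Each outer fold may pick a different lambda; that is expected, and the spread of inner choices is itself a useful stability diagnostic.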
How the different penalties behave along the path (cheat sheet)
| Penalty | Path behavior | When to prefer |
|---|---|---|
| Ridge (L2) | Smooth shrinkage, coefficients shrink continuously but almost never hit zero | When multicollinearity dominates and you want stable predictions |
| Lasso (L1) | Produces sparse paths where coefficients hit zero at various lambdas | When you want variable selection and interpretability |
| Elastic Net | Mixture: combines sparsity and grouping effects, stable with correlated predictors | When predictors are correlated and you want some sparsity |
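The first two rows of the cheat sheet are easy to verify empirically: at a comparable penalty strength, ridge shrinks every coefficient but leaves them all nonzero, while the lasso drives the irrelevant ones exactly to zero. A sketch on synthetic data with only two true signals (penalty values chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.standard_normal((120, 10))
beta = np.array([3.0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0])  # 8 null features
y = X @ beta + 0.5 * rng.standard_normal(120)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))  # smooth shrinkage: no exact zeros
n_zero_lasso = int(np.sum(lasso.coef_ == 0))  # sparse path: nulls hit zero
```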
Common pitfalls (and how to avoid them)
- Forgetting to standardize — coefficients and lambda become meaningless.
- Performing CV after data-driven preprocessing that leaked the test set (e.g., scaling using full data). Always fit scalers inside CV folds.
- Reporting CV-selected performance without nested CV — optimistic evaluation.
- A lambda grid that is too narrow can miss the true minimum entirely; one that is too coarse can skip right over it.
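The second pitfall (leaky preprocessing) has a one-line cure in scikit-learn: put the scaler and the model in a `Pipeline`, so the scaler is re-fit on each training split only. A sketch on synthetic data with wildly different feature scales:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Columns on very different scales, so standardization actually matters.
X = rng.standard_normal((60, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 1.0])
y = X[:, 0] + 0.1 * rng.standard_normal(60)

# The scaler lives inside the pipeline: cross_val_score fits it per fold,
# so validation rows never leak into the scaling statistics.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.01, max_iter=10_000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
```

Scaling on the full dataset before CV looks harmless, but it lets each validation fold influence the means and variances the model trains against; the pipeline removes that leak mechanically.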
Parting wisdom
"A regularization path is like a movie of your model's life — watch it, don't just pick the ending."
Summary takeaways:
- Compute the full regularization path with warm starts and cross‑validation to robustly choose lambda.
- Use the CV curve and the 1‑SE rule to balance prediction and simplicity.
- Parallelize over folds, standardize features, and consider nested CV for honest performance estimates.
Want a challenge? Try: run a 10x5 repeated CV for an elastic net path, then compare variable selection stability across repeats. If the same predictors light up like a traffic jam every time, you have a reliable signal. If not — maybe your data's noise is the star of the show.
Go forth and path! (And bring snacks.)