Regression II: Regularization and Advanced Techniques
Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.
Cross-Validated Regularization Paths
Cross‑Validated Regularization Paths — How to pick lambdas like a pro (and still have fun)
"You already learned linear regression and how to diagnose it. Now we regularize, we path-trace, and we cross-validate like we're solving a mystery." — Your slightly dramatic TA
Hook: Why a path and why cross-validation?
Remember when we fit OLS and then whispered, "Hmm, multicollinearity…"? Regularization fixes that. But regularization isn't a single magic number — it's a whole path of solutions indexed by the regularization strength (lambda). A cross‑validated regularization path is simply: compute model solutions across many lambdas and evaluate each via cross‑validation to pick the lambda that actually performs best out of sample.
This builds directly on what you already saw in Coordinate Descent Algorithms (we use warm starts and coordinate-wise updates) and Choosing Regularization Strength (we discussed criteria like min MSE and the 1‑SE rule). Here we glue those pieces together into practical, efficient workflows that scale.
Quick recap (two lines)
- From Regression I: make sure predictors are standardized and the intercept handled properly — otherwise the path is meaningless.
- From Coordinate Descent: warm starts (use previous lambda's coefficients as initialization) make computing the whole path cheap.
The anatomy of a cross‑validated path
- Lambda grid — choose a sequence: lambda_max -> lambda_min on a log scale. Typically lambda_max is the smallest value that zeroes out all coefficients for Lasso; lambda_min is a fraction (e.g., 1e-4) of lambda_max. Use ~50–100 values spaced logarithmically.
- Compute path — for each lambda, compute coefficients using coordinate descent (warm starts). Save the model.
- Cross‑validate — for each lambda, compute K‑fold CV error (e.g., RMSE for regression). Aggregate errors across folds.
- Select lambda — pick lambda_min (min CV error) or lambda_1se (largest lambda within 1 standard error of the min). The 1‑SE rule prefers simpler models.
- Refit final model — refit chosen lambda on the full training set (or use the CV‑averaged coefficients if you prefer).
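Step 1 (the lambda grid) is short enough to sketch directly. This is a minimal numpy version, assuming the glmnet-style lasso objective (1/(2n))||y - Xb||^2 + lambda*||b||_1 with standardized X and centered y, under which lambda_max = max_j |x_j^T y| / n is the smallest lambda that zeroes every coefficient. The helper name is ours, not a library function.

```python
import numpy as np

def lasso_lambda_grid(X, y, n_lambdas=100, min_ratio=1e-4):
    """Log-spaced lambda grid from lambda_max down to lambda_min.

    Assumes columns of X are standardized, y is centered, and the
    objective (1/(2n))||y - Xb||^2 + lambda * ||b||_1, so that
    lambda_max = max_j |x_j^T y| / n kills all coefficients.
    """
    n = X.shape[0]
    lambda_max = np.max(np.abs(X.T @ y)) / n
    # Decreasing grid: start where everything is zero, relax toward OLS.
    return np.logspace(np.log10(lambda_max),
                       np.log10(min_ratio * lambda_max), n_lambdas)
```

The decreasing order matters: warm starts work because each lambda's solution is a good initializer for the next, slightly smaller one.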
Pseudocode: Efficient CV regularization path (with warm starts and parallel folds)
```
Input: X, y, model = 'lasso' or 'elasticnet', alpha (elastic net mixing),
       K (folds), L (num lambdas)

Standardize X; center y if intercept included.
Compute lambda_max from the data and model type.
lambdas = logspace(log(lambda_max), log(lambda_min = ratio * lambda_max), L)   # decreasing

For each fold k (in parallel):
    Split into X_k_train, y_k_train, X_k_val, y_k_val
    beta_prev = 0                                  # warm start for this fold
    For lambda in lambdas:
        beta = coordinate_descent(X_k_train, y_k_train, lambda, alpha, init=beta_prev)
        pred = X_k_val @ beta + intercept
        fold_errors[k, lambda] = loss(pred, y_k_val)
        beta_prev = beta                           # warm start for next lambda

Aggregate across folds: mean_error[lambda], se_error[lambda]
lambda_min = argmin over lambda of mean_error[lambda]
lambda_1se = largest lambda with mean_error[lambda] <= mean_error[lambda_min] + se_error[lambda_min]
Refit the final model on the full training set with the chosen lambda.
```
Notes: warm starts per fold are key — they make each subsequent lambda converge in far fewer iterations.
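The pseudocode above translates almost line for line into scikit-learn; here is a sketch using `Lasso(warm_start=True)`, which reuses the previous fit's coefficients as the initializer for the next lambda, exactly the warm-start trick described. The function name and defaults are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_lasso_path(X, y, lambdas, n_folds=5, seed=0):
    """K-fold CV over a decreasing lambda grid, warm-started per fold."""
    errors = np.zeros((n_folds, len(lambdas)))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for k, (tr, va) in enumerate(kf.split(X)):
        model = Lasso(warm_start=True, max_iter=10_000)
        for j, lam in enumerate(lambdas):  # largest lambda first
            model.set_params(alpha=lam)
            model.fit(X[tr], y[tr])       # starts from previous coefficients
            pred = model.predict(X[va])
            errors[k, j] = np.mean((pred - y[va]) ** 2)
    mean_err = errors.mean(axis=0)
    se_err = errors.std(axis=0, ddof=1) / np.sqrt(n_folds)
    j_min = int(np.argmin(mean_err))
    # 1-SE rule: largest lambda within one SE of the minimum mean error.
    within_1se = mean_err <= mean_err[j_min] + se_err[j_min]
    j_1se = int(np.argmax(within_1se))  # grid decreasing, so first True = largest lambda
    return lambdas[j_min], lambdas[j_1se], mean_err, se_err
```

Note that `lambda_1se >= lambda_min` by construction: the 1-SE model is always at least as regularized as the CV-optimal one.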
Visual diagnostics: the CV curve (a must‑see)
Plot mean CV error ± 1 SE vs log(lambda). Add vertical lines at lambda_min and lambda_1se.
Why this plot matters:
- You can visually inspect stability: large error swings mean unstable CV — maybe your folds are unbalanced or the dataset is tiny.
- If the error curve is very flat near the minimum, choose the larger lambda for simplicity (the 1‑SE rule).
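Drawing that curve takes only a few matplotlib calls. A minimal sketch, assuming you already have the per-lambda mean errors and standard errors from CV (the helper name is ours):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

def plot_cv_curve(lambdas, mean_err, se_err, lam_min, lam_1se):
    """Mean CV error with a +/- 1 SE band vs log10(lambda),
    plus vertical lines at the two selection rules."""
    fig, ax = plt.subplots()
    logl = np.log10(lambdas)
    ax.errorbar(logl, mean_err, yerr=se_err, fmt="o-", capsize=2)
    ax.axvline(np.log10(lam_min), linestyle="--", label="lambda_min")
    ax.axvline(np.log10(lam_1se), linestyle=":", label="lambda_1se")
    ax.set_xlabel("log10(lambda)")
    ax.set_ylabel("mean CV error")
    ax.legend()
    return fig, ax
```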
Practical knobs and heuristics (the stuff that saves your compute budget)
- Standardize features: always. Regularization penalties are scale‑dependent. If you skipped this in Regression I, go fix it now.
- Lambda grid: use log spacing; 50–100 lambdas is typical. glmnet's defaults work well (lambda_min_ratio ≈ 1e-2 when p > n, or 1e-4 otherwise).
- Warm starts: crucial. Coordinate descent + warm starts = path computation in near-linear time.
- Parallelize folds: run each fold's path in parallel, not each lambda. Parallel over folds minimizes communication and uses warm starts per fold.
- Use a validation curve, not just one point: check stability (repeated CV or nested CV) if model selection is critical.
Selection rules and overfitting guards
- Lambda_min: lowest mean CV error — might overfit slightly.
- Lambda_1se: pick the most regularized model within one standard error — great for parsimony.
- Nested CV: if you report final test performance after hyperparameter tuning, use nested CV to avoid optimistic bias. Outer folds measure performance; inner folds run the lambda path and choose lambda.
- Stability selection: for highly sparse models (Lasso), combine subsampling with selection frequencies to get robust variable selection.
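Nested CV is less code than it sounds in scikit-learn: wrap the inner lambda search in `GridSearchCV` and hand that whole object to `cross_val_score` for the outer loop. A sketch on synthetic data (the grid values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)

# Inner loop: 5-fold CV over the lambda grid picks alpha per outer split.
inner = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": np.logspace(-3, 0, 20)},
    cv=5,
    scoring="neg_mean_squared_error",
)
# Outer loop: 5 folds that the lambda search never touches.
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_mean_squared_error")
honest_mse = -outer_scores.mean()  # not biased by the inner tuning
```

Each outer fold may pick a different lambda; that is expected, and the spread of inner choices is itself a useful stability diagnostic.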
How the different penalties behave along the path (cheat sheet)
| Penalty | Path behavior | When to prefer |
|---|---|---|
| Ridge (L2) | Smooth shrinkage, coefficients shrink continuously but almost never hit zero | When multicollinearity dominates and you want stable predictions |
| Lasso (L1) | Produces sparse paths where coefficients hit zero at various lambdas | When you want variable selection and interpretability |
| Elastic Net | Mixture: combines sparsity and grouping effects, stable with correlated predictors | When predictors are correlated and you want some sparsity |
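The first two rows of the cheat sheet are easy to verify empirically: at a comparable penalty strength, ridge shrinks every coefficient but leaves them all nonzero, while the lasso drives the irrelevant ones exactly to zero. A sketch on synthetic data with only two true signals (penalty values chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.standard_normal((120, 10))
beta = np.array([3.0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0])  # 8 null features
y = X @ beta + 0.5 * rng.standard_normal(120)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))  # smooth shrinkage: no exact zeros
n_zero_lasso = int(np.sum(lasso.coef_ == 0))  # sparse path: nulls hit zero
```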
Common pitfalls (and how to avoid them)
- Forgetting to standardize — coefficients and lambda become meaningless.
- Performing CV after data-driven preprocessing that leaked the test set (e.g., scaling using full data). Always fit scalers inside CV folds.
- Reporting CV-selected performance without nested CV — optimistic evaluation.
- A lambda grid that is too narrow can miss the true minimum entirely; one that is too coarse can skip right over it.
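The second pitfall (leaky preprocessing) has a one-line cure in scikit-learn: put the scaler and the model in a `Pipeline`, so the scaler is re-fit on each training split only. A sketch on synthetic data with wildly different feature scales:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Columns on very different scales, so standardization actually matters.
X = rng.standard_normal((60, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 1.0])
y = X[:, 0] + 0.1 * rng.standard_normal(60)

# The scaler lives inside the pipeline: cross_val_score fits it per fold,
# so validation rows never leak into the scaling statistics.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.01, max_iter=10_000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
```

Scaling on the full dataset before CV looks harmless, but it lets each validation fold influence the means and variances the model trains against; the pipeline removes that leak mechanically.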
Parting wisdom
"A regularization path is like a movie of your model's life — watch it, don't just pick the ending."
Summary takeaways:
- Compute the full regularization path with warm starts and cross‑validation to robustly choose lambda.
- Use the CV curve and the 1‑SE rule to balance prediction and simplicity.
- Parallelize over folds, standardize features, and consider nested CV for honest performance estimates.
Want a challenge? Try: run a 10x5 repeated CV for an elastic net path, then compare variable selection stability across repeats. If the same predictors light up like a traffic jam every time, you have a reliable signal. If not — maybe your data's noise is the star of the show.
Go forth and path! (And bring snacks.)