Regression II: Regularization and Advanced Techniques
Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.
Lasso Regression and Sparsity — The No-Nonsense Guide
"If Ridge is the neat gardener trimming the bushes, Lasso is the ruthless landscaper who rips out entire plants. Sometimes your yard needs that." — Your slightly dramatic ML TA
Hook: Why tear features out by the root?
You already know how to build baseline models (dummy regressors) and interpret coefficients from our earlier modules, and you have just met Ridge Regression (L2), which gently shrinks coefficients. But what if your model is a hoarder, keeping dozens of tiny, useless features that make interpretation ugly and generalization weak?
Enter Lasso (L1) regression, the regularizer that does more than shrink: it drives some coefficients exactly to zero, giving you sparse, interpretable models. If interpretability, feature selection, or model simplicity matters, Lasso is the bouncer who decides which variables get to stay.
What is Lasso? The math, simply stated
At its core, Lasso solves a penalized least-squares problem:
Minimize (1 / (2n)) * ||y - Xβ||_2^2 + λ * ||β||_1
- The first term is the usual residual sum of squares (fit).
- The second term is the L1 penalty: the sum of absolute values of coefficients.
- λ ≥ 0 is the regularization strength. Larger λ → more coefficients forced to zero.
Compare to Ridge: Ridge uses ||β||_2^2 (sum of squares). That shrinks coefficients continuously but rarely makes them exactly zero.
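To make the contrast concrete, here is a small sketch on synthetic data (the feature counts and alpha values are illustrative choices, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: 20 features, only 5 of which carry real signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Ridge shrinks but keeps every coefficient nonzero;
# Lasso's soft-thresholding zeros out the irrelevant ones
print('Ridge nonzero:', (ridge.coef_ != 0).sum())
print('Lasso nonzero:', (lasso.coef_ != 0).sum())
```

On a run like this, Ridge reports all 20 coefficients as nonzero while Lasso keeps only a handful.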
Intuition: Geometry and why Lasso zeros things out
Picture level curves (ellipses) of the least-squares loss and a constraint region for the penalty:
- Ridge's constraint is a circle/ellipse (L2 ball) — intersections with contours usually produce small but nonzero β.
- Lasso's constraint is a diamond (L1 ball) with corners on axes — intersections often land on axes, producing zeros.
So the geometry of the penalty causes sparsity.
Why sparsity matters (practical reasons)
- Interpretability: fewer predictors → easier story to tell. From "model says X and Y matter" to "only X matters".
- Computation & storage: smaller model can be faster and cheaper (useful for embedded devices).
- Noise reduction: removing irrelevant features can reduce variance and improve generalization.
- Feature selection: Lasso does variable selection as part of training — a tidy, built-in selection method.
But it’s not a magic wand. Read on.
How Lasso differs from Ridge — quick comparison
| Property | Ridge (L2) | Lasso (L1) | Elastic Net (mix) |
|---|---|---|---|
| Shrinkage vs selection | Shrinks, keeps all features | Shrinks and sets many to zero | Compromise: can select and shrink |
| Works well when | Many small/collinear effects | Few true nonzero coefficients | Correlated groups + sparsity |
| Geometry | Smooth ball (no corners) | Diamond (corners → zeros) | Intermediate shape |
Practical considerations & gotchas
- Standardize your features, always. Lasso is scale-sensitive: a feature with a huge scale needs only a tiny coefficient, so its penalty is effectively smaller. Apply StandardScaler to X before fitting Lasso.
- λ selection matters — Use cross-validation (e.g., LassoCV) to pick λ. Too big → everything zero. Too small → overfitting.
- Correlated predictors — Lasso arbitrarily chooses one from a group of correlated variables and zeros the rest. If you want grouped selection, consider Elastic Net (mix of L1 and L2) or grouped Lasso.
- Stability — Lasso feature sets can be unstable under small data perturbations. Bootstrapping can help assess selection stability.
- Degrees of freedom & bias — Lasso introduces bias in coefficients (especially large λ). Post-selection OLS (refit unpenalized on selected features) can sometimes reduce bias.
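The first gotcha is easy to demonstrate. In this sketch (synthetic data; the scale factor and alpha are arbitrary), a genuinely informative feature gets dropped by Lasso purely because its scale makes its coefficient expensive:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Shrink feature 0's scale: its coefficient must grow 1000x to compensate,
# so the L1 penalty on it becomes prohibitively expensive
X_bad = X.copy()
X_bad[:, 0] /= 1000

raw    = Lasso(alpha=0.05).fit(X_bad, y).coef_
scaled = Lasso(alpha=0.05).fit(StandardScaler().fit_transform(X_bad), y).coef_

print('without scaling:', np.round(raw, 3))     # feature 0 zeroed
print('with scaling:   ', np.round(scaled, 3))  # all three survive
```

Same data, same signal, different answers, purely because of units. That is why the scaler belongs inside the pipeline.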
Algorithmic notes (how Lasso is solved)
- Popular algorithms: coordinate descent (fast and simple), LARS (Least Angle Regression) for entire solution path, and proximal gradient methods.
- Coordinate descent: freeze all coefficients but one, minimize w.r.t. that coefficient with soft-thresholding, cycle until convergence. Elegant and efficient for high-dimensional data.
Pseudocode (very brief):
Initialize β = 0
Repeat until convergence:
For j in 1..p:
r_j = y - X_{-j} β_{-j} # partial residual
ρ = (1/n) * X_j^T r_j
β_j = sign(ρ) * max(|ρ| - λ, 0) / ( (1/n) * ||X_j||_2^2 )
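The pseudocode translates almost line for line into NumPy. This is a didactic sketch (the function names are our own; there is no convergence check or warm starting), not a substitute for an optimized solver:

```python
import numpy as np

def soft_threshold(rho, lam):
    # S(rho, lam) = sign(rho) * max(|rho| - lam, 0)
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n       # (1/n) * ||X_j||^2 per column
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: leave feature j out of the current prediction
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta
```

With enough sweeps this converges to the same minimizer as `Lasso(alpha=lam, fit_intercept=False)` in sklearn, since both optimize the identical objective; when the columns are orthogonal it is exact after a single sweep.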
Quick sklearn example
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Toy data so the snippet runs end to end; substitute your own split
X, y = make_regression(n_samples=300, n_features=50, n_informative=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, n_alphas=100))
pipe.fit(X_train, y_train)
coef = pipe.named_steps['lassocv'].coef_
print('Nonzero features:', (coef != 0).sum())
This picks λ by CV and returns a sparse model.
When to use Lasso — a decision checklist
- Use Lasso when:
- You suspect only a subset of features are really useful.
- You want automatic feature selection for interpretability.
- You have high-dimensional data (p comparable to or exceeds n).
- Consider other options when:
- Features are highly correlated → try Elastic Net.
- You prefer shrinkage but not selection → Ridge may be better.
- You need stable selection → consider stability selection / bootstrapped Lasso.
Small example (story form)
Imagine you have 200 genomic features. Most are noise, a few matter. Ordinary least squares overfits and gives you a bewildering forest of tiny coefficients. Ridge tames the magnitudes but keeps the forest. Lasso, with a well-chosen λ, removes many trees and leaves you with a few genes to investigate — an experimentalist’s dream.
But if those genes are highly correlated (the biology is messy), Lasso might pick one arbitrary gene from a cluster. Elastic Net can help you pick the whole cluster.
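A quick sketch of that behavior, using two synthetic "twin" features built from the same signal (alpha and l1_ratio are arbitrary illustrative values):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(42)
z = rng.normal(size=500)
# Features 0 and 1 are near-duplicates of the same underlying signal z
X = np.column_stack([z, z + 0.01 * rng.normal(size=500), rng.normal(size=500)])
y = 2 * z + X[:, 2] + 0.1 * rng.normal(size=500)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Lasso tends to load the shared signal onto one twin;
# Elastic Net's L2 term spreads it across both
print('Lasso:      ', np.round(lasso.coef_, 2))
print('Elastic Net:', np.round(enet.coef_, 2))
```

Both models recover the combined effect (the two twin coefficients sum to roughly 2), but Elastic Net keeps both twins with similar weights, which is usually what you want when the "twins" are a biologically meaningful cluster.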
Closing: TL;DR + challenges to try
- TL;DR: Lasso (L1) = shrink + selection → sparsity and interpretability. Ridge = shrink only. Elastic Net = best-of-both when features are correlated.
Key actions:
- Standardize features before regularizing.
- Use CV to choose λ (LassoCV).
- Check which features are zeroed — are they plausible?
- If correlation is high, prefer Elastic Net or group methods.
Final thought: Sparsity is beautiful, but biology/social systems and many real datasets are messy. Treat Lasso’s selections as hypotheses — useful guides, not gospel.
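One way to operationalize "hypotheses, not gospel" is the bootstrap check mentioned earlier: refit Lasso on resampled data and track how often each feature survives. A hedged sketch (alpha, the resample count, and the 80% cutoff are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=15, n_informative=3,
                       noise=10.0, random_state=1)

rng = np.random.default_rng(0)
n_boot = 50
freq = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
    freq += Lasso(alpha=5.0).fit(X[idx], y[idx]).coef_ != 0
freq /= n_boot

# Features surviving in (say) >80% of resamples are stable candidates;
# features that flicker in and out deserve extra skepticism
print('selection frequencies:', np.round(freq, 2))
```

Selection frequencies near 1.0 point to robust signals; frequencies near 0.5 are exactly the unstable picks the Stability bullet warned about.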
Exercises (try these in your notebook)
- Fit OLS, Ridge, Lasso on the same data. Compare test RMSE and number of nonzero coefficients.
- Create correlated predictors and observe how Lasso picks among them; then try Elastic Net.
- Implement coordinate descent for Lasso on a small dataset (for learning, not speed).