Regression II: Regularization and Advanced Techniques
Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.
Lasso Regression and Sparsity — The No-Nonsense Guide
"If Ridge is the neat gardener trimming the bushes, Lasso is the ruthless landscaper who rips out entire plants. Sometimes your yard needs that." — Your slightly dramatic ML TA
Hook: Why tear features out by the root?
You already know how to build baseline models (dummy regressors) and interpret coefficients from our earlier modules, and you have just met Ridge Regression (L2), which gently shrinks coefficients. But what if your model is a hoarder, keeping dozens of tiny, useless features that make interpretation ugly and generalization weak?
Enter Lasso (L1) regression, the regularizer that does more than shrink: it drives some coefficients exactly to zero, giving you sparse, interpretable models. If interpretability, feature selection, or model simplicity matters, Lasso is the bouncer who decides which variables get to stay.
What is Lasso? The math, simply stated
At its core, Lasso solves a penalized least-squares problem:
Minimize (1 / (2n)) * ||y - Xβ||_2^2 + λ * ||β||_1
- The first term is the usual residual sum of squares (fit).
- The second term is the L1 penalty: the sum of absolute values of coefficients.
- λ ≥ 0 is the regularization strength. Larger λ → more coefficients forced to zero.
Compare to Ridge: Ridge uses ||β||_2^2 (sum of squares). That shrinks coefficients continuously but rarely makes them exactly zero.
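To make the contrast concrete, here is a small sketch on synthetic data (the feature counts and alpha values are illustrative choices, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: 20 features, only 5 of which carry real signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Ridge shrinks but keeps every coefficient nonzero;
# Lasso's soft-thresholding zeros out the irrelevant ones
print('Ridge nonzero:', (ridge.coef_ != 0).sum())
print('Lasso nonzero:', (lasso.coef_ != 0).sum())
```

On a run like this, Ridge reports all 20 coefficients as nonzero while Lasso keeps only a handful.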
Intuition: Geometry and why Lasso zeros things out
Picture level curves (ellipses) of the least-squares loss and a constraint region for the penalty:
- Ridge's constraint is a circle/ellipse (L2 ball) — intersections with contours usually produce small but nonzero β.
- Lasso's constraint is a diamond (L1 ball) with corners on axes — intersections often land on axes, producing zeros.
So the geometry of the penalty causes sparsity.
Why sparsity matters (practical reasons)
- Interpretability: fewer predictors → easier story to tell. From "model says X and Y matter" to "only X matters".
- Computation & storage: smaller model can be faster and cheaper (useful for embedded devices).
- Noise reduction: removing irrelevant features can reduce variance and improve generalization.
- Feature selection: Lasso does variable selection as part of training — a tidy, built-in selection method.
But it’s not a magic wand. Read on.
How Lasso differs from Ridge — quick comparison
| Property | Ridge (L2) | Lasso (L1) | Elastic Net (mix) |
|---|---|---|---|
| Shrinkage vs selection | Shrinks, keeps all features | Shrinks and sets many to zero | Compromise: can select and shrink |
| Works well when | Many small/collinear effects | Few true nonzero coefficients | Correlated groups + sparsity |
| Geometry | Smooth ball (no corners) | Diamond (corners → zeros) | Intermediate shape |
Practical considerations & gotchas
- Standardize your features, always. Lasso is scale-sensitive: a feature with a huge scale needs only a tiny coefficient, so its penalty is effectively smaller. Apply StandardScaler to X before fitting Lasso.
- λ selection matters — Use cross-validation (e.g., LassoCV) to pick λ. Too big → everything zero. Too small → overfitting.
- Correlated predictors — Lasso arbitrarily chooses one from a group of correlated variables and zeros the rest. If you want grouped selection, consider Elastic Net (mix of L1 and L2) or grouped Lasso.
- Stability — Lasso feature sets can be unstable under small data perturbations. Bootstrapping can help assess selection stability.
- Degrees of freedom & bias — Lasso introduces bias in coefficients (especially large λ). Post-selection OLS (refit unpenalized on selected features) can sometimes reduce bias.
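The first gotcha is easy to demonstrate. In this sketch (synthetic data; the scale factor and alpha are arbitrary), a genuinely informative feature gets dropped by Lasso purely because its scale makes its coefficient expensive:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Shrink feature 0's scale: its coefficient must grow 1000x to compensate,
# so the L1 penalty on it becomes prohibitively expensive
X_bad = X.copy()
X_bad[:, 0] /= 1000

raw    = Lasso(alpha=0.05).fit(X_bad, y).coef_
scaled = Lasso(alpha=0.05).fit(StandardScaler().fit_transform(X_bad), y).coef_

print('without scaling:', np.round(raw, 3))     # feature 0 zeroed
print('with scaling:   ', np.round(scaled, 3))  # all three survive
```

Same data, same signal, different answers, purely because of units. That is why the scaler belongs inside the pipeline.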
Algorithmic notes (how Lasso is solved)
- Popular algorithms: coordinate descent (fast and simple), LARS (Least Angle Regression) for entire solution path, and proximal gradient methods.
- Coordinate descent: freeze all coefficients but one, minimize w.r.t. that coefficient with soft-thresholding, cycle until convergence. Elegant and efficient for high-dimensional data.
Pseudocode (very brief):
Initialize β = 0
Repeat until convergence:
For j in 1..p:
r_j = y - X_{-j} β_{-j} # partial residual
ρ = (1/n) * X_j^T r_j
β_j = sign(ρ) * max(|ρ| - λ, 0) / ( (1/n) * ||X_j||_2^2 )
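The pseudocode translates almost line for line into NumPy. This is a didactic sketch (the function names are our own; there is no convergence check or warm starting), not a substitute for an optimized solver:

```python
import numpy as np

def soft_threshold(rho, lam):
    # S(rho, lam) = sign(rho) * max(|rho| - lam, 0)
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n       # (1/n) * ||X_j||^2 per column
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: leave feature j out of the current prediction
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta
```

With enough sweeps this converges to the same minimizer as `Lasso(alpha=lam, fit_intercept=False)` in sklearn, since both optimize the identical objective; when the columns are orthogonal it is exact after a single sweep.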
Quick sklearn example
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Toy data so the snippet runs end to end; substitute your own split
X, y = make_regression(n_samples=300, n_features=50, n_informative=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, n_alphas=100))
pipe.fit(X_train, y_train)
coef = pipe.named_steps['lassocv'].coef_
print('Nonzero features:', (coef != 0).sum())
This picks λ by CV and returns a sparse model.
When to use Lasso — a decision checklist
- Use Lasso when:
- You suspect only a subset of features are really useful.
- You want automatic feature selection for interpretability.
- You have high-dimensional data (p comparable to or exceeds n).
- Consider other options when:
- Features are highly correlated → try Elastic Net.
- You prefer shrinkage but not selection → Ridge may be better.
- You need stable selection → consider stability selection / bootstrapped Lasso.
Small example (story form)
Imagine you have 200 genomic features. Most are noise, a few matter. Ordinary least squares overfits and gives you a bewildering forest of tiny coefficients. Ridge tames the magnitudes but keeps the forest. Lasso, with a well-chosen λ, removes many trees and leaves you with a few genes to investigate — an experimentalist’s dream.
But if those genes are highly correlated (the biology is messy), Lasso might pick one arbitrary gene from a cluster. Elastic Net can help you pick the whole cluster.
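A quick sketch of that behavior, using two synthetic "twin" features built from the same signal (alpha and l1_ratio are arbitrary illustrative values):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(42)
z = rng.normal(size=500)
# Features 0 and 1 are near-duplicates of the same underlying signal z
X = np.column_stack([z, z + 0.01 * rng.normal(size=500), rng.normal(size=500)])
y = 2 * z + X[:, 2] + 0.1 * rng.normal(size=500)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Lasso tends to load the shared signal onto one twin;
# Elastic Net's L2 term spreads it across both
print('Lasso:      ', np.round(lasso.coef_, 2))
print('Elastic Net:', np.round(enet.coef_, 2))
```

Both models recover the combined effect (the two twin coefficients sum to roughly 2), but Elastic Net keeps both twins with similar weights, which is usually what you want when the "twins" are a biologically meaningful cluster.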
Closing: TL;DR + challenges to try
- TL;DR: Lasso (L1) = shrink + selection → sparsity and interpretability. Ridge = shrink only. Elastic Net = best-of-both when features are correlated.
Key actions:
- Standardize features before regularizing.
- Use CV to choose λ (LassoCV).
- Check which features are zeroed — are they plausible?
- If correlation is high, prefer Elastic Net or group methods.
Final thought: Sparsity is beautiful, but biology/social systems and many real datasets are messy. Treat Lasso’s selections as hypotheses — useful guides, not gospel.
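One way to operationalize "hypotheses, not gospel" is the bootstrap check mentioned earlier: refit Lasso on resampled data and track how often each feature survives. A hedged sketch (alpha, the resample count, and the 80% cutoff are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=15, n_informative=3,
                       noise=10.0, random_state=1)

rng = np.random.default_rng(0)
n_boot = 50
freq = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
    freq += Lasso(alpha=5.0).fit(X[idx], y[idx]).coef_ != 0
freq /= n_boot

# Features surviving in (say) >80% of resamples are stable candidates;
# features that flicker in and out deserve extra skepticism
print('selection frequencies:', np.round(freq, 2))
```

Selection frequencies near 1.0 point to robust signals; frequencies near 0.5 are exactly the unstable picks the Stability bullet warned about.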
Exercises (try these in your notebook)
- Fit OLS, Ridge, Lasso on the same data. Compare test RMSE and number of nonzero coefficients.
- Create correlated predictors and observe how Lasso picks among them; then try Elastic Net.
- Implement coordinate descent for Lasso on a small dataset (for learning, not speed).