
Supervised Machine Learning: Regression and Classification

Regression II: Regularization and Advanced Techniques


Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.

Lasso Regression and Sparsity — The No-Nonsense Guide

"If Ridge is the neat gardener trimming the bushes, Lasso is the ruthless landscaper who rips out entire plants. Sometimes your yard needs that." — Your slightly dramatic ML TA


Hook: Why tear features out by the root?

You already know how to build baseline models (dummy regressors) and interpret coefficients from our earlier modules, and you just met Ridge Regression (L2), which gently shrinks coefficients. But what if your model is a hoarder, keeping dozens of tiny, useless features that make interpretation ugly and generalization weak?

Enter Lasso (L1) regression, the regularizer that does more than shrink: it sets some coefficients exactly to zero, giving you sparse, interpretable models. If interpretability, feature selection, or model simplicity matters, Lasso is the bouncer who decides which variables get to stay.


What is Lasso? The math, simply stated

At its core, Lasso solves a penalized least-squares problem:

Minimize (1 / (2n)) * ||y - Xβ||_2^2 + λ * ||β||_1
  • The first term is the usual residual sum of squares (fit).
  • The second term is the L1 penalty: the sum of absolute values of coefficients.
  • λ ≥ 0 is the regularization strength. Larger λ → more coefficients forced to zero.

Compare to Ridge: Ridge uses ||β||_2^2 (sum of squares). That shrinks coefficients continuously but rarely makes them exactly zero.
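The shrink-versus-zero contrast is easiest to see in the orthonormal-design special case (a simplifying assumption under which both estimators have closed forms): ridge divides each OLS coefficient by (1 + λ), while lasso soft-thresholds it. A minimal numpy sketch:

```python
import numpy as np

def ridge_update(beta_ols, lam):
    # Orthonormal design: ridge shrinks every OLS coefficient proportionally.
    return beta_ols / (1.0 + lam)

def lasso_update(beta_ols, lam):
    # Same design: lasso soft-thresholds, shrinking each coefficient toward
    # zero by lam and clipping to exactly zero once |beta| <= lam.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, 0.4, -0.2])
print(ridge_update(beta_ols, 0.5))  # all nonzero, just smaller
print(lasso_update(beta_ols, 0.5))  # small coefficients become exactly 0
```

Note how ridge keeps every coefficient alive, while lasso kills anything whose OLS estimate is below the threshold λ.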


Intuition: Geometry and why Lasso zeros things out

Picture level curves (ellipses) of the least-squares loss and a constraint region for the penalty:

  • Ridge's constraint is a circle/ellipse (L2 ball) — intersections with contours usually produce small but nonzero β.
  • Lasso's constraint is a diamond (L1 ball) with corners on axes — intersections often land on axes, producing zeros.

So the geometry of the penalty causes sparsity.


Why sparsity matters (practical reasons)

  • Interpretability: fewer predictors → easier story to tell. From "model says X and Y matter" to "only X matters".
  • Computation & storage: smaller model can be faster and cheaper (useful for embedded devices).
  • Noise reduction: removing irrelevant features can reduce variance and improve generalization.
  • Feature selection: Lasso does variable selection as part of training — a tidy, built-in selection method.

But it’s not a magic wand. Read on.


How Lasso differs from Ridge — quick comparison

Property               | Ridge (L2)                   | Lasso (L1)                    | Elastic Net (mix)
Shrinkage vs selection | Shrinks, keeps all features  | Shrinks and sets many to zero | Compromise: can select and shrink
Works well when        | Many small/collinear effects | Few true nonzero coefficients | Correlated groups + sparsity
Geometry               | Smooth ball (no corners)     | Diamond (corners → zeros)     | Intermediate shape
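A quick synthetic check of the comparison above (an illustrative sketch; the alpha values here are arbitrary, not tuned): generate data where only 5 of 50 features matter, fit both models, and count surviving coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3, -2, 1.5, 1, -1]       # only 5 features truly matter
y = X @ true_beta + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge nonzero:", (ridge.coef_ != 0).sum())  # typically all 50
print("Lasso nonzero:", (lasso.coef_ != 0).sum())  # close to the true 5
```

Ridge returns 50 small-but-nonzero coefficients; lasso recovers roughly the sparse truth.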

Practical considerations & gotchas

  1. Standardize your features — Always. Lasso is scale-sensitive. If one feature has a huge scale, its penalty is effectively smaller. Do: StandardScaler on X before applying Lasso.
  2. λ selection matters — Use cross-validation (e.g., LassoCV) to pick λ. Too big → everything zero. Too small → overfitting.
  3. Correlated predictors — Lasso arbitrarily chooses one from a group of correlated variables and zeros the rest. If you want grouped selection, consider Elastic Net (mix of L1 and L2) or grouped Lasso.
  4. Stability — Lasso feature sets can be unstable under small data perturbations. Bootstrapping can help assess selection stability.
  5. Degrees of freedom & bias — Lasso introduces bias in coefficients (especially large λ). Post-selection OLS (refit unpenalized on selected features) can sometimes reduce bias.
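Gotcha 1 can be demonstrated directly. In this sketch (synthetic data, arbitrary alpha), two features carry the same signal but one lives on a scale 1000x larger; without standardization the large-scale feature effectively dodges the penalty while the unit-scale one gets shrunk hard.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)           # unit-scale feature
x2 = rng.normal(size=n) * 1000    # independent feature on a huge scale
y = 2 * x1 + 2 * (x2 / 1000) + rng.normal(scale=0.1, size=n)  # equal true effects

X_raw = np.column_stack([x1, x2])
raw = Lasso(alpha=0.5).fit(X_raw, y)
std = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X_raw), y)

print("raw coefs:", raw.coef_)    # large-scale feature barely penalized
print("std coefs:", std.coef_)    # both features shrunk comparably
```

On the raw data, the unit-scale coefficient is shrunk from 2 toward 1.5 while the large-scale one keeps essentially its full (tiny-valued) effect; after standardizing, both are penalized symmetrically.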

Algorithmic notes (how Lasso is solved)

  • Popular algorithms: coordinate descent (fast and simple), LARS (Least Angle Regression) for entire solution path, and proximal gradient methods.
  • Coordinate descent: freeze all coefficients but one, minimize w.r.t. that coefficient with soft-thresholding, cycle until convergence. Elegant and efficient for high-dimensional data.

Pseudocode (very brief):

Initialize β = 0
Repeat until convergence:
  For j in 1..p:
    r_j = y - X_{-j} β_{-j}   # partial residual
    ρ = (1/n) * X_j^T r_j
    β_j = sign(ρ) * max(|ρ| - λ, 0) / ( (1/n) * ||X_j||_2^2 )
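The pseudocode above translates almost line-for-line into numpy. This is a learning sketch (fixed iteration count, no convergence check or intercept), but it minimizes the same (1/(2n))·RSS + λ·||β||_1 objective:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n            # (1/n)||X_j||^2 per column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave feature j out of the current fit.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            # Soft-thresholding update for coordinate j.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.normal(scale=0.1, size=100)
beta = lasso_cd(X, y, lam=0.1)
print(np.round(beta, 2))  # large weights on the first two features, the rest near 0
```

With λ = 0.1 the two true coefficients come back slightly shrunk (the bias mentioned in gotcha 5), and the noise features are zeroed or near zero.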

Quick sklearn example

from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Assumes X_train and y_train are already defined.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, n_alphas=100))
pipe.fit(X_train, y_train)

coef = pipe.named_steps['lassocv'].coef_
print('Nonzero features:', (coef != 0).sum())

This picks λ by CV and returns a sparse model.


When to use Lasso — a decision checklist

  • Use Lasso when:
    • You suspect only a subset of features are really useful.
    • You want automatic feature selection for interpretability.
    • You have high-dimensional data (p comparable to or exceeds n).
  • Consider other options when:
    • Features are highly correlated → try Elastic Net.
    • You prefer shrinkage but not selection → Ridge may be better.
    • You need stable selection → consider stability selection / bootstrapped Lasso.

Small example (story form)

Imagine you have 200 genomic features. Most are noise, a few matter. Ordinary least squares overfits and gives you a bewildering forest of tiny coefficients. Ridge tames the magnitudes but keeps the forest. Lasso, with a well-chosen λ, removes many trees and leaves you with a few genes to investigate — an experimentalist’s dream.

But if those genes are highly correlated (the biology is messy), Lasso might pick one arbitrary gene from a cluster. Elastic Net can help you pick the whole cluster.
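A sketch of that failure mode (synthetic near-duplicate features; the alpha and l1_ratio values are arbitrary choices): three almost-identical columns stand in for a correlated gene cluster.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=n)
# Three nearly identical "genes": a tightly correlated cluster.
X = np.column_stack([base + rng.normal(scale=0.01, size=n) for _ in range(3)])
y = base * 3 + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefs:", np.round(lasso.coef_, 2))       # tends to load on one member
print("ElasticNet coefs:", np.round(enet.coef_, 2))   # spreads weight across the cluster
```

Lasso concentrates the effect on essentially one cluster member; Elastic Net's L2 component pushes toward an even split, keeping the whole cluster in play.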


Closing: TL;DR + challenges to try

  • TL;DR: Lasso (L1) = shrink + selection → sparsity and interpretability. Ridge = shrink only. Elastic Net = best-of-both when features are correlated.

Key actions:

  1. Standardize features before regularizing.
  2. Use CV to choose λ (LassoCV).
  3. Check which features are zeroed — are they plausible?
  4. If correlation is high, prefer Elastic Net or group methods.

Final thought: Sparsity is beautiful, but biology/social systems and many real datasets are messy. Treat Lasso’s selections as hypotheses — useful guides, not gospel.


Exercises (try these in your notebook)

  1. Fit OLS, Ridge, Lasso on the same data. Compare test RMSE and number of nonzero coefficients.
  2. Create correlated predictors and observe how Lasso picks among them; then try Elastic Net.
  3. Implement coordinate descent for Lasso on a small dataset (for learning, not speed).