Supervised Machine Learning: Regression and Classification
Regression II: Regularization and Advanced Techniques


Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.

Coordinate Descent Algorithms — The Lazy Genius of Regularized Regression

"Why run when you can stroll one coordinate at a time and still win the race?" — Your future algorithm, probably.

You already know how linear regression loves its closed-form solution and how regularization (Ridge, Lasso, Elastic Net) makes that solution behave. You've just tuned the penalty strength (λ) and the mixing parameter (α); nice work. Now we ask: when those penalties make the problem non-smooth (hello, L1) or computationally heavy, how do we actually solve the optimization efficiently? Enter Coordinate Descent: a deceptively simple, surprisingly fast algorithm that updates one parameter at a time.


What is Coordinate Descent? (Short version)

Coordinate Descent (CD) optimizes a multivariate objective by repeatedly optimizing with respect to a single parameter (coordinate) while holding the others fixed. Think of it as cleaning a messy room one drawer at a time rather than trying to fold everything at once.

  • For squared-error loss with penalties, each coordinate update often has a closed-form (or simple proximal step). That makes CD elegant and extremely practical for Lasso and Elastic Net.
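To see the one-drawer-at-a-time idea in action, here is exact coordinate descent on a smooth two-variable quadratic (the function and numbers are illustrative choices of ours, not from the lesson); each coordinate minimizer has a closed form, so every step just solves a one-variable equation:

```python
# Minimize f(b1, b2) = (b1 - 1)**2 + (b2 + 2)**2 + b1 * b2 by exact
# coordinate minimization: freeze one variable, solve for the other.
b1, b2 = 0.0, 0.0
for _ in range(100):
    b1 = 1.0 - b2 / 2.0    # solves df/db1 = 2*(b1 - 1) + b2 = 0
    b2 = -2.0 - b1 / 2.0   # solves df/db2 = 2*(b2 + 2) + b1 = 0

# The sweeps converge to the joint minimizer (8/3, -10/3)
print(b1, b2)
```

Each sweep shrinks the error by a constant factor, so a few dozen sweeps reach machine precision here.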

Why prefer Coordinate Descent (vs. gradient methods)?

  • Sparsity-friendly: L1 introduces kinks. Gradient descent struggles with non-differentiability at zero; CD uses soft-thresholding and handles sparsity like a champ.
  • Closed-form cheap updates: For many problems (Lasso), the update for a coefficient is simple and O(n) in the number of samples.
  • Warm starts & path algorithms: CD is ideal for computing regularization paths (varying lambda) — use previous solution as a starting point. You already saw why pathwise approaches help when choosing regularization strength.
  • Memory efficiency: Works well when p (features) is large but each update only references one column at a time.

When to not use CD: if your objective couples all parameters with dense Hessians and no simple coordinate minimizer, or if you need heavy parallelism (naive CD is naturally serial). In those cases, accelerated gradient or second-order methods might win.


The magic formula — Lasso coordinate update

For Lasso (minimize (1/2n) ||y - Xβ||^2 + λ ||β||_1), after centering y and standardizing the columns of X (crucial!), the coordinate update for β_j is:

β_j <- S( (1/n) x_j^T (r + x_j β_j), λ )

where r = y - Xβ is the residual with the current β, and S is the soft-thresholding operator:

S(z, λ) = sign(z) * max(|z| - λ, 0)

In words: compute the unpenalized least-squares coefficient for coordinate j (that dot product), then pull it toward zero by λ; if it crosses zero, it is set to exactly zero. That thresholding is precisely where Lasso's sparsity comes from.
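A minimal NumPy sketch of the soft-thresholding operator (the helper name is ours):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(3.0, 1.0))    # 2.0: pulled toward zero by lam
print(soft_threshold(-0.4, 1.0))   # |z| < lam, so the result is exactly zero
print(soft_threshold(-2.5, 1.0))   # -1.5
```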


Elastic Net: coordinate update with a tiny twist

Elastic Net (α mixes L1 and L2): objective = (1/2n) ||y - Xβ||^2 + λ( α ||β||_1 + ((1-α)/2) ||β||_2^2 )

Coordinate update becomes a scaled soft-threshold:

β_j <- S( (1/n) x_j^T (r + x_j β_j), λα ) / (1 + λ(1-α))

So the L2 part simply shrinks the denominator (like Ridge), and the L1 part gives sparsity via soft-thresholding. This is why CD pairs so well with Elastic Net — you get closed-form, cheap updates, and a smooth handle on both penalties. Remember how we picked α earlier? CD makes those mixed-penalty problems tractable.
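The same update as a tiny function (assuming standardized columns so x_j^T x_j / n = 1; the function name is ours):

```python
import numpy as np

def enet_update(z, lam, alpha):
    """Elastic Net coordinate update, where z = (1/n) x_j^T (r + x_j beta_j)."""
    num = np.sign(z) * max(abs(z) - lam * alpha, 0.0)  # L1: soft-threshold
    return num / (1.0 + lam * (1.0 - alpha))           # L2: ridge-style shrink

print(enet_update(3.0, 1.0, 1.0))   # pure lasso: S(3, 1) = 2.0
print(enet_update(3.0, 1.0, 0.0))   # pure ridge: 3 / (1 + 1) = 1.5
print(enet_update(3.0, 1.0, 0.5))   # mix: S(3, 0.5) / 1.5 = 2.5 / 1.5
```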


Practical algorithm (pathwise coordinate descent)

  1. Standardize X and center y (important!).
  2. Define a decreasing sequence of λ values (start large so β=0 is optimal).
  3. For λ in sequence:
    • Initialize β with previous λ's solution (warm start).
    • Repeat until convergence:
      • For j in 1..p (cyclic) or a randomized order:
        • Compute partial residual r_j = y - Xβ + x_j β_j
        • Update β_j using soft-threshold (or scaled variant for EN)
    • Record β(λ)

Warm starts + pathwise λ are why cross-validated selection of λ is fast. You already learned earlier how to choose λ and α — CD gives a practical way to compute the solutions for those choices.
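Steps 1 and 2 above can be sketched as follows. The starting point uses the standard fact that β = 0 is optimal exactly when every coordinate fails the soft-threshold; the grid length and the 1000:1 ratio are conventional choices (and the synthetic data is ours), not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # step 1: standardize X ...
y = 2.0 * X[:, 0] + rng.standard_normal(n)
y = y - y.mean()                           # ... and center y

alpha = 1.0  # pure lasso
# beta = 0 is optimal exactly when |x_j^T y / n| <= lam * alpha for every j
lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)
# Step 2: log-spaced, decreasing grid from lam_max down to lam_max / 1000
lambdas = np.logspace(np.log10(lam_max), np.log10(lam_max / 1000.0), num=50)
```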


Implementation tips & tricks

  • Standardize features: Otherwise the coordinate updates have different scales and λ acts weirdly. If you don't standardize, account for column norms in the denominator.
  • Center y: removes intercept from CD updates; handle intercept separately.
  • Warm starts: huge speed-up for λ paths.
  • Active set cycling: often most βs become zero; iterate primarily on the non-zero set and occasionally check the zeros — this is a common performance hack.
  • Screening rules: use strong/SAFE rules to drop predictors before solving — less work.
  • Randomized vs cyclic: randomized can have better theoretical guarantees in some settings; cyclic is simple and often fast in practice.
  • Convergence tolerance: be careful — tiny tolerances are slow; for CV you can tolerate looser convergence.
  • Parallelism: naive CD is sequential. You can parallelize blocks of coordinates that are nearly orthogonal or use asynchronous updates, but that's more advanced and tricky.
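To make the active-set trick concrete, here is a small lasso solver (assuming standardized X and centered y; all names and the structure are our sketch, not a library API) that alternates full sweeps with cheap sweeps restricted to the nonzero coefficients:

```python
import numpy as np

def cd_sweep(X, y, beta, lam, idx):
    """One lasso coordinate-descent pass over idx; returns the largest change."""
    n = X.shape[0]
    max_change = 0.0
    for j in idx:
        r_j = y - X @ beta + X[:, j] * beta[j]     # partial residual
        z = X[:, j] @ r_j / n
        new = np.sign(z) * max(abs(z) - lam, 0.0)  # soft-threshold
        max_change = max(max_change, abs(new - beta[j]))
        beta[j] = new
    return max_change

def lasso_active_set(X, y, lam, tol=1e-8):
    p = X.shape[1]
    beta = np.zeros(p)
    while True:
        cd_sweep(X, y, beta, lam, range(p))             # full sweep: zeros may enter
        active = np.flatnonzero(beta)
        while cd_sweep(X, y, beta, lam, active) > tol:  # cheap sweeps on nonzeros
            pass
        if cd_sweep(X, y, beta, lam, range(p)) <= tol:  # nothing moved: done
            return beta
```

The final full sweep is the "occasionally check the zeros" step: if it moves nothing, the KKT conditions hold and we stop.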

Pseudocode (compact)

Standardize X, center y
β <- 0                                  # carried across λ values (warm start)
for λ in lambda_sequence:               # decreasing
  repeat:
    for j in 1..p:
      r_j = y - Xβ + x_j*β_j            # partial residual excluding feature j
      z = (1/n) * x_j.T @ r_j
      β_j = soft_threshold(z, λ*α) / (1 + λ*(1-α))
  until max |change in β_j| < tol
  store β(λ)
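In scikit-learn these solvers are built in; one naming trap is that sklearn's `alpha` is this lesson's λ, while its `l1_ratio` is this lesson's α. A quick sketch on synthetic data of ours:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(200)

# sklearn's alpha = this lesson's lambda; l1_ratio = this lesson's alpha
model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000).fit(X, y)
print(np.sum(model.coef_ == 0.0))   # noise coefficients end up exactly zero

# Pathwise coordinate descent with warm starts over a decreasing lambda grid
lambdas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(coefs.shape)                  # (30, 50): one coefficient vector per lambda
```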

Quick comparison

  • Coordinate Descent. Best for: Lasso, Elastic Net, sparse solutions. Strengths: simple updates, warm starts, sparsity, pathwise solutions. Weaknesses: mostly serial, needs standardization, non-trivial for non-separable losses.
  • Gradient-based methods. Best for: smooth convex problems, large-scale dense data. Strengths: easy parallelization, good theoretical rates. Weaknesses: struggle with L1 kinks unless a proximal variant is used.
  • Proximal/accelerated methods. Best for: large-scale convex problems with a non-smooth term. Strengths: fast convergence, handle composite objectives. Weaknesses: require a proximal operator; sometimes heavier per-iteration cost.

Final takeaways (so you remember tomorrow)

  • Coordinate Descent is the go-to solver for Lasso and Elastic Net because it turns messy, non-differentiable penalties into simple, fast per-coordinate updates.
  • Standardize, warm-start, and use pathwise λ sequences — that combo is performance gold.
  • CD gives you sparsity literally for free via soft-thresholding. If you're trying to select variables or compute regularization paths (remember choosing λ and mixing α?), CD is the pragmatic hero.

Bonus life lesson: sometimes solving a hard problem by focusing on one tiny thing at a time is not laziness — it’s strategy.

Now go run a coordinate descent implementation, watch predictors drop to zero like dominoes, and feel a smug sense of efficiency.
