Regression II: Regularization and Advanced Techniques
Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.
Coordinate Descent Algorithms — The Lazy Genius of Regularized Regression
"Why run when you can stroll one coordinate at a time and still win the race?" — Your future algorithm, probably.
You already know how linear regression loves its closed-form solution and how regularization (Ridge, Lasso, Elastic Net) makes that solution behave. You've just fine-tuned penalties (lambda) and mixed penalties (alpha), nice work. Now we ask: when those penalties make life non-smooth (hello, L1) or computationally heavy, how do we actually solve the optimization efficiently? Enter: Coordinate Descent, the deliberately single-minded, surprisingly fast algorithm that updates one parameter at a time.
What is Coordinate Descent? (Short version)
Coordinate Descent (CD) optimizes a multivariate objective by repeatedly optimizing with respect to a single parameter (coordinate) while holding the others fixed. Think of it as cleaning a messy room one drawer at a time rather than trying to fold everything at once.
- For squared-error loss with penalties, each coordinate update often has a closed-form (or simple proximal step). That makes CD elegant and extremely practical for Lasso and Elastic Net.
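To make the one-coordinate-at-a-time idea concrete before any penalties enter, here is a minimal sketch (not from the lesson; names and the toy problem are mine) of coordinate descent on a smooth quadratic, where each coordinate minimizer has a closed form:

```python
import numpy as np

def quadratic_cd(A, c, n_sweeps=100):
    # Coordinate descent for f(b) = 0.5 * b^T A b - c^T b, A symmetric positive definite.
    # Exact coordinate minimizer: b_j = (c_j - sum_{k != j} A[j, k] * b_k) / A[j, j]
    p = len(c)
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            b[j] = (c[j] - A[j] @ b + A[j, j] * b[j]) / A[j, j]
    return b

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # toy problem (made up)
c = np.array([1.0, 1.0])
b = quadratic_cd(A, c)                   # sweeps converge to the solution of A b = c
```

Each inner step solves the one-dimensional problem in coordinate j exactly with the other coordinates frozen; for a strictly convex quadratic, repeated sweeps converge to the global minimizer.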
Why prefer Coordinate Descent (vs. gradient methods)?
- Sparsity-friendly: L1 introduces kinks. Gradient descent struggles with non-differentiability at zero; CD uses soft-thresholding and handles sparsity like a champ.
- Closed-form cheap updates: For many problems (Lasso), the update for a coefficient is simple and O(n) in the number of samples.
- Warm starts & path algorithms: CD is ideal for computing regularization paths (varying lambda) — use previous solution as a starting point. You already saw why pathwise approaches help when choosing regularization strength.
- Memory efficiency: Works well when p (features) is large but each update only references one column at a time.
When not to use CD: if your objective couples all parameters through dense Hessians with no simple coordinate minimizer, or if you need heavy parallelism (naive CD is inherently serial). In those cases, accelerated gradient or second-order methods might win.
The magic formula — Lasso coordinate update
For Lasso (minimize (1/(2n)) ||y - Xβ||^2 + λ||β||_1), after centering y and standardizing the columns of X so that (1/n) x_j^T x_j = 1 (crucial!), the coordinate update for β_j is:
β_j <- S( (1/n) x_j^T (r + x_j β_j), λ )
where r = y - Xβ is the residual with the current β, and S is the soft-thresholding operator:
S(z, λ) = sign(z) * max(|z| - λ, 0)
In words: compute the unpenalized least-squares coefficient for coordinate j (that dot product), then shrink it toward zero by λ — if it crosses zero, it is set to exactly zero. That thresholding is precisely what produces sparsity.
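As a small illustration of the update above (a sketch, assuming columns of X are standardized so (1/n) x_j^T x_j = 1; function names are mine, not a library API):

```python
import numpy as np

def soft_threshold(z, lam):
    # S(z, λ) = sign(z) * max(|z| - λ, 0): shrink toward zero, clipping at zero
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_update(X, y, beta, j, lam):
    # One Lasso update for coordinate j; assumes X's columns are standardized
    # so that (1/n) x_j^T x_j = 1, as the text requires.
    n = X.shape[0]
    r = y - X @ beta                              # residual under the current beta
    z = X[:, j] @ (r + X[:, j] * beta[j]) / n     # unpenalized LS coefficient for j
    return soft_threshold(z, lam)
```

For example, soft_threshold(1.5, 1.0) gives 0.5, while soft_threshold(-0.3, 1.0) lands exactly on 0.0 — that clipping is where the zeros in a Lasso fit come from.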
Elastic Net: coordinate update with a tiny twist
Elastic Net (α mixes L1 and L2): objective = (1/(2n)) ||y - Xβ||^2 + λ(α||β||_1 + (1-α)/2 ||β||_2^2)
Coordinate update becomes a scaled soft-threshold:
β_j <- S( (1/n) x_j^T (r + x_j β_j), λα ) / (1 + λ(1-α))
So the L2 part simply shrinks the denominator (like Ridge), and the L1 part gives sparsity via soft-thresholding. This is why CD pairs so well with Elastic Net — you get closed-form, cheap updates, and a smooth handle on both penalties. Remember how we picked α earlier? CD makes those mixed-penalty problems tractable.
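The scaled soft-threshold is a one-liner; this small sketch (names are illustrative) shows how α interpolates between pure Lasso thresholding and pure Ridge shrinkage:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def enet_coordinate_update(z, lam, alpha):
    # L1 part (λα) thresholds toward zero; L2 part inflates the denominator
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```

With z = 1.5 and λ = 1, α = 1 recovers the Lasso update (1.5 thresholds to 0.5), while α = 0 is pure Ridge-style shrinkage (1.5 / 2 = 0.75).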
Practical algorithm (pathwise coordinate descent)
- Standardize X and center y (important!).
- Define a decreasing sequence of λ values (start large enough that β = 0 is optimal).
- For each λ in the sequence:
  - Initialize β with the previous λ's solution (warm start).
  - Repeat until convergence:
    - For j in 1..p (cyclic) or in a randomized order:
      - Compute the partial residual r_j = y - Xβ + x_j β_j
      - Update β_j using the soft-threshold (or the scaled variant for EN)
  - Record β(λ)
Warm starts + pathwise λ are why cross-validated selection of λ is fast. You already learned earlier how to choose λ and α — CD gives a practical way to compute the solutions for those choices.
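Here is one way to sketch a pathwise, warm-started fit with scikit-learn's Lasso (the data and λ grid are made up for illustration; sklearn's `alpha` parameter plays the role of λ under the same (1/(2n)) scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize columns
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]              # sparse ground truth (made up)
y = X @ beta_true + 0.1 * rng.standard_normal(n)
y = y - y.mean()                              # center y; fit without intercept

# Largest λ: the smallest value at which β = 0 is already optimal
lam_max = np.max(np.abs(X.T @ y)) / n
lams = lam_max * np.logspace(0, -3, 30)       # decreasing λ sequence

model = Lasso(warm_start=True, fit_intercept=False)
path = []
for lam in lams:
    model.set_params(alpha=lam)               # sklearn's alpha is our λ
    model.fit(X, y)                           # warm start reuses the previous coef_
    path.append(model.coef_.copy())
# Early in the path every coefficient is zero; as λ shrinks,
# the three true predictors enter and dominate.
```

Because each fit starts from the previous solution, the whole path costs far less than thirty cold fits — exactly the speed-up the text describes.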
Implementation tips & tricks
- Standardize features: otherwise the coordinate updates live on different scales and λ penalizes coefficients unevenly. If you don't standardize, divide by the column norm (1/n) x_j^T x_j in the update.
- Center y: removes intercept from CD updates; handle intercept separately.
- Warm starts: huge speed-up for λ paths.
- Active set cycling: often most βs become zero; iterate primarily on the non-zero set and occasionally check the zeros — this is a common performance hack.
- Screening rules: use strong/SAFE rules to drop predictors before solving — less work.
- Randomized vs cyclic: randomized can have better theoretical guarantees in some settings; cyclic is simple and often fast in practice.
- Convergence tolerance: be careful — tiny tolerances are slow; for CV you can tolerate looser convergence.
- Parallelism: naive CD is sequential. You can parallelize blocks of coordinates that are nearly orthogonal or use asynchronous updates, but that's more advanced and tricky.
Pseudocode (compact)
```
Standardize X, center y
for λ in lambda_sequence:
    β <- β_prev               # warm start
    repeat:
        for j in 1..p:
            r_j = y - Xβ + x_j * β_j
            z   = (1/n) * x_j.T @ r_j
            β_j = soft_threshold(z, λ*α) / (1 + λ*(1-α))
    until convergence
    store β
```
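A runnable NumPy version of this inner loop for a single (λ, α) pair — a sketch, not a library API, assuming X has standardized columns with (1/n) x_j^T x_j = 1 and y is centered; it also maintains the residual incrementally instead of recomputing y - Xβ from scratch each time:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def enet_cd(X, y, lam, alpha=1.0, tol=1e-10, max_iter=1000):
    """Cyclic coordinate descent for Elastic Net at one (λ, α).
    Assumes standardized columns ((1/n) x_j^T x_j = 1) and centered y."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()                # residual y - Xβ, kept in sync
    for _ in range(max_iter):
        max_delta = 0.0
        for j in range(p):
            b_old = beta[j]
            # (1/n) x_j^T (r + x_j β_j) simplifies because (1/n) x_j^T x_j = 1
            z = X[:, j] @ r / n + b_old
            beta[j] = soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            if beta[j] != b_old:
                r += X[:, j] * (b_old - beta[j])   # incremental residual update
                max_delta = max(max_delta, abs(beta[j] - b_old))
        if max_delta < tol:
            break
    return beta
```

A handy sanity check: with an orthogonal standardized design, the solution is just the (scaled) soft-threshold of each univariate least-squares coefficient, so the routine's output can be verified by hand.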
Quick comparative table
| Method | Best for | Strengths | Weaknesses |
|---|---|---|---|
| Coordinate Descent | Lasso, Elastic Net, sparse solutions | Simple updates, warm starts, sparsity, pathwise solutions | Mostly serial, needs standardization, non-trivial for non-separable losses |
| Gradient-based methods | Smooth convex problems, large-scale dense data | Easy parallelization, good theoretical rates | Struggle with L1 kinks unless proximal variant used |
| Proximal/Accelerated methods | Large-scale convex with non-smooth term | Fast convergence, handles composite problems | Requires proximal operator; sometimes heavier per-iteration cost |
Final takeaways (so you remember tomorrow)
- Coordinate Descent is the go-to solver for Lasso and Elastic Net because it turns messy, non-differentiable penalties into simple, fast per-coordinate updates.
- Standardize, warm-start, and use pathwise λ sequences — that combo is performance gold.
- CD gives you sparsity literally for free via soft-thresholding. If you're trying to select variables or compute regularization paths (remember choosing λ and mixing α?), CD is the pragmatic hero.
Bonus life lesson: sometimes solving a hard problem by focusing on one tiny thing at a time is not laziness — it’s strategy.
Now go run a coordinate descent implementation, watch predictors drop to zero like dominoes, and feel a smug sense of efficiency.