Regression II: Regularization and Advanced Techniques
Control complexity and improve generalization using ridge, lasso, elastic net, and specialized regressors.
Coordinate Descent Algorithms — The Lazy Genius of Regularized Regression
"Why run when you can stroll one coordinate at a time and still win the race?" — Your future algorithm, probably.
You already know how linear regression loves its closed-form solution and how regularization (Ridge, Lasso, Elastic Net) makes that solution behave. You've just fine-tuned penalties (lambda) and mixed penalties (alpha), nice work. Now we ask: when those penalties make life non-smooth (hello, L1) or computationally heavy, how do we actually solve the optimization efficiently? Enter: Coordinate Descent, the deliberately single-minded, surprisingly fast algorithm that updates one parameter at a time.
What is Coordinate Descent? (Short version)
Coordinate Descent (CD) optimizes a multivariate objective by repeatedly optimizing with respect to a single parameter (coordinate) while holding the others fixed. Think of it as cleaning a messy room one drawer at a time rather than trying to fold everything at once.
- For squared-error loss with penalties, each coordinate update often has a closed-form (or simple proximal step). That makes CD elegant and extremely practical for Lasso and Elastic Net.
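To make the one-coordinate-at-a-time idea concrete before any penalties enter, here is a minimal sketch (not from the lesson; names and the toy problem are mine) of coordinate descent on a smooth quadratic, where each coordinate minimizer has a closed form:

```python
import numpy as np

def quadratic_cd(A, c, n_sweeps=100):
    # Coordinate descent for f(b) = 0.5 * b^T A b - c^T b, A symmetric positive definite.
    # Exact coordinate minimizer: b_j = (c_j - sum_{k != j} A[j, k] * b_k) / A[j, j]
    p = len(c)
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            b[j] = (c[j] - A[j] @ b + A[j, j] * b[j]) / A[j, j]
    return b

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # toy problem (made up)
c = np.array([1.0, 1.0])
b = quadratic_cd(A, c)                   # sweeps converge to the solution of A b = c
```

Each inner step solves the one-dimensional problem in coordinate j exactly with the other coordinates frozen; for a strictly convex quadratic, repeated sweeps converge to the global minimizer.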
Why prefer Coordinate Descent (vs. gradient methods)?
- Sparsity-friendly: L1 introduces kinks. Gradient descent struggles with non-differentiability at zero; CD uses soft-thresholding and handles sparsity like a champ.
- Closed-form cheap updates: For many problems (Lasso), the update for a coefficient is simple and O(n) in the number of samples.
- Warm starts & path algorithms: CD is ideal for computing regularization paths (varying lambda) — use previous solution as a starting point. You already saw why pathwise approaches help when choosing regularization strength.
- Memory efficiency: Works well when p (features) is large but each update only references one column at a time.
When not to use CD: if your objective couples all parameters through dense Hessians with no simple coordinate minimizer, or if you need heavy parallelism (naive CD is inherently serial). In those cases, accelerated gradient or second-order methods might win.
The magic formula — Lasso coordinate update
For Lasso (minimize (1/(2n)) ||y - Xβ||^2 + λ||β||_1), after centering y and standardizing the columns of X so that (1/n) x_j^T x_j = 1 (crucial!), the coordinate update for β_j is:
β_j <- S( (1/n) x_j^T (r + x_j β_j), λ )
where r = y - Xβ is the residual with the current β, and S is the soft-thresholding operator:
S(z, λ) = sign(z) * max(|z| - λ, 0)
In words: compute the unpenalized least-squares coefficient for coordinate j (that dot product), then shrink it toward zero by λ — if it crosses zero, it is set to exactly zero. That thresholding is precisely what produces sparsity.
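As a small illustration of the update above (a sketch, assuming columns of X are standardized so (1/n) x_j^T x_j = 1; function names are mine, not a library API):

```python
import numpy as np

def soft_threshold(z, lam):
    # S(z, λ) = sign(z) * max(|z| - λ, 0): shrink toward zero, clipping at zero
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_update(X, y, beta, j, lam):
    # One Lasso update for coordinate j; assumes X's columns are standardized
    # so that (1/n) x_j^T x_j = 1, as the text requires.
    n = X.shape[0]
    r = y - X @ beta                              # residual under the current beta
    z = X[:, j] @ (r + X[:, j] * beta[j]) / n     # unpenalized LS coefficient for j
    return soft_threshold(z, lam)
```

For example, soft_threshold(1.5, 1.0) gives 0.5, while soft_threshold(-0.3, 1.0) lands exactly on 0.0 — that clipping is where the zeros in a Lasso fit come from.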
Elastic Net: coordinate update with a tiny twist
Elastic Net (α mixes L1 and L2): objective = (1/(2n)) ||y - Xβ||^2 + λ(α||β||_1 + (1-α)/2 ||β||_2^2)
Coordinate update becomes a scaled soft-threshold:
β_j <- S( (1/n) x_j^T (r + x_j β_j), λα ) / (1 + λ(1-α))
So the L2 part simply shrinks the denominator (like Ridge), and the L1 part gives sparsity via soft-thresholding. This is why CD pairs so well with Elastic Net — you get closed-form, cheap updates, and a smooth handle on both penalties. Remember how we picked α earlier? CD makes those mixed-penalty problems tractable.
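The scaled soft-threshold is a one-liner; this small sketch (names are illustrative) shows how α interpolates between pure Lasso thresholding and pure Ridge shrinkage:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def enet_coordinate_update(z, lam, alpha):
    # L1 part (λα) thresholds toward zero; L2 part inflates the denominator
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```

With z = 1.5 and λ = 1, α = 1 recovers the Lasso update (1.5 thresholds to 0.5), while α = 0 is pure Ridge-style shrinkage (1.5 / 2 = 0.75).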
Practical algorithm (pathwise coordinate descent)
- Standardize X and center y (important!).
- Define a decreasing sequence of λ values (start large enough that β = 0 is optimal).
- For each λ in the sequence:
  - Initialize β with the previous λ's solution (warm start).
  - Repeat until convergence:
    - For j in 1..p (cyclic) or in a randomized order:
      - Compute the partial residual r_j = y - Xβ + x_j β_j
      - Update β_j using the soft-threshold (or the scaled variant for EN)
  - Record β(λ)
Warm starts + pathwise λ are why cross-validated selection of λ is fast. You already learned earlier how to choose λ and α — CD gives a practical way to compute the solutions for those choices.
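Here is one way to sketch a pathwise, warm-started fit with scikit-learn's Lasso (the data and λ grid are made up for illustration; sklearn's `alpha` parameter plays the role of λ under the same (1/(2n)) scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize columns
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]              # sparse ground truth (made up)
y = X @ beta_true + 0.1 * rng.standard_normal(n)
y = y - y.mean()                              # center y; fit without intercept

# Largest λ: the smallest value at which β = 0 is already optimal
lam_max = np.max(np.abs(X.T @ y)) / n
lams = lam_max * np.logspace(0, -3, 30)       # decreasing λ sequence

model = Lasso(warm_start=True, fit_intercept=False)
path = []
for lam in lams:
    model.set_params(alpha=lam)               # sklearn's alpha is our λ
    model.fit(X, y)                           # warm start reuses the previous coef_
    path.append(model.coef_.copy())
# Early in the path every coefficient is zero; as λ shrinks,
# the three true predictors enter and dominate.
```

Because each fit starts from the previous solution, the whole path costs far less than thirty cold fits — exactly the speed-up the text describes.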
Implementation tips & tricks
- Standardize features: otherwise the coordinate updates live on different scales and λ penalizes coefficients unevenly. If you don't standardize, divide by the column norm (1/n) x_j^T x_j in the update.
- Center y: removes intercept from CD updates; handle intercept separately.
- Warm starts: huge speed-up for λ paths.
- Active set cycling: often most βs become zero; iterate primarily on the non-zero set and occasionally check the zeros — this is a common performance hack.
- Screening rules: use strong/SAFE rules to drop predictors before solving — less work.
- Randomized vs cyclic: randomized can have better theoretical guarantees in some settings; cyclic is simple and often fast in practice.
- Convergence tolerance: be careful — tiny tolerances are slow; for CV you can tolerate looser convergence.
- Parallelism: naive CD is sequential. You can parallelize blocks of coordinates that are nearly orthogonal or use asynchronous updates, but that's more advanced and tricky.
Pseudocode (compact)
```
Standardize X, center y
for λ in lambda_sequence:
    β <- β_prev               # warm start
    repeat:
        for j in 1..p:
            r_j = y - Xβ + x_j * β_j
            z   = (1/n) * x_j.T @ r_j
            β_j = soft_threshold(z, λ*α) / (1 + λ*(1-α))
    until convergence
    store β
```
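A runnable NumPy version of this inner loop for a single (λ, α) pair — a sketch, not a library API, assuming X has standardized columns with (1/n) x_j^T x_j = 1 and y is centered; it also maintains the residual incrementally instead of recomputing y - Xβ from scratch each time:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def enet_cd(X, y, lam, alpha=1.0, tol=1e-10, max_iter=1000):
    """Cyclic coordinate descent for Elastic Net at one (λ, α).
    Assumes standardized columns ((1/n) x_j^T x_j = 1) and centered y."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()                # residual y - Xβ, kept in sync
    for _ in range(max_iter):
        max_delta = 0.0
        for j in range(p):
            b_old = beta[j]
            # (1/n) x_j^T (r + x_j β_j) simplifies because (1/n) x_j^T x_j = 1
            z = X[:, j] @ r / n + b_old
            beta[j] = soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            if beta[j] != b_old:
                r += X[:, j] * (b_old - beta[j])   # incremental residual update
                max_delta = max(max_delta, abs(beta[j] - b_old))
        if max_delta < tol:
            break
    return beta
```

A handy sanity check: with an orthogonal standardized design, the solution is just the (scaled) soft-threshold of each univariate least-squares coefficient, so the routine's output can be verified by hand.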
Quick comparative table
| Method | Best for | Strengths | Weaknesses |
|---|---|---|---|
| Coordinate Descent | Lasso, Elastic Net, sparse solutions | Simple updates, warm starts, sparsity, pathwise solutions | Mostly serial, needs standardization, non-trivial for non-separable losses |
| Gradient-based methods | Smooth convex problems, large-scale dense data | Easy parallelization, good theoretical rates | Struggle with L1 kinks unless proximal variant used |
| Proximal/Accelerated methods | Large-scale convex with non-smooth term | Fast convergence, handles composite problems | Requires proximal operator; sometimes heavier per-iteration cost |
Final takeaways (so you remember tomorrow)
- Coordinate Descent is the go-to solver for Lasso and Elastic Net because it turns messy, non-differentiable penalties into simple, fast per-coordinate updates.
- Standardize, warm-start, and use pathwise λ sequences — that combo is performance gold.
- CD gives you sparsity literally for free via soft-thresholding. If you're trying to select variables or compute regularization paths (remember choosing λ and mixing α?), CD is the pragmatic hero.
Bonus life lesson: sometimes solving a hard problem by focusing on one tiny thing at a time is not laziness — it’s strategy.
Now go run a coordinate descent implementation, watch predictors drop to zero like dominoes, and feel a smug sense of efficiency.