Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
Regularized Logistic Regression: Taming Coefs Without Losing Your Mind
"Logistic regression is simple — until your features start auditioning for a reality show and overfitting the training set."
You already saw logistic regression through the probabilistic lens (logit link) and learned to fit it via maximum likelihood. You also met regularization in regression land (Ridge, Lasso, Elastic Net). Now we combine those powers: regularized logistic regression. Same elegant probabilistic model; now with a leash on the parameters so they don't go wild and ruin generalization.
Why regularize logistic regression?
- Because MLE loves to overfit. Maximum likelihood estimation maximizes the training log-likelihood. If you have lots of features, noisy predictors, or near-separable classes, MLE will happily inflate coefficients to fit noise.
- Because coefficients can blow up in separation. When classes are (nearly) linearly separable, MLE may not converge — coefficients diverge to infinity. Regularization saves the optimizer and the model.
- Because we want stable, generalizable decision boundaries. Regularization penalizes complex models and often improves out-of-sample performance.
Think of regularization as a responsible friend who says, "Maybe don’t let your coefficients go all-in on that suspicious-looking feature."
Formulations: Penalized likelihood and constrained view
Starting from negative log-likelihood (NLL) for logistic regression:
NLL(w) = - sum_{i=1}^n [ y_i log p(x_i) + (1-y_i) log(1-p(x_i)) ]
where p(x) = sigma(w^T x + b) and sigma(z) = 1 / (1 + e^{-z}) is the logistic (sigmoid) function.
Penalized (regularized) objective — minimize:
J(w) = NLL(w) + lambda * R(w)
Common choices:
- L2 (Ridge): R(w) = 1/2 * ||w||_2^2 => smooth shrinkage
- L1 (Lasso): R(w) = ||w||_1 => sparsity, feature selection
- Elastic Net: R(w) = alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2 => blends sparsity and shrinkage
Constrained view (equivalent under proper mapping): minimize NLL(w) subject to R(w) <= t. This perspective is useful to think of regularization strength as a budget on model complexity.
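To make the penalized objective concrete, here is a minimal numpy sketch that evaluates J(w) for the three penalties above; the names `penalized_nll`, `lam`, and `alpha` are illustrative (not from any particular library), and the intercept b is left out of the penalty, a point revisited in the tips below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_nll(w, b, X, y, lam, penalty="l2", alpha=0.5):
    """Binary NLL plus a penalty; y in {0, 1}, X of shape (n, p)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12                                # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    if penalty == "l1":
        reg = np.sum(np.abs(w))
    elif penalty == "l2":
        reg = 0.5 * np.sum(w ** 2)
    else:                                      # elastic net
        reg = alpha * np.sum(np.abs(w)) + (1 - alpha) * 0.5 * np.sum(w ** 2)
    return nll + lam * reg                     # the intercept b is not penalized
```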
Bayesian interpretation (quick and delightful)
- L2 penalty corresponds to a Gaussian prior on w (mean 0, variance proportional to 1/lambda). Intuition: we believe weights should be near zero, with normally distributed fluctuations.
- L1 penalty corresponds to a Laplace (double-exponential) prior. Intuition: many weights should be exactly zero (sparse), but some may be large.
So regularization = inserting prior beliefs into the likelihood: maximizing the posterior p(w | data) ∝ likelihood × prior is the same as minimizing NLL(w) - log p(w), and the negative log of a Gaussian (resp. Laplace) prior is exactly the L2 (resp. L1) penalty up to constants, with lambda set by the prior's scale. Probabilistic people nod approvingly.
How regularization changes optimization
Objective to minimize: J(w) = NLL(w) + lambda*R(w)
- For L2: gradient adds lambda * w. Hessian gets lambda on diagonal (very convenient for Newton methods).
- For L1: nondifferentiable at zero — use coordinate descent, proximal methods, or subgradient methods.
If you used IRLS / Newton-Raphson for logistic regression, L2 regularization is a friendly guest: simply add lambda to the diagonal of the Hessian (or add lambda * w to the gradient) when computing updates. L1, less friendly, invites coordinate descent or proximal Newton tricks.
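As a rough illustration of that "add lambda to the diagonal" remark, here is a hedged numpy sketch of a single Newton/IRLS update with an L2 penalty; `newton_step_l2` and the convention that the first column of `X1` is the intercept are assumptions of this sketch, not a library API.

```python
import numpy as np

def newton_step_l2(w, X1, y, lam):
    """One Newton/IRLS update for L2-regularized logistic regression.

    X1 carries a leading column of ones for the intercept; the penalty
    matrix leaves that intercept coefficient unpenalized.
    """
    n, d = X1.shape
    p = 1.0 / (1.0 + np.exp(-(X1 @ w)))        # current probabilities
    P = lam * np.eye(d)
    P[0, 0] = 0.0                              # do not penalize the intercept
    grad = X1.T @ (p - y) / n + P @ w          # NLL gradient plus L2 term
    S = p * (1.0 - p)                          # IRLS weights
    H = (X1.T * S) @ X1 / n + P                # Hessian with lambda on the diagonal
    return w - np.linalg.solve(H, grad)
```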
Pseudocode (batch gradient descent with L2)
```python
# Full-batch gradient descent on the penalized objective; sigmoid is as in
# the sketch above, and X, y, w, b, n, lr, max_iter, lam are assumed defined
# (lam is the strength; "lambda" is a reserved word in Python).
# The intercept b is deliberately left out of the L2 term.
for _ in range(max_iter):
    p = sigmoid(X @ w + b)                  # predicted probabilities
    grad_w = X.T @ (p - y) / n + lam * w    # NLL gradient plus L2 term
    grad_b = np.sum(p - y) / n              # intercept gradient (unpenalized)
    w -= lr * grad_w
    b -= lr * grad_b
```
L1 vs L2 vs Elastic Net — the quick compare table
| Property | L2 (Ridge) | L1 (Lasso) | Elastic Net |
|---|---|---|---|
| Shrinks coefficients | Yes | Yes | Yes |
| Produces exact zeros | No | Yes | Sometimes |
| Works well with correlated features | Yes (distributes weight) | Tends to pick one | Can balance both |
| Optimization | Smooth, convex | Convex, nondifferentiable at 0 | Convex, trickier |
| Bayesian prior | Gaussian | Laplace | Mixture-ish |
Practical tips and gotchas
- Always standardize features before regularization. L1 and L2 penalties are scale-sensitive: a feature measured on a huge scale needs only a tiny coefficient to matter, so it ends up penalized less than features on small scales, for no good reason.
- Choose lambda with cross-validation. Lambda is the thermostat that controls the bias-variance trade-off. Use grid search with k-fold CV or more advanced search (e.g., a log-space grid plus warm starts), and watch class imbalance: use stratified CV. A scikit-learn sketch follows this list.
- Warm starts matter. When computing regularization paths (sequence of lambdas), initialize from previous solution to speed up.
- If you want interpretability, L1 (or Elastic Net) helps. L1 produces sparse models; Elastic Net handles correlated predictors better than pure L1.
- Remember the separation issue. If classes are perfectly separable, MLE diverges; L2 rescues by producing finite coefficients.
- Regularize the bias term carefully. Often do NOT penalize the intercept b. It shifts the decision boundary without increasing model complexity in the same way as slope coefficients.
- Optimization defaults matter. Libraries like scikit-learn offer several solvers (lbfgs, liblinear, saga, ...); choose one that supports your penalty and scales to your data. For example, liblinear and saga handle L1, and saga handles Elastic Net.
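Putting a few of these tips together, here is a minimal scikit-learn sketch (assuming scikit-learn is available and `X_train`, `y_train` exist): standardize inside a pipeline, then cross-validate the regularization strength. Keep in mind scikit-learn parameterizes strength as C = 1/lambda, so small C means strong regularization.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

clf = make_pipeline(
    StandardScaler(),                  # standardize features before penalizing
    LogisticRegressionCV(
        Cs=20,                         # log-spaced grid of C values to try
        cv=5,                          # stratified 5-fold CV for classifiers
        penalty="l2",
        solver="lbfgs",
        scoring="roc_auc",
        max_iter=5000,
    ),
)
clf.fit(X_train, y_train)              # X_train, y_train assumed to exist
```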
Advanced: regularization paths and model selection
- Compute a sequence of lambdas from large (all coefficients ≈ 0) to small (near-MLE) and track coefficient trajectories. These "paths" show how features enter the model (a minimal sketch follows this list).
- Use AIC/BIC cautiously — they assume large-sample properties; CV is more robust for predictive tasks.
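A hedged sketch of such a path for an L1 penalty, refitting along an increasing-C (decreasing-lambda) grid with warm starts; the grid bounds and solver choice here are illustrative, and `X`, `y` are assumed to be standardized already.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

cs = np.logspace(-2, 2, 30)                 # C = 1/lambda; small C = heavy penalty
clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True,
                         max_iter=10_000)
path = []
for c in cs:
    clf.set_params(C=c)
    clf.fit(X, y)                           # warm-started from the previous C
    path.append(clf.coef_.ravel().copy())   # snapshot of the coefficients
path = np.array(path)                       # shape: (len(cs), n_features)
```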
Multiclass extension
For multinomial logistic (softmax), regularization is exactly the same idea: add penalties on the weight matrix. Use L2 for smoothness; use group-lasso variants if you want entire features excluded across all classes.
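For concreteness, a minimal numpy sketch of the L2-penalized softmax objective; `softmax_nll_l2`, the one-hot `Y`, and the shape conventions are assumptions of this sketch, not a standard API.

```python
import numpy as np

def softmax_nll_l2(W, X, Y, lam):
    """W: (n_features, n_classes) weights; Y: one-hot labels of shape (n, n_classes)."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.sum(Y * np.log(probs + 1e-12)) / X.shape[0]
    return nll + 0.5 * lam * np.sum(W ** 2)          # penalty over the whole matrix
```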
Closing: TL;DR + Challenge
- Regularized logistic regression = NLL + penalty. L2 for stable, smooth shrinkage; L1 for sparsity; Elastic Net for both worlds.
- Regularization fixes separation, reduces variance, and encodes prior beliefs (Bayesian view).
- Standardize your features. Cross-validate lambda. Don’t penalize the intercept.
Quote for the road:
"Regularization is the humility pill for machine learning models — it reminds them they do not know everything."
Challenge: take a high-dimensional classification dataset (p > n) and fit three models: L2, L1, and Elastic Net. Plot coefficient paths vs log(lambda), and report which features survive and how that affects test AUC. Try with and without feature standardization — notice the chaos when you skip it.
Go forth and regularize responsibly.