© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification
Classification I: Logistic Regression and Probabilistic View


Model class probabilities with logistic regression and related probabilistic classifiers.

Regularized Logistic Regression: Taming Coefs Without Losing Your Mind

"Logistic regression is simple — until your features start auditioning for a reality show and overfitting the training set."

You already saw logistic regression through the probabilistic lens (logit link) and learned to fit it via maximum likelihood. You also met regularization in regression land (Ridge, Lasso, Elastic Net). Now we combine those powers: regularized logistic regression. Same elegant probabilistic model; now with a leash on the parameters so they don't go wild and ruin generalization.


Why regularize logistic regression?

  • Because MLE loves to overfit. Maximum likelihood estimation maximizes the training log-likelihood. If you have lots of features, noisy predictors, or near-separable classes, MLE will happily inflate coefficients to fit noise.
  • Because coefficients can blow up in separation. When classes are (nearly) linearly separable, MLE may not converge — coefficients diverge to infinity. Regularization saves the optimizer and the model.
  • Because we want stable, generalizable decision boundaries. Regularization penalizes complex models and often improves out-of-sample performance.

Think of regularization as a responsible friend who says, "Maybe don’t let your coefficients go all-in on that suspicious-looking feature."
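The separation problem is easy to see concretely. A minimal scikit-learn sketch (the data and C values are illustrative; recall that scikit-learn's C is the inverse of lambda, so a huge C approximates unregularized MLE):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: every x < 0 is class 0, every x > 0 is class 1.
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# C = 1/lambda in scikit-learn: huge C ~ (almost) unregularized MLE.
near_mle = LogisticRegression(C=1e8, max_iter=10000).fit(X, y)
ridge = LogisticRegression(C=1.0).fit(X, y)

print(abs(near_mle.coef_[0, 0]))  # large: the coefficient is chasing infinity
print(abs(ridge.coef_[0, 0]))     # modest and finite
```

The near-MLE fit keeps growing its coefficient until the solver's tolerance cuts it off; the L2-penalized fit settles on a finite value.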


Formulations: Penalized likelihood and constrained view

Starting from negative log-likelihood (NLL) for logistic regression:

NLL(w) = - sum_{i=1}^n [ y_i log p(x_i) + (1 - y_i) log(1 - p(x_i)) ]
where p(x) = sigma(w^T x + b) and sigma(z) = 1 / (1 + e^{-z}) is the sigmoid

Penalized (regularized) objective — minimize:

J(w) = NLL(w) + lambda * R(w)

Common choices:

  • L2 (Ridge): R(w) = 1/2 * ||w||_2^2 => smooth shrinkage
  • L1 (Lasso): R(w) = ||w||_1 => sparsity, feature selection
  • Elastic Net: R(w) = alpha*||w||_1 + (1-alpha)/2 * ||w||_2^2 => blend of both

Constrained view (equivalent under proper mapping): minimize NLL(w) subject to R(w) <= t. This perspective is useful to think of regularization strength as a budget on model complexity.
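The penalized objective above can be written down almost verbatim. A minimal NumPy sketch (the function name penalized_nll and its defaults are mine, not a library API; the intercept b is deliberately left out of the penalty):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_nll(w, b, X, y, lam, penalty="l2", alpha=0.5):
    """J(w) = NLL(w) + lam * R(w); the intercept b is not penalized."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    if penalty == "l2":
        r = 0.5 * np.sum(w ** 2)
    elif penalty == "l1":
        r = np.sum(np.abs(w))
    else:  # elastic net: alpha*||w||_1 + (1-alpha)/2*||w||_2^2
        r = alpha * np.sum(np.abs(w)) + (1 - alpha) * 0.5 * np.sum(w ** 2)
    return nll + lam * r
```

Sanity check: at w = 0, b = 0 every prediction is 0.5, so the NLL is n*log(2) and the penalty contributes nothing.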


Bayesian interpretation (quick and delightful)

  • L2 penalty corresponds to a Gaussian prior on w (mean 0, variance proportional to 1/lambda). Intuition: we believe weights should be near zero, with normally distributed fluctuations.
  • L1 penalty corresponds to a Laplace (double-exponential) prior. Intuition: many weights should be exactly zero (sparse), but some may be large.

So regularization = inserting prior beliefs into the likelihood. Probabilistic people nod approvingly.


How regularization changes optimization

Objective to minimize: J(w) = NLL(w) + lambda*R(w)

  • For L2: gradient adds lambda * w. Hessian gets lambda on diagonal (very convenient for Newton methods).
  • For L1: nondifferentiable at zero — use coordinate descent, proximal methods, or subgradient methods.

If you used IRLS / Newton-Raphson for logistic regression, L2 regularization is a friendly guest: simply add lambda * w to the gradient and lambda to the diagonal of the Hessian when computing updates. L1, less friendly, calls for coordinate descent or proximal-Newton tricks.

Batch gradient descent with L2 (runnable NumPy sketch; note the intercept is not penalized):

import numpy as np

def fit_l2_logistic(X, y, lam, lr=0.1, max_iter=1000):
    # lam = regularization strength ("lambda" is a Python keyword)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
        w -= lr * (X.T @ (p - y) / n + lam * w)  # NLL gradient + L2 term
        b -= lr * np.sum(p - y) / n              # intercept: no penalty
    return w, b
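The L1 case needs a different update, because ||w||_1 is nondifferentiable at zero. A minimal proximal-gradient (ISTA-style) sketch, assuming the same setup: take a gradient step on the smooth NLL part only, then apply the soft-thresholding operator, which is the proximal map of the L1 penalty (function names are mine):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink toward zero, clip at zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_l1_logistic_ista(X, y, lam, lr=0.1, max_iter=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        grad_w = X.T @ (p - y) / n              # gradient of smooth NLL part only
        b -= lr * np.sum(p - y) / n             # intercept: plain step, unpenalized
        w = soft_threshold(w - lr * grad_w, lr * lam)  # proximal step does the L1 work
    return w, b
```

With a large enough lam, the soft-threshold step zeroes out every coefficient at each iteration, which is exactly the sparsity L1 promises.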

L1 vs L2 vs Elastic Net — the quick compare table

Property                            | L2 (Ridge)               | L1 (Lasso)                | Elastic Net
Shrinks coefficients                | Yes                      | Yes                       | Yes
Produces exact zeros                | No                       | Yes                       | Sometimes
Works well with correlated features | Yes (distributes weight) | Tends to pick one         | Can balance both
Optimization                        | Smooth, convex           | Convex, nondifferentiable | Convex, trickier
Bayesian prior                      | Gaussian                 | Laplace                   | Mixture-ish

Practical tips and gotchas

  • Always standardize features before regularizing. L1 and L2 penalties are scale-sensitive: a feature measured on a large scale needs only a tiny coefficient, so the penalty barely touches it, while small-scale features get squeezed unfairly hard.
  • Choose lambda with cross-validation. Lambda is the thermostat that controls bias-variance. Use grid search with k-fold CV or more advanced search (e.g., log-space grid + warm starts). Watch class imbalance — use stratified CV.
  • Warm starts matter. When computing regularization paths (sequence of lambdas), initialize from previous solution to speed up.
  • If you want interpretability, L1 (or Elastic Net) helps. L1 produces sparse models; Elastic Net handles correlated predictors better than pure L1.
  • Remember the separation issue. If classes are perfectly separable, MLE diverges; L2 rescues by producing finite coefficients.
  • Regularize the bias term carefully. Often do NOT penalize the intercept b. It shifts the decision boundary without increasing model complexity in the same way as slope coefficients.
  • Optimization defaults matter. Libraries like scikit-learn offer several solvers (liblinear, lbfgs, saga); pick one that supports your penalty (e.g., saga for L1/Elastic Net) and scales to your data size.
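Several of these tips fit together naturally: standardize inside a pipeline (so CV folds are scaled without leakage), use stratified k-fold CV over a log-spaced grid, and pick a solver that supports the penalty. A sketch assuming scikit-learn and a synthetic dataset (the grid values are illustrative; remember C = 1/lambda):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Scaling lives inside the pipeline, so each CV fold is standardized on its
# own training split -- no leakage from the validation fold.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", max_iter=5000),
)

# scikit-learn's C is the INVERSE of lambda: small C = strong regularization.
grid = {"logisticregression__C": np.logspace(-2, 2, 9)}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, grid, cv=cv, scoring="roc_auc").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Swapping the scorer, the penalty, or the grid is a one-line change; the pattern stays the same.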

Advanced: regularization paths and model selection

  • Compute a sequence of lambdas from large (all coefficients ≈ 0) to small (near-MLE) and track coefficient trajectories. These "paths" show how features enter the model.
  • Use AIC/BIC cautiously — they assume large-sample properties; CV is more robust for predictive tasks.
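A regularization path with warm starts can be sketched directly in scikit-learn: sweep C from strong to weak regularization, reusing the previous solution as the starting point via warm_start (the dataset and grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# Sweep C (= 1/lambda) from strong to weak regularization. warm_start=True
# reuses the previous coefficients as the initial point for each refit.
Cs = np.logspace(-3, 2, 20)
clf = LogisticRegression(penalty="l1", solver="saga",
                         warm_start=True, max_iter=5000)
path = []
for C in Cs:
    clf.set_params(C=C)
    clf.fit(X, y)
    path.append(clf.coef_.ravel().copy())
path = np.array(path)  # shape (n_Cs, n_features): coefficient trajectories
print((path != 0).sum(axis=1))  # number of active features at each C
```

Plotting each column of path against log(C) gives the classic picture of features entering the model as the penalty loosens.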

Multiclass extension

For multinomial logistic (softmax), regularization is exactly the same idea: add penalties on the weight matrix. Use L2 for smoothness; use group-lasso variants if you want entire features excluded across all classes.
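In scikit-learn this is the default behavior: LogisticRegression fits the softmax model for multiclass targets and applies the penalty to the full weight matrix, one row of coefficients per class (the dataset below is synthetic and illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Softmax regression with an L2 penalty on the whole weight matrix; C = 1/lambda.
clf = LogisticRegression(C=0.5, solver="lbfgs", max_iter=2000).fit(X, y)
print(clf.coef_.shape)  # one row of weights per class: (3, 8)
```

Each row of coef_ is penalized exactly like a binary model's weight vector; group-lasso variants (not built into LogisticRegression) would instead tie each feature's column across all rows.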


Closing: TL;DR + Challenge

  • Regularized logistic regression = NLL + penalty. L2 for stable, smooth shrinkage; L1 for sparsity; Elastic Net for both worlds.
  • Regularization fixes separation, reduces variance, and encodes prior beliefs (Bayesian view).
  • Standardize your features. Cross-validate lambda. Don’t penalize the intercept.

Quote for the road:

"Regularization is the humility pill for machine learning models — it reminds them they do not know everything."

Challenge: take a high-dimensional classification dataset (p > n) and fit three models: L2, L1, and Elastic Net. Plot coefficient paths vs log(lambda), and report which features survive and how that affects test AUC. Try with and without feature standardization — notice the chaos when you skip it.

Go forth and regularize responsibly.
