Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
Regularized Logistic Regression: Taming Coefs Without Losing Your Mind
"Logistic regression is simple — until your features start auditioning for a reality show and overfitting the training set."
You already saw logistic regression through the probabilistic lens (logit link) and learned to fit it via maximum likelihood. You also met regularization in regression land (Ridge, Lasso, Elastic Net). Now we combine those powers: regularized logistic regression. Same elegant probabilistic model; now with a leash on the parameters so they don't go wild and ruin generalization.
Why regularize logistic regression?
- Because MLE loves to overfit. Maximum likelihood estimation maximizes the training log-likelihood. If you have lots of features, noisy predictors, or near-separable classes, MLE will happily inflate coefficients to fit noise.
- Because coefficients can blow up in separation. When classes are (nearly) linearly separable, MLE may not converge — coefficients diverge to infinity. Regularization saves the optimizer and the model.
- Because we want stable, generalizable decision boundaries. Regularization penalizes complex models and often improves out-of-sample performance.
Think of regularization as a responsible friend who says, "Maybe don’t let your coefficients go all-in on that suspicious-looking feature."
Formulations: Penalized likelihood and constrained view
Starting from negative log-likelihood (NLL) for logistic regression:
NLL(w) = - sum_{i=1}^n [ y_i log p(x_i) + (1-y_i) log(1-p(x_i)) ]
where p(x) = sigma(w^T x + b) and sigma(z) = 1 / (1 + e^{-z}) is the logistic (sigmoid) function.
Penalized (regularized) objective — minimize:
J(w) = NLL(w) + lambda * R(w)
Common choices:
- L2 (Ridge): R(w) = 1/2 * ||w||_2^2 => smooth shrinkage
- L1 (Lasso): R(w) = ||w||_1 => sparsity, feature selection
- Elastic Net: R(w) = alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2 => blends sparsity and shrinkage
Constrained view (equivalent under proper mapping): minimize NLL(w) subject to R(w) <= t. This perspective is useful to think of regularization strength as a budget on model complexity.
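To make the penalized objective concrete, here is a minimal numpy sketch that evaluates J(w) for the three penalties above; the names `penalized_nll`, `lam`, and `alpha` are illustrative (not from any particular library), and the intercept b is left out of the penalty, a point revisited in the tips below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_nll(w, b, X, y, lam, penalty="l2", alpha=0.5):
    """Binary NLL plus a penalty; y in {0, 1}, X of shape (n, p)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12                                # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    if penalty == "l1":
        reg = np.sum(np.abs(w))
    elif penalty == "l2":
        reg = 0.5 * np.sum(w ** 2)
    else:                                      # elastic net
        reg = alpha * np.sum(np.abs(w)) + (1 - alpha) * 0.5 * np.sum(w ** 2)
    return nll + lam * reg                     # the intercept b is not penalized
```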
Bayesian interpretation (quick and delightful)
- L2 penalty corresponds to a Gaussian prior on w (mean 0, variance proportional to 1/lambda). Intuition: we believe weights should be near zero, with normally distributed fluctuations.
- L1 penalty corresponds to a Laplace (double-exponential) prior. Intuition: many weights should be exactly zero (sparse), but some may be large.
So regularization = inserting prior beliefs into the likelihood: maximizing the posterior p(w | data) ∝ likelihood × prior is the same as minimizing NLL(w) - log p(w), and the negative log of a Gaussian (resp. Laplace) prior is exactly the L2 (resp. L1) penalty up to constants, with lambda set by the prior's scale. Probabilistic people nod approvingly.
How regularization changes optimization
Objective to minimize: J(w) = NLL(w) + lambda*R(w)
- For L2: gradient adds lambda * w. Hessian gets lambda on diagonal (very convenient for Newton methods).
- For L1: nondifferentiable at zero — use coordinate descent, proximal methods, or subgradient methods.
If you used IRLS / Newton-Raphson for logistic regression, L2 regularization is a friendly guest: simply add lambda to the diagonal of the Hessian (or add lambda * w to the gradient) when computing updates. L1, less friendly, invites coordinate descent or proximal Newton tricks.
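As a rough illustration of that "add lambda to the diagonal" remark, here is a hedged numpy sketch of a single Newton/IRLS update with an L2 penalty; `newton_step_l2` and the convention that the first column of `X1` is the intercept are assumptions of this sketch, not a library API.

```python
import numpy as np

def newton_step_l2(w, X1, y, lam):
    """One Newton/IRLS update for L2-regularized logistic regression.

    X1 carries a leading column of ones for the intercept; the penalty
    matrix leaves that intercept coefficient unpenalized.
    """
    n, d = X1.shape
    p = 1.0 / (1.0 + np.exp(-(X1 @ w)))        # current probabilities
    P = lam * np.eye(d)
    P[0, 0] = 0.0                              # do not penalize the intercept
    grad = X1.T @ (p - y) / n + P @ w          # NLL gradient plus L2 term
    S = p * (1.0 - p)                          # IRLS weights
    H = (X1.T * S) @ X1 / n + P                # Hessian with lambda on the diagonal
    return w - np.linalg.solve(H, grad)
```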
Pseudocode (batch gradient descent with L2)
```python
# Full-batch gradient descent on the penalized objective; sigmoid is as in
# the sketch above, and X, y, w, b, n, lr, max_iter, lam are assumed defined
# (lam is the strength; "lambda" is a reserved word in Python).
# The intercept b is deliberately left out of the L2 term.
for _ in range(max_iter):
    p = sigmoid(X @ w + b)                  # predicted probabilities
    grad_w = X.T @ (p - y) / n + lam * w    # NLL gradient plus L2 term
    grad_b = np.sum(p - y) / n              # intercept gradient (unpenalized)
    w -= lr * grad_w
    b -= lr * grad_b
```
L1 vs L2 vs Elastic Net — the quick compare table
| Property | L2 (Ridge) | L1 (Lasso) | Elastic Net |
|---|---|---|---|
| Shrinks coefficients | Yes | Yes | Yes |
| Produces exact zeros | No | Yes | Sometimes |
| Works well with correlated features | Yes (distributes weight) | Tends to pick one | Can balance both |
| Optimization | Smooth, convex | Convex, nondifferentiable at 0 | Convex, trickier |
| Bayesian prior | Gaussian | Laplace | Mixture-ish |
Practical tips and gotchas
- Always standardize features before regularization. L1 and L2 penalties are scale-sensitive: a feature measured on a huge scale needs only a tiny coefficient to matter, so it ends up penalized less than features on small scales, for no good reason.
- Choose lambda with cross-validation. Lambda is the thermostat that controls the bias-variance trade-off. Use grid search with k-fold CV or more advanced search (e.g., a log-space grid plus warm starts), and watch class imbalance: use stratified CV. A scikit-learn sketch follows this list.
- Warm starts matter. When computing regularization paths (sequence of lambdas), initialize from previous solution to speed up.
- If you want interpretability, L1 (or Elastic Net) helps. L1 produces sparse models; Elastic Net handles correlated predictors better than pure L1.
- Remember the separation issue. If classes are perfectly separable, MLE diverges; L2 rescues by producing finite coefficients.
- Regularize the bias term carefully. Often do NOT penalize the intercept b. It shifts the decision boundary without increasing model complexity in the same way as slope coefficients.
- Optimization defaults matter. Libraries like scikit-learn offer several solvers (lbfgs, liblinear, saga, ...); choose one that supports your penalty and scales to your data. For example, liblinear and saga handle L1, and saga handles Elastic Net.
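Putting a few of these tips together, here is a minimal scikit-learn sketch (assuming scikit-learn is available and `X_train`, `y_train` exist): standardize inside a pipeline, then cross-validate the regularization strength. Keep in mind scikit-learn parameterizes strength as C = 1/lambda, so small C means strong regularization.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

clf = make_pipeline(
    StandardScaler(),                  # standardize features before penalizing
    LogisticRegressionCV(
        Cs=20,                         # log-spaced grid of C values to try
        cv=5,                          # stratified 5-fold CV for classifiers
        penalty="l2",
        solver="lbfgs",
        scoring="roc_auc",
        max_iter=5000,
    ),
)
clf.fit(X_train, y_train)              # X_train, y_train assumed to exist
```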
Advanced: regularization paths and model selection
- Compute a sequence of lambdas from large (all coefficients ≈ 0) to small (near-MLE) and track coefficient trajectories. These "paths" show how features enter the model (a minimal sketch follows this list).
- Use AIC/BIC cautiously — they assume large-sample properties; CV is more robust for predictive tasks.
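A hedged sketch of such a path for an L1 penalty, refitting along an increasing-C (decreasing-lambda) grid with warm starts; the grid bounds and solver choice here are illustrative, and `X`, `y` are assumed to be standardized already.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

cs = np.logspace(-2, 2, 30)                 # C = 1/lambda; small C = heavy penalty
clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True,
                         max_iter=10_000)
path = []
for c in cs:
    clf.set_params(C=c)
    clf.fit(X, y)                           # warm-started from the previous C
    path.append(clf.coef_.ravel().copy())   # snapshot of the coefficients
path = np.array(path)                       # shape: (len(cs), n_features)
```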
Multiclass extension
For multinomial logistic (softmax), regularization is exactly the same idea: add penalties on the weight matrix. Use L2 for smoothness; use group-lasso variants if you want entire features excluded across all classes.
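For concreteness, a minimal numpy sketch of the L2-penalized softmax objective; `softmax_nll_l2`, the one-hot `Y`, and the shape conventions are assumptions of this sketch, not a standard API.

```python
import numpy as np

def softmax_nll_l2(W, X, Y, lam):
    """W: (n_features, n_classes) weights; Y: one-hot labels of shape (n, n_classes)."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.sum(Y * np.log(probs + 1e-12)) / X.shape[0]
    return nll + 0.5 * lam * np.sum(W ** 2)          # penalty over the whole matrix
```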
Closing: TL;DR + Challenge
- Regularized logistic regression = NLL + penalty. L2 for stable, smooth shrinkage; L1 for sparsity; Elastic Net for both worlds.
- Regularization fixes separation, reduces variance, and encodes prior beliefs (Bayesian view).
- Standardize your features. Cross-validate lambda. Don’t penalize the intercept.
Quote for the road:
"Regularization is the humility pill for machine learning models — it reminds them they do not know everything."
Challenge: take a high-dimensional classification dataset (p > n) and fit three models: L2, L1, and Elastic Net. Plot coefficient paths vs log(lambda), and report which features survive and how that affects test AUC. Try with and without feature standardization — notice the chaos when you skip it.
Go forth and regularize responsibly.