
Supervised Machine Learning: Regression and Classification
Classification I: Logistic Regression and Probabilistic View


Model class probabilities with logistic regression and related probabilistic classifiers.


Link Functions and the Logit — The VIP Pass from Linear Scores to Valid Probabilities

"You can't just spit out a number and call it a probability." — Every careful statistician, ever.

Imagine your linear model is a confident friend who insists on giving you blunt scores: "This person is a 3.2 on the friendliness scale." Cute, but friendship probabilities must live between 0 and 1. Enter link functions — the etiquette coach that transforms raw linear predictions into well-behaved probabilities.

This builds directly on our earlier work with the Bernoulli and Binomial likelihoods (we used that to define the proper likelihood for binary outcomes) and on regularization tricks from Regression II (yes, those ridge and lasso muscles still matter — we'll call them in for backup). Here we show the mathematical costume change: linear predictor -> link -> probability.


Quick reminder (no rehashing the whole course)

  • From the Bernoulli likelihood we know the probability of a binary outcome y ∈ {0,1} is controlled by π(x) = P(y=1 | x).
  • In a generalized linear model (GLM) we posit a linear predictor η(x) = β_0 + β^T x that summarizes evidence from features.
  • A link function g(·) maps the mean (here π) to the linear predictor: g(π) = η. The inverse g^{-1}(η) returns a probability in (0,1).

Why not just model π directly with a linear function? Because a linear function can spit values outside [0,1]. Links fix that.


The Logit: canonical, interpretable, and annoyingly elegant

The logit link is defined as the log of the odds:

  • Odds = π / (1 − π) (ratio of success probability to failure probability)
  • Logit(π) = log(π / (1 − π))

So in logistic regression we assert

logit(π(x)) = η(x) = β_0 + β^T x.

Solving for π gives the sigmoid (aka logistic function):

π(x) = 1 / (1 + exp(−η(x))).

This is the magic: η ∈ R maps to π ∈ (0,1) smoothly and monotonically.
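This inverse relationship can be checked in a few lines — a minimal sketch (illustrative η values only) confirming that logit and sigmoid undo each other and that the sigmoid always lands in (0,1):

```python
import math

def sigmoid(eta):
    """Inverse of the logit: maps any real eta into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Log-odds: maps a probability in (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

# Round trip: logit(sigmoid(eta)) recovers eta
for eta in [-3.0, 0.0, 2.5]:
    p = sigmoid(eta)
    assert 0.0 < p < 1.0                     # always a valid probability
    assert abs(logit(p) - eta) < 1e-9        # the maps are inverses
```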

Why log-odds are useful (interpretability moment)

  • A unit change in feature x_j changes the log-odds by β_j.
  • Equivalently, it multiplies the odds by exp(β_j) — called the odds ratio.

So β_j = log(odds ratio per unit change in x_j). Numeric intuition:

  • If β_j = 0.69, exp(0.69) ≈ 2, so each unit increase in x_j doubles the odds of y=1.
  • If β_j = −0.69, the odds are halved.

This multiplicative property is why social scientists and epidemiologists adore logistic models: you get direct statements about how odds scale.
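As a quick sanity check on the doubling claim (the intercept here is an arbitrary illustrative value, not from any fitted model):

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    return p / (1.0 - p)

beta0, beta_j = 0.5, 0.69        # illustrative coefficients
p0 = sigmoid(beta0)              # probability at x_j = 0
p1 = sigmoid(beta0 + beta_j)     # probability at x_j = 1

ratio = odds(p1) / odds(p0)
print(ratio, math.exp(beta_j))   # both ≈ 1.994: one unit of x_j multiplies the odds by exp(beta_j)
```

The ratio equals exp(β_j) exactly, not just approximately, because odds(sigmoid(η)) = exp(η).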


The GLM/Exponential-family origin — why logit is canonical for Bernoulli

Bernoulli pmf: p(y|π) = π^y (1 − π)^{1−y} can be written in exponential-family form. The natural parameter (canonical parameter) turns out to be

θ = log(π / (1 − π)) — the logit.

That makes the logit the canonical link for Bernoulli GLM. Canonical links often simplify inference and yield tidy score equations (IRLS / Newton-Raphson emerges naturally).
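Those tidy score equations are concrete enough to sketch: the gradient of the negative log-likelihood is Xᵀ(π − y) and the Hessian is XᵀWX with W = diag(π(1−π)), so a plain Newton-Raphson loop is only a few lines. A minimal NumPy illustration on simulated data (the helper name `fit_logistic_newton` and the toy data are ours, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25):
    """Newton-Raphson (equivalently IRLS) for logistic regression.
    X includes an intercept column; y holds 0/1 labels."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1.0 - p)              # IRLS weights: Bernoulli variance at each point
        grad = X.T @ (p - y)           # score equation: gradient of the negative log-lik
        H = X.T @ (X * W[:, None])     # Hessian
        beta -= np.linalg.solve(H, grad)
    return beta

# Simulate from a known model: eta = -1 + 2*x
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.random(500) < sigmoid(-1.0 + 2.0 * x)).astype(float)
X = np.column_stack([np.ones(500), x])

beta_hat = fit_logistic_newton(X, y)
print(beta_hat)   # estimates near the true (-1, 2)
```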


Loss connection: negative log-likelihood = cross-entropy

Fitting logistic regression by maximum likelihood means minimizing the negative log-likelihood for Bernoulli outcomes. For a dataset {(x_i, y_i)} this is:

L(β) = −Σ[y_i log π_i + (1−y_i) log(1−π_i)], where π_i = sigmoid(β^T x_i).

That's exactly the binary cross-entropy (aka log loss) commonly used in machine learning. So when someone says "we're minimizing cross-entropy," they're really saying "we're maximizing Bernoulli likelihood with a logit link."
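The equivalence is easy to verify numerically — a small sketch (coefficients and data points are illustrative) computing the same quantity both ways:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 2.0])                       # illustrative coefficients
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]) # intercept + one feature
y = np.array([0.0, 1.0, 1.0])
p = sigmoid(X @ beta)

# Negative Bernoulli log-likelihood, summed over the dataset
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Binary cross-entropy, written the way ML libraries present it
bce = sum(-np.log(pi) if yi == 1 else -np.log(1 - pi) for pi, yi in zip(p, y))

print(nll, bce)   # identical: one formula, two names
```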


Alternative link functions (yes, variety exists)

  • Logit: g(π) = log(π/(1−π)); inverse g^{-1}(η) = sigmoid(η). Canonical, interpretable odds ratios, the ubiquitous default.
  • Probit: g(π) = Φ^{-1}(π) (inverse normal CDF); inverse g^{-1}(η) = Φ(η) (normal CDF). Natural under latent-variable or Gaussian-process interpretations; often behaves similarly to logit.
  • Cloglog (complementary log-log): g(π) = log(−log(1−π)); inverse g^{-1}(η) = 1 − exp(−exp(η)). Models asymmetric tail behavior; used when extreme events matter.

All of these are monotonic and map (0,1) ↔ R. The choice is often pragmatic: logit is the default, probit gives slightly different tail behavior, and cloglog handles asymmetry.
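To see how the three inverse links differ, a small comparison (using the standard-library `erf` for the normal CDF, so no extra dependencies are assumed):

```python
import math

def sigmoid(eta):            # inverse logit
    return 1.0 / (1.0 + math.exp(-eta))

def norm_cdf(eta):           # inverse probit: standard normal CDF via erf
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):        # inverse complementary log-log
    return 1.0 - math.exp(-math.exp(eta))

for eta in [-2.0, 0.0, 2.0]:
    print(eta, sigmoid(eta), norm_cdf(eta), inv_cloglog(eta))

# All three map R into (0, 1), but cloglog is visibly asymmetric:
# inv_cloglog(0) = 1 - 1/e ≈ 0.632, while sigmoid(0) = norm_cdf(0) = 0.5
```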


Numerical and practical concerns (aka “don’t let your model explode”)

  • Stability: computing log(π) or log(1−π) directly can underflow. Use log-sum-exp-like tricks or libraries' stable implementations.

  • Separation: if classes are perfectly separable, the MLE for logistic regression diverges — coefficients run off to ±∞. Regularization (remember ridge/lasso from Regression II?) rescues you by penalizing large β; L2 (ridge) is the common choice for logistic regression, giving a penalized likelihood.

  • Thresholding: decision threshold isn't fixed at 0.5. Depending on class balance and loss asymmetry, pick threshold by precision/recall tradeoff or ROC analysis.

  • Interpretability vs. predictive power: link choice affects probability calibration but not the linear decision boundary (classification decision boundary remains linear in feature space because η(x) is linear). The probability shape and calibration do change.
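The stability point deserves a concrete look. A minimal sketch of the standard trick: rewrite log(sigmoid(η)) as −logaddexp(0, −η) so it never forms an underflowing sigmoid first (the extreme η values are chosen to force the naive version to fail):

```python
import numpy as np

def log_sigmoid(eta):
    # log(1 / (1 + exp(-eta))) = -log(1 + exp(-eta)) = -logaddexp(0, -eta)
    return -np.logaddexp(0.0, -eta)

eta = np.array([-800.0, -30.0, 0.0, 30.0, 800.0])

# Naive route: exp(800) overflows, the sigmoid underflows to 0, log(0) = -inf
with np.errstate(over='ignore', divide='ignore'):
    naive = np.log(1.0 / (1.0 + np.exp(-eta)))

stable = log_sigmoid(eta)

print(naive)    # -inf at eta = -800
print(stable)   # finite everywhere: log_sigmoid(-800) = -800 exactly
```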


Small numeric example (feel the math, briefly)

Suppose η(x) = −1 + 2 x1, with a single feature x1.

  • For x1 = 0: η = −1 ⇒ π = sigmoid(−1) ≈ 0.269
  • For x1 = 1: η = 1 ⇒ π = sigmoid(1) ≈ 0.731

Odds ratio per unit x1 increase = exp(2) ≈ 7.39. So going from 0→1 multiplies the odds by ~7.4.

Code sketch (Python):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -1.0, 2.0          # eta(x) = -1 + 2*x1
for x in [0, 1]:
    eta = beta0 + beta1 * x
    print(x, sigmoid(eta))        # ~0.269 for x1 = 0, ~0.731 for x1 = 1

Practical checklist when using logit link

  1. Standardize features if you plan to regularize (so penalties act uniformly).
  2. Use L2 for numerical stability and when multicollinearity or separation lurks; use L1 when you want sparsity.
  3. Inspect calibration (reliability plots) — good classification accuracy ≠ calibrated probabilities.
  4. For rare events, consider cloglog or specialized sampling/weighting strategies.
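Item 1 of the checklist takes two lines in practice — a sketch of z-score standardization (the data here is simulated with deliberately mismatched scales; in a real pipeline, fit the scaler on training data only):

```python
import numpy as np

# Two features on wildly different scales: a uniform penalty on beta
# would punish the large-scale feature's coefficient unfairly.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),      # small-scale feature
                     rng.normal(50, 10, 200)])   # large-scale feature

mu, sd = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sd                            # z-score each column

print(X_std.mean(axis=0).round(6))   # ≈ [0, 0]
print(X_std.std(axis=0).round(6))    # ≈ [1, 1]
```

After standardization, an L1 or L2 penalty treats every coefficient on equal footing, and the exponentiated coefficients become odds ratios per standard deviation rather than per raw unit.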

Closing rant — what makes the logit cool

The logit is the elegant bridge between straight-up linear thinking and probabilistic sanity. It gives you interpretability (log-odds and odds ratios), computational convenience (canonical link), and a smooth mapping to probabilities. Paired with regularization from Regression II, logistic regression becomes both stable and explainable — which is a rare combo in ML land.

If linear models are the reliable sedan of machine learning, the logit link is the seat belt and airbags — it keeps your probabilities in the lane and helps you survive the crashes.

Key takeaways:

  • Link functions map means to linear predictors; logit maps probabilities to log-odds.
  • Logit is canonical for Bernoulli: leads to logistic function (sigmoid) and cross-entropy loss.
  • Coefficients are interpretable as changes in log-odds; exponentiated coefficients are odds ratios.
  • Regularization (ridge/lasso) still matters — prevents divergence under separation and improves generalization.

Go forth and link responsibly. Or at least, link well enough that your model's probabilities aren't lying to you.
