Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
Maximum Likelihood Estimation (MLE) — Make the Data Say Itself
"Find the parameter values that make the data you actually saw the most believable." — Your future statistical self
You already know the link function and the logit: we use the sigmoid to turn linear scores into probabilities, and the Bernoulli/Binomial likelihood to express how likely labels are given those probabilities. Now let’s finish the story: how do we pick the parameters of that sigmoid-y machine so the observed labels look most likely? Enter: Maximum Likelihood Estimation (MLE).
What MLE is, in plain (and slightly dramatic) English
- MLE chooses parameters that maximize the probability of the observed data under the model.
- In logistic regression, the model gives us p(y=1 | x, w) = sigma(w^T x), where sigma(t) = 1 / (1 + e^{-t}). (The sigmoid is the inverse of the logit link you've seen.)
- The labels for individual samples are Bernoulli, so the likelihood of a single observation (x_i, y_i) is p(y_i | x_i, w) = sigma(w^T x_i)^{y_i} (1 - sigma(w^T x_i))^{1-y_i}. You saw the Bernoulli likelihood before; now we chain them together across the dataset.
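As a quick sanity check on the single-observation likelihood above, here is a minimal NumPy sketch. The function names and the toy values of w and x are hypothetical, chosen so that w^T x = 0 and the predicted probability is exactly 0.5.

```python
import numpy as np

def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def bernoulli_likelihood(w, x, y):
    """p(y | x, w) for one observation with label y in {0, 1}."""
    p = sigmoid(np.dot(w, x))
    return p**y * (1.0 - p) ** (1 - y)

# Hypothetical toy values, for illustration only.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])  # w^T x = 1.0 - 1.0 = 0, so sigma(w^T x) = 0.5

lik_pos = bernoulli_likelihood(w, x, 1)  # probability of seeing y = 1
lik_neg = bernoulli_likelihood(w, x, 0)  # probability of seeing y = 0
```

The exponents simply select one of the two factors: for y = 1 only sigma(w^T x) survives, for y = 0 only 1 - sigma(w^T x), so the two likelihoods always sum to one.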
Write the likelihood, take the log, do the math (but not too much)
Dataset: (x_i, y_i) for i = 1..n.
Likelihood:
L(w) = Π_{i=1}^n p(y_i | x_i, w) = Π_{i=1}^n sigma(w^T x_i)^{y_i} (1 - sigma(w^T x_i))^{1-y_i}
Log-likelihood (because multiplication is rude):
ℓ(w) = log L(w) = Σ_{i=1}^n [ y_i log sigma(w^T x_i) + (1 - y_i) log (1 - sigma(w^T x_i)) ]
The negative of this log-likelihood, -ℓ(w), is exactly the binary cross-entropy loss (summed over samples) commonly used in ML frameworks. So when you write loss = -sum(y * log(p) + (1-y) * log(1-p)), congratulations: you are doing MLE.
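To make the link between the product form and the cross-entropy form concrete, the sketch below computes both on a tiny made-up dataset and lets you check that the negative log of the product equals the summed cross-entropy. The data, the weights, and the function names are assumptions for illustration only.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def likelihood(w, X, y):
    """L(w): product of per-sample Bernoulli likelihoods."""
    p = sigmoid(X @ w)
    return np.prod(p**y * (1 - p) ** (1 - y))

def neg_log_likelihood(w, X, y):
    """-l(w): summed binary cross-entropy."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: first column is an intercept term.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]])
y = np.array([1, 0, 1, 0])
w = np.array([0.1, 0.8])

L = likelihood(w, X, y)
nll = neg_log_likelihood(w, X, y)
```

On four samples the product is harmless, but on thousands it underflows to zero in floating point, which is the practical reason to work with the sum of logs.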
Optimization: how to get from formula to numbers
We want w_hat = argmax_w ℓ(w) (or equivalently argmin_w -ℓ(w)). Important characteristics:
- The negative log-likelihood for logistic regression is convex in w. That means any local minimum is a global minimum, so optimization is well behaved. One caveat: if the classes are perfectly separable, the minimum is not attained at any finite w (the weights keep growing) unless you regularize.
- Common optimizers:
  - Gradient descent / SGD (works well for large datasets)
  - Newton-Raphson / IRLS (Iteratively Reweighted Least Squares): faster convergence for moderate-size data because it uses curvature (Hessian)
Gradient (vector):
∇ℓ(w) = Σ_{i=1}^n (y_i - p_i) x_i where p_i = sigma(w^T x_i)
Hessian (matrix):
H(w) = - Σ_{i=1}^n p_i (1 - p_i) x_i x_i^T
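A cheap way to trust these formulas is to compare the analytic gradient with a finite-difference estimate and to check that the Hessian of ℓ has only negative eigenvalues (which is what concavity of ℓ, or convexity of -ℓ, looks like numerically). The sketch below does both on random data; the data, seed, and function names are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    """Gradient of l(w): sum_i (y_i - p_i) x_i."""
    return X.T @ (y - sigmoid(X @ w))

def hessian(w, X):
    """Hessian of l(w): -sum_i p_i (1 - p_i) x_i x_i^T."""
    p = sigmoid(X @ w)
    return -(X.T * (p * (1 - p))) @ X

# Hypothetical random data with an intercept column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20)
w = np.array([0.2, -0.4, 0.3])

# Central finite differences along each coordinate direction.
eps = 1e-6
num_grad = np.array([
    (log_likelihood(w + eps * e, X, y) - log_likelihood(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
grad_error = np.max(np.abs(num_grad - gradient(w, X, y)))
max_eig = np.max(np.linalg.eigvalsh(hessian(w, X)))
```

If grad_error is tiny and max_eig is negative, the formulas and the implementation agree.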
Newton update (one step):
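w_new = w_old - H(w_old)^{-1} ∇ℓ(w_old)
(Here H is the Hessian of ℓ, which is negative definite, so this is equivalent to w_new = w_old + (-H(w_old))^{-1} ∇ℓ(w_old), an ascent step on ℓ.)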
IRLS uses this to reframe logistic regression as a sequence of weighted least-squares problems — very 19th-century-math-meets-modern-computing.
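For readers who want the missing algebra, here is a short derivation of how the Newton step turns into a weighted least-squares solve. It assumes the usual matrix notation, which the text has not introduced yet: X stacks the x_i as rows, p is the vector of p_i, and W = diag(p_i (1 - p_i)).

```latex
\begin{aligned}
\nabla \ell(w) &= X^\top (y - p), \qquad H(w) = -X^\top W X \\
w_{\text{new}} &= w - H(w)^{-1} \nabla \ell(w)
               = w + (X^\top W X)^{-1} X^\top (y - p) \\
               &= (X^\top W X)^{-1} X^\top W \left( X w + W^{-1}(y - p) \right)
               = (X^\top W X)^{-1} X^\top W z
\end{aligned}
```

The last line is the normal-equation solution of a least-squares problem with weights W and "working response" z = X w + W^{-1}(y - p), which is why each Newton step can be handed to a weighted least-squares solver.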
MLE, regularization, and the MAP perspective — the bridge to Regression II
Remember when in Regression II we used ridge to keep coefficients from going wild? Regularization has a probabilistic interpretation.
Put a Gaussian prior on w: p(w) ∝ exp(-λ ||w||^2 / 2). Maximizing the posterior p(w | data) is equivalent to maximizing the log-likelihood minus a penalty, ℓ(w) - (λ/2) ||w||^2: this is MAP, not MLE.
For linear regression with Gaussian noise, that means ridge regression = MAP estimate with Gaussian prior. In logistic regression, adding (λ/2) ||w||^2 to the negative log-likelihood is the same idea: keep parameters conservative, prevent overfitting, improve generalization.
Table: Likelihood vs Penalized Likelihood
| Objective | Equivalent to | Effect |
|---|---|---|
| `-ℓ(w)` | MLE | Fit model to data only |
| `-ℓ(w) + (λ/2) ‖w‖^2` | MAP with Gaussian prior | Fit the data while shrinking weights toward zero |
So the previous section on regularization is not an unrelated laundry list — it’s the same probabilistic framework with a little prior spice.
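To see the prior doing its job, the sketch below minimizes the penalized negative log-likelihood with plain gradient descent for a weak and a strong penalty and compares the size of the fitted weights. The synthetic data, step size, iteration count, and function names are all assumptions, and the intercept is penalized along with the other weights purely to keep the code short.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def penalized_nll(w, X, y, lam):
    """-l(w) + (lam / 2) * ||w||^2, the negative log-posterior up to a constant."""
    p = sigmoid(X @ w)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + 0.5 * lam * np.dot(w, w)

def penalized_grad(w, X, y, lam):
    """Gradient of the penalized objective: -X^T (y - p) + lam * w."""
    return -X.T @ (y - sigmoid(X @ w)) + lam * w

def fit_gd(X, y, lam, lr=0.01, steps=5000):
    """Plain gradient descent; lr and steps are hand-picked for this toy data."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * penalized_grad(w, X, y, lam)
    return w

# Hypothetical synthetic data generated from a logistic model.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (X @ np.array([0.3, 1.5, -1.0]) + rng.logistic(size=50) > 0).astype(float)

w_weak = fit_gd(X, y, lam=0.01)    # close to the plain MLE
w_strong = fit_gd(X, y, lam=10.0)  # strong prior pulls weights toward zero
```

Comparing np.linalg.norm(w_weak) with np.linalg.norm(w_strong) shows the shrinkage directly; in practice λ would be tuned by cross-validation rather than picked by hand.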
Intuition, examples, and practical gotchas
Why MLE? Because we want parameters that make the observed labels most probable. It’s like tuning the knobs so the model could have plausibly produced what we actually saw.
Calibration: MLE-trained logistic models give well-calibrated probabilities (often) because they are trained to match observed frequencies. But beware of class imbalance — the likelihood will prioritize fitting the majority class unless you weight or adjust.
Class imbalance fix: weighted log-likelihood
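ℓ_c(w) = Σ_{i=1}^n c_i [ y_i log p_i + (1-y_i) log(1-p_i) ], where c_i > 0 is a per-sample weight (for example, larger for the minority class). It is written c_i rather than w_i to avoid confusion with the parameter vector w.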
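A minimal sketch of the weighted log-likelihood follows. The per-sample weights are passed as sample_weight to keep them distinct from the parameter vector w. The "balanced" weighting rule used here, n / (2 × class count), is one common heuristic and an assumption on my part, not something this lesson prescribes.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def weighted_log_likelihood(w, X, y, sample_weight):
    """Sum of per-sample log-likelihood terms, each scaled by its weight."""
    p = sigmoid(X @ w)
    return np.sum(sample_weight * (y * np.log(p) + (1 - y) * np.log(1 - p)))

def balanced_weights(y):
    """Heuristic: weight each sample by n / (2 * size of its class)."""
    y = np.asarray(y)
    n = len(y)
    n_pos = y.sum()
    n_neg = n - n_pos
    return np.where(y == 1, n / (2.0 * n_pos), n / (2.0 * n_neg))

# Hypothetical imbalanced labels: one positive, seven negatives.
y = np.array([1, 0, 0, 0, 0, 0, 0, 0])
X = np.column_stack([np.ones(8), np.arange(8.0)])
w = np.array([-1.0, 0.2])

c = balanced_weights(y)
wll = weighted_log_likelihood(w, X, y, c)
plain = weighted_log_likelihood(w, X, y, np.ones(8))
```

With these weights both classes contribute the same total weight, so the single positive example counts as much as the seven negatives combined. Keep in mind that reweighting shifts the predicted probabilities, so they no longer match the raw class frequencies.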
Overfitting fix: regularize. (See the bridge above.)
Numerical stability: compute log sigma(t) and log(1 - sigma(t)) carefully (use log-sum-exp tricks or stable sigmoid computations).
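One way to do this, sketched below, relies on the identities log sigma(t) = -log(1 + e^{-t}) and log(1 - sigma(t)) = -log(1 + e^{t}), evaluated with NumPy's np.logaddexp so that large |t| never produces log(0). The function names are hypothetical.

```python
import numpy as np

def log_sigmoid(t):
    """log sigma(t) = -log(exp(0) + exp(-t)), computed without overflow."""
    return -np.logaddexp(0.0, -t)

def log_one_minus_sigmoid(t):
    """log(1 - sigma(t)) = log sigma(-t) = -log(exp(0) + exp(t))."""
    return -np.logaddexp(0.0, t)

def stable_nll(w, X, y):
    """Negative log-likelihood computed from scores, never forming p explicitly."""
    t = X @ w
    return -np.sum(y * log_sigmoid(t) + (1 - y) * log_one_minus_sigmoid(t))

# Extreme scores that would break a naive log(1 / (1 + exp(-t))).
t = np.array([-800.0, 0.0, 800.0])
ls = log_sigmoid(t)
l1ms = log_one_minus_sigmoid(t)
```

A naive implementation would overflow in exp(800) and return -inf or nan at the extremes, while this version stays finite for all three scores.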
Asymptotics and Uncertainty
MLE is not just a point estimate. Under regularity conditions:
- w_hat is consistent and asymptotically normal: for large n, w_hat ~ N(w_true, I(w_true)^{-1} / n) approximately, where I is the Fisher information of a single observation.
- The covariance of the estimate is approximately (-H(w_hat))^{-1}. That’s how you get standard errors and Wald tests for coefficients.
In short: MLE gives you parameters and a built-in way to quantify how sure you are about them.
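Here is a rough sketch of how standard errors fall out of the Hessian. It fits an unregularized model with a few Newton steps on simulated data, inverts the negative Hessian at the estimate, and forms Wald z-statistics. The simulated data, the true coefficients, the fixed iteration count, and the helper name fit_newton are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_newton(X, y, iters=25):
    """Unregularized MLE via plain Newton steps (no line search, toy use only)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p)                  # gradient of l(w)
        neg_hess = (X.T * (p * (1 - p))) @ X  # -H(w) = X^T W X
        w = w + np.linalg.solve(neg_hess, grad)
    return w

# Hypothetical simulation: the third coefficient is truly zero.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w_true = np.array([-0.5, 1.0, 0.0])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_newton(X, y)
p_hat = sigmoid(X @ w_hat)
cov = np.linalg.inv((X.T * (p_hat * (1 - p_hat))) @ X)  # (-H(w_hat))^{-1}
std_err = np.sqrt(np.diag(cov))
wald_z = w_hat / std_err  # test statistic for "this coefficient is zero"
```

A large |wald_z| for a coefficient is evidence against it being zero. These are asymptotic approximations, so on small or nearly separable datasets the bootstrap is a safer choice.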
Algorithm sketch (IRLS / Newton for logistic)
- Initialize w = 0 (or small random)
- Repeat until convergence:
  - Compute p_i = sigma(w^T x_i)
  - Form weight matrix W = diag(p_i (1-p_i))
  - Compute working response z = X w + W^{-1} (y - p)
  - Solve weighted least squares w_new = (X^T W X)^{-1} X^T W z
- Output w
(If n is huge, use SGD instead. If X is poorly scaled, standardize first.)
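The sketch above translates almost line for line into NumPy. The version below is a minimal illustration under a few assumptions: the diagonal of W is floored at a tiny value to avoid dividing by zero, convergence is declared when the weights stop moving, and the test data is simulated. It uses no step-size control, so treat it as a teaching aid rather than a production solver.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls(X, y, max_iter=50, tol=1e-8):
    """Logistic regression by iteratively reweighted least squares."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ w)
        s = np.clip(p * (1 - p), 1e-10, None)  # diagonal of W, floored for safety
        z = X @ w + (y - p) / s                # working response
        XtW = X.T * s                          # X^T W without forming diag(W)
        w_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Hypothetical simulated data from a logistic model.
rng = np.random.default_rng(7)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, -1.0, 2.0]))).astype(float)

w_irls = irls(X, y)
score = X.T @ (y - sigmoid(X @ w_irls))  # gradient of l(w) at the solution
```

At the MLE the gradient of ℓ vanishes, so a score vector close to zero is a simple check that the loop converged. Solving the linear system with np.linalg.solve avoids forming the explicit inverse that the formula suggests.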
Quick checklist before you run MLE on your next dataset
- Standardize features (helps optimization and priors make sense)
- Consider class weights for imbalanced labels
- Add regularization (λ) — tune by cross-validation
- Use appropriate optimizer: SGD for scale, Newton/IRLS for speed on small-medium data
- Compute standard errors via the Hessian or bootstrap if you care about inference
Final mic-drop takeaways
- MLE = choose parameters that maximize the probability of the observed labels. For logistic regression this is exactly what binary cross-entropy does.
- Convex problem → global optimum, lots of nice math, practical stability.
- Regularization = MAP: regularized logistic regression is just Bayesian common sense in optimization clothing.
- Optimization choices: SGD for big data, Newton/IRLS for small-medium. Watch scaling, balance, and numerical stability.
Now go forth: fit models, question assumptions, and when someone asks whether you used MLE, answer with confidence and a faintly theatrical bow.