Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
Bernoulli and Binomial Likelihood
Opening Section
Ever been told that logistic regression is just glorified linear regression with a personality disorder? Good — you are halfway to understanding the probabilistic heart of classification. Building on our previous tour of regularization and Bayesian linear regression, we now leave the Gaussian playground and enter the Bernoulli/Binomial nightclub: outcomes are 0 or 1, and the rules are delightfully discrete.
Why this matters: when your target is a yes or a no, a click or no click, a win or a loss, choosing the right likelihood is what gives you calibrated probabilities and sensible inference. The proper likelihood also gives you a principled loss function, well-behaved gradients for optimization, and natural connections to regularization and Bayesian priors.
The Core Idea: Bernoulli for single trials, Binomial for aggregated trials
- Bernoulli likelihood is the model for a single binary outcome y in {0,1}.
- Binomial likelihood is the model for a count of successes k in n independent Bernoulli trials.
Think of Bernoulli as a single coin flip, and Binomial as the summary result after n flips. In practice, you may use Binomial when your data is aggregated, like "k clicks out of n impressions for this ad".
Bernoulli likelihood, written plainly
In words: the probability of observing y given a success probability mu is mu when y=1 and 1-mu when y=0. Compactly:
p(y | mu) = mu^y * (1 - mu)^(1 - y), y in {0,1}
If mu depends on features x through model parameters w, we write mu = mu(x; w). Logistic regression uses the logistic or sigmoid link:
mu(x; w) = sigma(w^T x) = 1 / (1 + exp(-w^T x))
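A minimal sketch of these two formulas in plain Python, assuming features and weights are ordinary lists. The helper names (sigmoid, bernoulli_pmf, predict_mu) and the numbers are made up for illustration; the two-branch sigmoid is just one common way to avoid overflow for large |z|.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def bernoulli_pmf(y, mu):
    """p(y | mu) = mu^y * (1 - mu)^(1 - y) for y in {0, 1}."""
    return mu ** y * (1.0 - mu) ** (1 - y)

def predict_mu(w, x):
    """mu(x; w) = sigma(w^T x) for plain Python lists w and x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

w = [0.5, -1.0]  # hypothetical weights
x = [2.0, 1.0]   # hypothetical features, chosen so that w^T x = 0
mu = predict_mu(w, x)
print(mu, bernoulli_pmf(1, mu), bernoulli_pmf(0, mu))
```

Because w^T x is zero here, the model sits exactly on the fence: mu is 0.5 and both outcomes are equally likely.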
Binomial likelihood for aggregated counts
If you observe k successes out of n trials at the same covariate x, the likelihood is
p(k | n, mu) = C(n, k) * mu^k * (1 - mu)^(n - k)
where C(n, k) = n! / (k! (n - k)!) is the binomial coefficient. It does not depend on mu, so it drops out as a constant when you estimate parameters.
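To see that the coefficient really is irrelevant for estimation, here is a small sketch that maximizes the likelihood over a grid of mu values, once with C(n, k) and once without. The counts k = 3, n = 10 and the grid are hypothetical; both searches should land on the same mu, namely k / n.

```python
import math

def binomial_pmf(k, n, mu):
    """Full Binomial likelihood, including the binomial coefficient."""
    return math.comb(n, k) * mu ** k * (1.0 - mu) ** (n - k)

def binomial_kernel(k, n, mu):
    """Same likelihood with the mu-independent coefficient dropped."""
    return mu ** k * (1.0 - mu) ** (n - k)

k, n = 3, 10  # hypothetical: 3 successes in 10 trials
grid = [i / 100 for i in range(1, 100)]  # candidate values of mu

best_full = max(grid, key=lambda mu: binomial_pmf(k, n, mu))
best_kernel = max(grid, key=lambda mu: binomial_kernel(k, n, mu))
print(best_full, best_kernel, k / n)
```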
The Log Likelihood and its Lossy Twin, Cross Entropy
For N independent Bernoulli observations (x_i, y_i):
log L(w) = sum_{i=1}^N [ y_i * log mu_i + (1 - y_i) * log(1 - mu_i) ],
where mu_i = sigma(w^T x_i)
We typically minimize the negative log likelihood, which produces the familiar binary cross entropy or logistic loss:
NLL(w) = -sum_{i=1}^N [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ]
So: maximizing the Bernoulli likelihood = minimizing cross entropy. If you have aggregated data (k_i successes in n_i trials) the negative log likelihood becomes
NLL(w) = -sum_i [ k_i log mu_i + (n_i - k_i) log(1 - mu_i) ] + const
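which is a weighted cross entropy: each group contributes n_i times the cross entropy between its observed success rate k_i / n_i and mu_i. The const collects the log C(n_i, k_i) terms, which do not depend on w.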
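The equivalence between the two losses can be checked numerically. The sketch below, with made-up data and weights, computes the Bernoulli NLL on expanded rows and the Binomial NLL (constant dropped) on the aggregated counts; under these assumptions the two numbers should agree up to floating point error. It takes logs directly, so it assumes mu_i never hits exactly 0 or 1.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def bernoulli_nll(w, xs, ys):
    """-sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = sigmoid(dot(w, x))
        total -= y * math.log(mu) + (1 - y) * math.log(1.0 - mu)
    return total

def binomial_nll(w, xs, ks, ns):
    """-sum_i [ k_i log mu_i + (n_i - k_i) log(1 - mu_i) ], constant dropped."""
    total = 0.0
    for x, k, n in zip(xs, ks, ns):
        mu = sigmoid(dot(w, x))
        total -= k * math.log(mu) + (n - k) * math.log(1.0 - mu)
    return total

# Hypothetical toy data: two covariate vectors (first entry is a bias term).
x_a, x_b = [1.0, 0.0], [1.0, 1.0]
agg_xs, agg_ks, agg_ns = [x_a, x_b], [3, 1], [4, 5]

# The same data written out as one row per trial.
row_xs = [x_a] * 4 + [x_b] * 5
row_ys = [1, 1, 1, 0] + [1, 0, 0, 0, 0]

w = [0.2, -0.7]  # arbitrary weights, not a fitted model
nll_rows = bernoulli_nll(w, row_xs, row_ys)
nll_agg = binomial_nll(w, agg_xs, agg_ks, agg_ns)
print(nll_rows, nll_agg)
```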
The statistical skeleton beneath logistic regression is just the Bernoulli/Binomial likelihood wearing a sigmoid suit.
Gradient, Fisher Information, and Why Optimization Is Friendly
Derivative of the log likelihood with respect to w is remarkably clean:
grad_w log L = sum_i (y_i - mu_i) * x_i
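That is the same residual form you know from linear regression, where (up to a noise-variance factor) the gradient is sum_i (y_i - w^T x_i) * x_i; the only change is that the prediction w^T x_i is replaced by the probability mu_i. The Hessian of the log likelihood is -sum_i mu_i (1 - mu_i) * x_i x_i^T, and using it in Newton steps gives the well known Iteratively Reweighted Least Squares (IRLS) algorithm for finding the MLE.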
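If the clean gradient formula feels too good to be true, a finite-difference check is a cheap sanity test. This sketch compares sum_i (y_i - mu_i) * x_i against central differences of the log likelihood; the data, the weights, and the step size eps are all illustrative assumptions.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def log_lik(w, xs, ys):
    """sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = sigmoid(dot(w, x))
        total += y * math.log(mu) + (1 - y) * math.log(1.0 - mu)
    return total

def grad_log_lik(w, xs, ys):
    """Analytic gradient: sum_i (y_i - mu_i) * x_i."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        r = y - sigmoid(dot(w, x))
        for j, xj in enumerate(x):
            g[j] += r * xj
    return g

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        g.append((f(w_plus) - f(w_minus)) / (2.0 * eps))
    return g

# Hypothetical data: bias term plus one feature.
xs = [[1.0, -1.5], [1.0, 0.3], [1.0, 2.0], [1.0, -0.4]]
ys = [0, 1, 1, 0]
w = [0.1, 0.8]

analytic = grad_log_lik(w, xs, ys)
numeric = numeric_grad(lambda v: log_lik(v, xs, ys), w)
print(analytic, numeric)
```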
Why this is useful:
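- The negative log likelihood of logistic regression is convex, so any local minimum is a global one and gradient or Newton methods cannot get trapped. (If the classes are perfectly linearly separable, the minimum is only approached as ||w|| grows without bound, which is one more reason to regularize.)
- IRLS uses weights mu(1 - mu), which peak at mu = 0.5 and shrink toward 0 as mu approaches 0 or 1, so the curvature used in each Newton step is dominated by the points the model is least certain about.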
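Here is a rough sketch of the Newton update behind IRLS for the smallest interesting case: an intercept plus one slope, so the 2x2 Hessian can be inverted by hand. The data set is invented and deliberately not separable, and the fixed count of 10 iterations is an arbitrary choice rather than a proper convergence test.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def grad_and_curvature(w, xs, ys):
    """Gradient of the log likelihood and entries of X^T S X, S = diag(mu(1-mu))."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        mu = sigmoid(w[0] + w[1] * x)
        r = y - mu             # residual y_i - mu_i
        s = mu * (1.0 - mu)    # IRLS weight, largest at mu = 0.5
        g0 += r
        g1 += r * x
        h00 += s
        h01 += s * x
        h11 += s * x * x
    return (g0, g1), (h00, h01, h11)

def newton_step(w, xs, ys):
    """w_new = w + (X^T S X)^{-1} X^T (y - mu), using an explicit 2x2 inverse."""
    (g0, g1), (h00, h01, h11) = grad_and_curvature(w, xs, ys)
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return [w[0] + d0, w[1] + d1]

# Hypothetical, non-separable 1-D data.
xs = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
for _ in range(10):
    w = newton_step(w, xs, ys)

final_grad, _ = grad_and_curvature(w, xs, ys)
print(w, final_grad)
```

On well-behaved data like this, a handful of Newton steps is typically enough for the gradient to vanish to machine precision; on separable data the same loop would run off toward infinite weights.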
Link to Regularization and Bayesian Intuition (building on earlier topics)
Remember Bayesian linear regression where Gaussian noise led to least squares and an L2 penalty arises from a Gaussian prior on weights? The analogy carries over.
- Putting a Gaussian prior w ~ N(0, sigma^2 I) on the logistic weights (sigma^2 here is the prior variance, not the sigmoid) yields a posterior whose mode is the MAP estimate. Minimizing the negative log posterior gives
NLL(w) + (lambda / 2) ||w||^2, with lambda = 1 / sigma^2
which is logistic loss with ridge regularization: a tighter prior (smaller sigma^2) means a stronger penalty. That is how regularization from Regression II plugs directly into classification.
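A small sketch of the MAP idea, assuming plain gradient descent on NLL(w) + (lambda / 2) ||w||^2 with an intercept and one slope. The data, step size, and iteration count are illustrative choices, and the intercept is penalized too because the formula above penalizes all of w. Setting lam = 0 recovers the MLE for comparison.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def penalized_grad(w, xs, ys, lam):
    """Gradient of NLL(w) + (lam / 2) * ||w||^2 for w = [intercept, slope]."""
    g0, g1 = lam * w[0], lam * w[1]
    for x, y in zip(xs, ys):
        mu = sigmoid(w[0] + w[1] * x)
        g0 += mu - y
        g1 += (mu - y) * x
    return g0, g1

def fit(xs, ys, lam, step=0.1, iters=2000):
    """Plain gradient descent; step and iters are hand-picked for this toy data."""
    w = [0.0, 0.0]
    for _ in range(iters):
        g0, g1 = penalized_grad(w, xs, ys, lam)
        w = [w[0] - step * g0, w[1] - step * g1]
    return w

# Hypothetical, non-separable 1-D data.
xs = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w_mle = fit(xs, ys, lam=0.0)  # no prior: maximum likelihood
w_map = fit(xs, ys, lam=1.0)  # lam = 1 / sigma^2 with prior variance 1
print(w_mle, w_map)
```

The MAP slope should come out smaller in magnitude than the MLE slope, which is the prior pulling the weights toward zero.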
For Binomial counts, a conjugate prior on mu is Beta(alpha, beta). Observing k successes in n trials updates alpha -> alpha + k and beta -> beta + n - k, so the posterior mean is (alpha + k) / (alpha + beta + n). This is a smoothing miracle: alpha and beta act as pseudocounts, and the uniform prior Beta(1, 1) reproduces Laplace (add-one) smoothing, (k + 1) / (n + 2). Note: Beta is conjugate to the Binomial for the parameter mu, but not for the logistic regression parameters w, because mu depends on w through a nonlinear sigmoid.
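The conjugate update is short enough to write out in full. This sketch uses hypothetical helper names (beta_update, beta_mean) and a uniform Beta(1, 1) prior, and compares the raw rate k / n with the posterior mean for a low-count case.

```python
def beta_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n trials."""
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical low-count case: 0 successes in 3 trials, uniform Beta(1, 1) prior.
k, n = 0, 3
post_alpha, post_beta = beta_update(1.0, 1.0, k, n)

raw_rate = k / n                             # the MLE, which claims success is impossible
smoothed = beta_mean(post_alpha, post_beta)  # equals (k + 1) / (n + 2)
print(raw_rate, smoothed)
```

With only three trials, the raw estimate of zero is overconfident; the posterior mean keeps a modest, nonzero probability, and the pull of the prior fades as n grows.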
Bayesian logistic regression is possible but not conjugate. We need approximations (Laplace, variational, or MCMC), unlike the closed form we enjoyed for Gaussian linear regression.
Practical Notes, Gotchas, and Intuition
- When to use Binomial instead of repeating Bernoulli rows: if you truly have aggregated counts at the same x, modeling counts with Binomial is more efficient and statistically correct.
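- Imbalanced classes: cross entropy still works, but you may add class weights, or in Binomial setups adjust the trial counts n_i to account for how the data was sampled (for example, if negatives were downsampled). Reweighting shifts the predicted probabilities, so check calibration afterwards.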
- Calibration: logistic regression gives probabilities that are often better calibrated than ad hoc scores, because the model explicitly maximizes a probability-based likelihood.
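- Avoid the linear probability model trap: linear regression on binary labels has heteroskedastic errors (the variance mu(1 - mu) changes with x) and can predict probabilities outside [0, 1]. The Bernoulli likelihood with a sigmoid link fixes both: the variance structure is built into the likelihood, and sigma(.) keeps every prediction strictly between 0 and 1.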
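The linear probability model trap from the last bullet is easy to reproduce. The sketch below fits ordinary least squares to a made-up 1-D binary data set using the closed-form slope and intercept, then pushes the same linear scores through a sigmoid purely to show the range constraint (this is not a fitted logistic model).

```python
import math

def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical data: label flips from 0 to 1 halfway along the x axis.
xs = list(range(10))
ys = [0] * 5 + [1] * 5

intercept, slope = ols_fit(xs, ys)
linear_preds = [intercept + slope * x for x in xs]

# Any real-valued score squashed by a sigmoid lands strictly inside (0, 1).
squashed = [sigmoid(p) for p in linear_preds]
print(min(linear_preds), max(linear_preds))
print(min(squashed), max(squashed))
```

The least squares line dips below 0 at the left edge and climbs above 1 at the right edge, which is exactly the behavior a probability model should not have.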
Quick question to chew on: imagine two groups, each with 10 trials. One has 1 success and the other has 9. If you aggregate across groups you might get misleading estimates. How would Beta priors and Binomial likelihoods help mitigate this? (Hint: smoothing and respecting group-level evidence.)
Examples that Stick
- Spam detection: Each email is a Bernoulli trial, mu(x) is probability email is spam. Fit logistic regression by minimizing cross entropy.
- Ad CTR aggregated logs: For each ad and day you observe k clicks out of n impressions. Fit a model with Binomial likelihood; use Beta priors for smoothing low-impression ads.
- Medical trials: number of recovery events k in n patients with treatment x. Binomial likelihood is natural and interpretable.
Closing Section: Key Takeaways and the One-Liner You Will Repeat
- Bernoulli models single binary outcomes; Binomial models counts of successes across multiple independent trials.
- Logistic regression arises from using a Bernoulli (or Binomial) likelihood with a logistic link mu = sigma(w^T x).
- The Bernoulli negative log likelihood is exactly the binary cross entropy; its gradient is tidy and the objective is convex, so optimization is tractable.
- Regularization = prior. Ridge on logistic regression = Gaussian prior on w. Beta prior for Binomial = smoothing and pseudocounts.
One-liner to say with authority: Use the Bernoulli/Binomial likelihood because it gives you the right loss, the right gradients, and the right way to fold in prior knowledge. Everything else is just a workaround.
Go forth and model your binary outcomes like a pro, and when someone suggests OLS for binary data, look at them with deep, scholarly pity.