Supervised Machine Learning: Regression and Classification


Classification I: Logistic Regression and Probabilistic View


Model class probabilities with logistic regression and related probabilistic classifiers.


Bernoulli and Binomial Likelihood

Logistic Love: Bernoulli & Binomial Likelihood (Sassy TA Edition)


gpt-5-mini


Opening Section

Ever been told that logistic regression is just glorified linear regression with a personality disorder? Good: you are halfway to understanding the probabilistic heart of classification. Building on our previous tour of regularization and Bayesian linear regression, we now leave the Gaussian playground and enter the Bernoulli/Binomial nightclub: outcomes are 0 or 1, and the rules are delightfully discrete.

Why this matters: when your target is a yes or a no, a click or no click, a win or a loss, modeling the data with the right likelihood is not optional if you want calibrated probabilities and sensible inference. Plus, using the proper likelihood gives you principled loss functions, nice gradients for optimization, and natural connections to regularization and Bayesian priors.

The Core Idea: Bernoulli for single trials, Binomial for aggregated trials

Bernoulli likelihood is the model for a single binary outcome y in {0,1}.
Binomial likelihood is the model for a count of successes k in n independent Bernoulli trials.

Think of Bernoulli as a single coin flip, and Binomial as the summary result after n flips. In practice, you may use Binomial when your data is aggregated, like "k clicks out of n impressions for this ad".

Bernoulli likelihood, written plainly

In words: the probability of observing y given a success probability mu is mu when y=1 and 1-mu when y=0. Compactly:

p(y | mu) = mu^y * (1 - mu)^(1 - y),    y in {0,1}

If mu depends on features x through model parameters w, we write mu = mu(x; w). Logistic regression uses the logistic or sigmoid link:

mu(x; w) = sigma(w^T x) = 1 / (1 + exp(-w^T x))
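To make the two formulas above concrete, here is a minimal sketch in plain Python. The weights and feature values are made up purely for illustration, and the function names (sigmoid, bernoulli_pmf) are our own, not from any library.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bernoulli_pmf(y, mu):
    """p(y | mu) = mu^y * (1 - mu)^(1 - y) for y in {0, 1}."""
    return mu ** y * (1.0 - mu) ** (1 - y)

# Hypothetical weights and features, chosen only for illustration.
w = [0.5, -1.0]
x = [2.0, 1.0]

z = sum(wj * xj for wj, xj in zip(w, x))  # w^T x
mu = sigmoid(z)                           # success probability for this x
p1 = bernoulli_pmf(1, mu)                 # probability of observing y = 1
p0 = bernoulli_pmf(0, mu)                 # probability of observing y = 0
```

Whatever w and x you plug in, p1 and p0 sum to one, which is exactly what makes mu a usable probability rather than an arbitrary score.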

Binomial likelihood for aggregated counts

If you observe k successes out of n trials at the same covariate x, the likelihood is

p(k | n, mu) = C(n, k) * mu^k * (1 - mu)^(n - k)

where C(n, k) is the binomial coefficient "n choose k". It does not depend on mu, so you can drop it when estimating parameters: it shifts the log likelihood by a constant without moving the maximum.
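A quick way to convince yourself that the combinatorial term is harmless is to maximize the log likelihood over a grid of mu values with and without it. The sketch below assumes made-up counts (7 successes in 20 trials) and a hypothetical helper, binomial_log_lik.

```python
import math

def binomial_log_lik(k, n, mu, include_coeff=True):
    """log p(k | n, mu), optionally without the log C(n, k) term."""
    ll = k * math.log(mu) + (n - k) * math.log(1.0 - mu)
    if include_coeff:
        ll += math.log(math.comb(n, k))
    return ll

# Made-up counts for illustration.
k, n = 7, 20

# Candidate values of mu strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]

best_with = max(grid, key=lambda m: binomial_log_lik(k, n, m, True))
best_without = max(grid, key=lambda m: binomial_log_lik(k, n, m, False))
```

Both searches land on the same grid point, which is also the sample proportion k / n, the maximum likelihood estimate of mu.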

The Log Likelihood and its Lossy Twin, Cross Entropy

For N independent Bernoulli observations (x_i, y_i):

log L(w) = sum_{i=1}^N [ y_i * log mu_i + (1 - y_i) * log(1 - mu_i) ],
where mu_i = sigma(w^T x_i)

We typically minimize the negative log likelihood, which produces the familiar binary cross entropy or logistic loss:

NLL(w) = -sum_{i=1}^N [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ]

So: maximizing the Bernoulli likelihood = minimizing cross entropy. If you have aggregated data (k_i successes in n_i trials), the negative log likelihood becomes

NLL(w) = -sum_i [ k_i log mu_i + (n_i - k_i) log(1 - mu_i) ] + const

where const collects the -log C(n_i, k_i) terms, which do not depend on w. This is just a weighted cross entropy between the observed proportion k_i / n_i and mu_i, with the number of trials n_i as the weight.
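The equivalence between row-level and aggregated data is easy to check numerically. The sketch below uses invented counts and fixed predicted probabilities (no model fitting), expands each group into individual 0/1 rows, and compares the three ways of writing the loss. All function names are our own.

```python
import math

def bernoulli_nll(ys, mus):
    """Binary cross entropy summed over individual 0/1 observations."""
    return -sum(y * math.log(m) + (1 - y) * math.log(1 - m)
                for y, m in zip(ys, mus))

def binomial_nll(ks, ns, mus):
    """Aggregated NLL with the constant log C(n, k) terms dropped."""
    return -sum(k * math.log(m) + (n - k) * math.log(1 - m)
                for k, n, m in zip(ks, ns, mus))

def weighted_cross_entropy(ks, ns, mus):
    """n_i times the cross entropy between proportion k_i / n_i and mu_i."""
    total = 0.0
    for k, n, m in zip(ks, ns, mus):
        p = k / n
        total += n * -(p * math.log(m) + (1 - p) * math.log(1 - m))
    return total

# Made-up aggregated data: successes, trials, and predicted mu per group.
ks, ns, mus = [3, 1], [5, 4], [0.6, 0.2]

# Expand each group into individual rows that share the group's mu.
ys_rows, mus_rows = [], []
for k, n, m in zip(ks, ns, mus):
    ys_rows += [1] * k + [0] * (n - k)
    mus_rows += [m] * n

nll_rows = bernoulli_nll(ys_rows, mus_rows)
nll_agg = binomial_nll(ks, ns, mus)
nll_weighted = weighted_cross_entropy(ks, ns, mus)
```

The three numbers agree up to floating-point rounding, so aggregating rows changes the bookkeeping, not the fitted model.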

The statistical skeleton beneath logistic regression is just the Bernoulli/Binomial likelihood wearing a sigmoid suit.

Gradient, Fisher Information, and Why Optimization Is Friendly

The gradient of the log likelihood with respect to w is remarkably clean:

grad_w log L = sum_i (y_i - mu_i) * x_i

That is the same residual-times-feature form you know from linear regression, except the prediction is now mu_i = sigma(w^T x_i) rather than a linear y_hat. The Hessian of the log likelihood is -sum_i mu_i (1 - mu_i) x_i x_i^T, which leads to the well-known Iteratively Reweighted Least Squares (IRLS) algorithm for finding MLEs via Newton steps.

Why this is useful:

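You get a convex negative log likelihood for logistic regression, so any local minimum is a global one and gradient-based or Newton optimizers cannot get stuck in a spurious optimum.
IRLS uses weights mu(1 - mu), which peak at mu = 0.5 and shrink toward zero as mu approaches 0 or 1, so each Newton step is shaped mostly by the points the model is least sure about.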
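Here is a minimal sketch of the Newton (IRLS) update for the smallest interesting case: an intercept plus one feature, so the 2x2 linear system can be solved by hand. The data are invented and deliberately not linearly separable, and there is no step-size control, so treat this as an illustration rather than a production optimizer.

```python
import math

def sigmoid(z):
    # Plain form; fine for the small |z| values in this toy example.
    return 1.0 / (1.0 + math.exp(-z))

def newton_step(w, xs, ys):
    """One Newton (IRLS) update for intercept w[0] and slope w[1].

    Returns the updated weights and the gradient evaluated at the input w.
    """
    g0 = g1 = 0.0           # gradient of the log likelihood
    h00 = h01 = h11 = 0.0   # entries of X^T S X, S = diag(mu * (1 - mu))
    for x, y in zip(xs, ys):
        mu = sigmoid(w[0] + w[1] * x)
        r = y - mu            # residual
        s = mu * (1.0 - mu)   # IRLS weight, largest when mu is near 0.5
        g0 += r
        g1 += r * x
        h00 += s
        h01 += s * x
        h11 += s * x * x
    # Solve (X^T S X) d = g with Cramer's rule, then move to w + d.
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return [w[0] + d0, w[1] + d1], (g0, g1)

# Toy, non-separable data invented for illustration.
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
for _ in range(25):
    w, _ = newton_step(w, xs, ys)

_, grad = newton_step(w, xs, ys)  # gradient at the final w
```

At the end, both gradient components are numerically zero, meaning sum_i (y_i - mu_i) x_i = 0: the residuals are orthogonal to the features, just like the normal equations in least squares.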

Link to Regularization and Bayesian Intuition (building on earlier topics)

Remember Bayesian linear regression where Gaussian noise led to least squares and an L2 penalty arises from a Gaussian prior on weights? The analogy carries over.

Putting a Gaussian prior w ~ N(0, sigma^2 I) on the logistic weights yields a posterior whose mode is the MAP estimate. Minimizing the negative log posterior gives, up to an additive constant,

NLL(w) + (lambda / 2) ||w||^2,    lambda = 1 / sigma^2

which is logistic loss with ridge regularization. A tighter prior (smaller sigma^2) means a larger lambda and stronger shrinkage. That is how regularization from Regression II plugs directly into classification.
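To see the prior earn its keep, consider a sketch with a single weight and perfectly separable toy data. Without a penalty the likelihood keeps improving as the weight grows, so gradient descent never settles; with a ridge term the MAP estimate is finite. The data, learning rate, step counts, and the fit_map helper are all assumptions made for this illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def penalized_grad(w, xs, ys, lam):
    """Gradient of NLL(w) + (lam / 2) * w^2 for a single weight w."""
    grad_nll = -sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys))
    return grad_nll + lam * w

def fit_map(xs, ys, lam, lr=0.1, steps=2000):
    """Plain gradient descent on the penalized objective, starting at 0."""
    w = 0.0
    for _ in range(steps):
        w -= lr * penalized_grad(w, xs, ys, lam)
    return w

# Perfectly separable toy data (invented): all negatives left of zero.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_mle_short = fit_map(xs, ys, lam=0.0, steps=200)   # no prior, few steps
w_mle_long = fit_map(xs, ys, lam=0.0, steps=2000)   # no prior, more steps
w_map = fit_map(xs, ys, lam=1.0, steps=2000)        # Gaussian prior, lam = 1
```

The unpenalized weight is larger after 2000 steps than after 200 and would keep creeping upward, while the penalized fit stops at a modest value where the data gradient and the pull of the prior balance.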

For Binomial counts, a conjugate prior on mu is Beta(alpha, beta). Observing k successes in n trials updates alpha -> alpha + k and beta -> beta + n - k, so the posterior mean is (alpha + k) / (alpha + beta + n). This is a smoothing miracle: alpha and beta act as pseudocounts, and Beta(1, 1) reproduces add-one (Laplace) smoothing. Note: Beta is conjugate to Binomial for the parameter mu, but not to the logistic regression parameters w, because mu depends on w through a nonlinear sigmoid.
Bayesian logistic regression is possible but not conjugate. We need approximations (Laplace, variational, or MCMC), unlike the closed form we enjoyed for Gaussian linear regression.
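The Beta pseudocount update described above fits in a few lines. This sketch assumes a uniform Beta(1, 1) prior and a made-up observation of 0 successes in 3 trials, the kind of tiny sample where the raw proportion is least trustworthy. The helper names are our own.

```python
def beta_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n trials."""
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Uniform Beta(1, 1) prior; made-up data: 0 successes out of 3 trials.
k, n = 0, 3
a_post, b_post = beta_update(1.0, 1.0, k, n)

raw_rate = k / n                       # maximum likelihood estimate
smoothed = beta_mean(a_post, b_post)   # (k + 1) / (n + 2)
```

The raw estimate says the event never happens, while the posterior mean hedges at 1 / 5. As n grows, the pseudocounts matter less and the two estimates converge.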

Practical Notes, Gotchas, and Intuition

When to use Binomial instead of repeating Bernoulli rows: if you truly have aggregated counts at the same x, modeling counts with Binomial is more efficient and statistically correct.
Imbalanced classes: cross entropy still works, but you may add class weights, or rescale n_i in Binomial setups, to reflect how the data were sampled.
Calibration: logistic regression gives probabilities that are often better calibrated than ad hoc scores, because the model explicitly maximizes a probability-based likelihood.
Avoid the linear probability model trap: linear regression on binary labels leads to heteroskedastic errors and predicted probabilities outside [0, 1]. Bernoulli likelihood fixes both problems.
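The linear probability model trap from the last point is easy to reproduce. The sketch below fits ordinary least squares to six invented binary labels using the textbook one-feature formulas, then evaluates the fitted line at the smallest and largest training inputs. The ols_fit helper is our own.

```python
def ols_fit(xs, ys):
    """Intercept and slope of simple least squares for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented binary labels that switch from 0 to 1 as x increases.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

b0, b1 = ols_fit(xs, ys)
pred_at_min = b0 + b1 * min(xs)   # fitted "probability" at x = 0
pred_at_max = b0 + b1 * max(xs)   # fitted "probability" at x = 5
```

Even on the training points themselves, the fitted line dips below 0 at one end and rises above 1 at the other, so its outputs cannot be read as probabilities. Passing the linear score through a sigmoid removes the problem by construction.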

Quick question to chew on: imagine two groups, each with 10 trials. One has 1 success and the other has 9. If you aggregate across groups you might get misleading estimates. How would Beta priors and Binomial likelihoods help mitigate this? (Hint: smoothing and respecting group-level evidence.)

Examples that Stick

Spam detection: each email is a Bernoulli trial, and mu(x) is the probability that the email is spam. Fit logistic regression by minimizing cross entropy.
Ad CTR aggregated logs: For each ad and day you observe k clicks out of n impressions. Fit a model with Binomial likelihood; use Beta priors for smoothing low-impression ads.
Medical trials: number of recovery events k in n patients with treatment x. Binomial likelihood is natural and interpretable.

Closing Section: Key Takeaways and the One-Liner You Will Repeat

Bernoulli models single binary outcomes; Binomial models counts of successes across multiple independent trials.
Logistic regression arises from using a Bernoulli (or Binomial) likelihood with a logistic link mu = sigma(w^T x).
The negative log likelihood equals binary cross entropy; its gradient is tidy and the loss is convex, so optimization is tractable.
Regularization = prior. Ridge on logistic regression = Gaussian prior on w. Beta prior for Binomial = smoothing and pseudocounts.

One-liner to say with authority: Use the Bernoulli/Binomial likelihood because it gives you the right loss, the right gradients, and the right way to fold in prior knowledge. Everything else is just a workaround.

Go forth and model your binary outcomes like a pro, and when someone suggests OLS for binary data, look at them with deep, scholarly pity.
