Supervised Machine Learning: Regression and Classification
Classification I: Logistic Regression and Probabilistic View

Model class probabilities with logistic regression and related probabilistic classifiers.


Maximum Likelihood Estimation (MLE) — Make the Data Say Itself

"Find the parameter values that make the data you actually saw the most believable." — Your future statistical self

You already know the link function and the logit: we use the sigmoid to turn linear scores into probabilities, and the Bernoulli/Binomial likelihood to express how likely labels are given those probabilities. Now let’s finish the story: how do we pick the parameters of that sigmoid-y machine so the observed labels look most likely? Enter: Maximum Likelihood Estimation (MLE).


What MLE is, in plain (and slightly dramatic) English

  • MLE chooses parameters that maximize the probability of the observed data under the model.
  • In logistic regression, the model gives us p(y=1 | x, w) = sigma(w^T x) where sigma(t) = 1 / (1 + e^{-t}). (This is the logit/link you've seen.)
  • The labels for individual samples are Bernoulli, so the likelihood of a single observation (x_i, y_i) is p(y_i | x_i, w) = sigma(w^T x_i)^{y_i} (1 - sigma(w^T x_i))^{1-y_i}. You saw the Bernoulli likelihood before; now we chain them together across the dataset.
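These pieces translate directly into a few lines of NumPy. A tiny sketch (the weights, features, and label below are made-up illustrative numbers):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# One observation: features x, label y, and candidate parameters w.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
y = 1

p = sigmoid(w @ x)                        # p(y=1 | x, w); here w^T x = 0.0
likelihood = p**y * (1 - p)**(1 - y)      # Bernoulli likelihood of (x, y)

print(p)  # sigmoid(0.0) = 0.5
```

With y = 1 the likelihood is just p itself; the exponent trick in the formula only selects p or 1 - p depending on the label.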

Write the likelihood, take the log, do the math (but not too much)

Dataset: (x_i, y_i) for i = 1..n.

Likelihood:

L(w) = Π_{i=1}^n p(y_i | x_i, w) = Π_{i=1}^n sigma(w^T x_i)^{y_i} (1 - sigma(w^T x_i))^{1-y_i}

Log-likelihood (because multiplication is rude):

ℓ(w) = log L(w) = Σ_{i=1}^n [ y_i log sigma(w^T x_i) + (1 - y_i) log (1 - sigma(w^T x_i)) ]

The negative of this log-likelihood, -ℓ(w), is exactly the binary cross-entropy loss commonly used in ML frameworks. So when you write loss = -sum(y * log(p) + (1-y) * log(1-p)), congratulations: you are doing MLE.
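A quick numeric sanity check of that claim, with made-up labels and predicted probabilities:

```python
import numpy as np

# Made-up labels and predicted probabilities for five samples.
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # log-likelihood given these p_i
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))      # framework-style cross-entropy

print(np.isclose(bce, -log_lik))  # True: minimizing BCE = maximizing likelihood
```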


Optimization: how to get from formula to numbers

We want w_hat = argmax_w ℓ(w) (or equivalently argmin_w -ℓ(w)). Important characteristics:

  • The negative log-likelihood for logistic regression is convex in w. Any local minimum is therefore global, and (barring perfectly separable data, where no finite minimizer exists) optimization is stable — breathe easy.
  • Common optimizers:
    • Gradient descent / SGD (works well for large datasets)
    • Newton-Raphson / IRLS (Iteratively Reweighted Least Squares) — faster convergence for moderate-size data because it uses curvature (Hessian)

Gradient (vector):

∇ℓ(w) = Σ_{i=1}^n (y_i - p_i) x_i  where p_i = sigma(w^T x_i)

Hessian (matrix):

H(w) = - Σ_{i=1}^n p_i (1 - p_i) x_i x_i^T

Newton update (one step, maximizing ℓ):

w_new = w_old - H(w_old)^{-1} ∇ℓ(w_old)

(Because H is negative definite, this step moves uphill on ℓ.)

IRLS uses this to reframe logistic regression as a sequence of weighted least-squares problems — very 19th-century-math-meets-modern-computing.
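Both displayed formulas drop straight into NumPy. A minimal Newton-Raphson fit on synthetic data (the data, seed, and iteration count are all illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Synthetic dataset drawn from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)

w = np.zeros(3)
for _ in range(10):                        # Newton-Raphson iterations
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)                   # grad ell(w) = sum (y_i - p_i) x_i
    H = -(X.T * (p * (1 - p))) @ X         # H(w) = -sum p_i (1-p_i) x_i x_i^T
    w = w - np.linalg.solve(H, grad)       # w_new = w - H^{-1} grad ell

print(w)  # estimates land near true_w, up to sampling noise
```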


MLE, regularization, and the MAP perspective — the bridge to Regression II

Remember when in Regression II we used ridge to keep coefficients from going wild? Regularization has a probabilistic interpretation.

  • Put a Gaussian prior on w: p(w) ∝ exp(-λ ||w||^2 / 2). Maximizing the posterior p(w | data) is then equivalent to maximizing the penalized log-likelihood ℓ(w) - λ ||w||^2 / 2: this is MAP, not MLE.

  • That means ridge regression = MAP estimate with Gaussian prior. In logistic regression, adding λ ||w||^2 to the loss is the same idea — keep parameters conservative, prevent overfitting, improve generalization.

Table: Likelihood vs Penalized Likelihood

Objective            | Equivalent to            | Effect
-ℓ(w)                | MLE                      | Fit the model to the data only
-ℓ(w) + λ ||w||^2    | MAP with Gaussian prior  | Fit the data while shrinking coefficients

So the previous section on regularization is not an unrelated laundry list — it’s the same probabilistic framework with a little prior spice.
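To see the shrinkage in action, here is a minimal sketch (the data, λ values, learning rate, and the fit helper are all made up for illustration) that minimizes the penalized negative log-likelihood by gradient descent and compares coefficient norms:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit(X, y, lam, steps=2000, lr=0.1):
    """Gradient descent on the average NLL plus lam * ||w||^2 (the MAP objective)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = -X.T @ (y - p) / n + 2 * lam * w   # gradient of NLL/n + lam * ||w||^2
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < sigmoid(X @ np.array([2.0, -1.0, 0.5, 0.0]))).astype(float)

w_mle = fit(X, y, lam=0.0)    # plain MLE
w_map = fit(X, y, lam=0.5)    # Gaussian prior: shrunken coefficients

print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # True: the prior shrinks w
```

The same idea is exposed in libraries such as scikit-learn, where LogisticRegression's C parameter plays the role of an inverse regularization strength (roughly 1/λ).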


Intuition, examples, and practical gotchas

  • Why MLE? Because we want parameters that make the observed labels most probable. It’s like tuning the knobs so the model could have plausibly produced what we actually saw.

  • Calibration: MLE-trained logistic models often give well-calibrated probabilities, because training matches predicted probabilities to observed label frequencies. But beware of class imbalance — the likelihood will prioritize fitting the majority class unless you weight or adjust.

  • Class imbalance fix: a weighted log-likelihood, with a per-sample weight c_i (e.g., larger for minority-class samples):

ℓ(w) = Σ_{i=1}^n c_i [ y_i log p_i + (1-y_i) log(1-p_i) ]
  • Overfitting fix: regularize. (See the bridge above.)

  • Numerical stability: compute log sigma and log(1 - sigma) carefully (use log-sum-exp tricks or stable sigmoid computations).
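On that last point, the identity log σ(t) = -log(1 + e^{-t}) lets you lean on a log-sum-exp primitive such as NumPy's logaddexp instead of ever forming e^{-t} on its own. A sketch:

```python
import numpy as np

def log_sigmoid(t):
    # log(sigma(t)) = -log(1 + exp(-t)), computed via log-sum-exp so it
    # never overflows; log(1 - sigma(t)) is just log_sigmoid(-t).
    return -np.logaddexp(0.0, -t)

t = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])

with np.errstate(over="ignore", divide="ignore"):
    naive = np.log(1.0 / (1.0 + np.exp(-t)))  # overflows: -inf at t = -1000

stable = log_sigmoid(t)
print(np.isfinite(stable).all())  # True: finite everywhere, even at t = -1000
```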


Asymptotics and Uncertainty

MLE is not just a point estimate. Under standard regularity conditions:

  • w_hat is consistent and asymptotically normal: w_hat ~ N(w_true, I(w_true)^{-1} / n), where I is the Fisher information.
  • The covariance of the estimate is approximately (-H(w_hat))^{-1}. That’s how you get standard errors and Wald tests for coefficients.

In short: MLE gives you parameters and a built-in way to quantify how sure you are about them.
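As a sketch of how that works in practice (toy data and a quick Newton fit, all values illustrative): invert the observed information -H at w_hat and read standard errors off its diagonal.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy data: an intercept column plus two standard-normal features.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = (rng.random(500) < sigmoid(X @ np.array([0.5, 1.0, -1.5]))).astype(float)

# Fit w_hat with a few Newton steps (converges quickly here).
w = np.zeros(3)
for _ in range(15):
    p = sigmoid(X @ w)
    H = -(X.T * (p * (1 - p))) @ X            # Hessian of ell(w)
    w -= np.linalg.solve(H, X.T @ (y - p))    # w - H^{-1} grad ell

# Asymptotic covariance: inverse of the observed information (-H) at w_hat.
p = sigmoid(X @ w)
cov = np.linalg.inv((X.T * (p * (1 - p))) @ X)
se = np.sqrt(np.diag(cov))                    # standard error per coefficient
z = w / se                                    # Wald z-statistics

print(np.round(se, 3))
```

A 95% Wald interval for each coefficient is then w_hat ± 1.96 · se.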


Algorithm sketch (IRLS / Newton for logistic)

  1. Initialize w = 0 (or small random)
  2. Repeat until convergence:
    • Compute p_i = sigma(w^T x_i)
    • Form weight matrix W = diag(p_i (1-p_i))
    • Compute working response z = X w + W^{-1} (y - p)
    • Solve weighted least squares w_new = (X^T W X)^{-1} X^T W z
  3. Output w

(If n is huge, use SGD instead. If X is poorly scaled, standardize first.)
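The sketch above, written out in NumPy (toy data; the solve-based weighted least squares is one of several equivalent implementations):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls(X, y, n_iter=15):
    """IRLS for logistic regression, following the sketch above."""
    w = np.zeros(X.shape[1])                  # step 1: initialize
    for _ in range(n_iter):                   # step 2: iterate
        p = sigmoid(X @ w)
        s = p * (1 - p)                       # diagonal of the weight matrix W
        z = X @ w + (y - p) / s               # working response X w + W^{-1}(y - p)
        # Weighted least squares: w = (X^T W X)^{-1} X^T W z
        w = np.linalg.solve((X.T * s) @ X, X.T @ (s * z))
    return w                                  # step 3: output w

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = (rng.random(400) < sigmoid(X @ np.array([1.0, -0.5, 2.0]))).astype(float)

print(irls(X, y))  # estimates near [1.0, -0.5, 2.0], up to sampling noise
```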


Quick checklist before you run MLE on your next dataset

  • Standardize features (helps optimization and priors make sense)
  • Consider class weights for imbalanced labels
  • Add regularization (λ) — tune by cross-validation
  • Use appropriate optimizer: SGD for scale, Newton/IRLS for speed on small-medium data
  • Compute standard errors via the Hessian or bootstrap if you care about inference

Final mic-drop takeaways

  • MLE = choose parameters that maximize the probability of the observed labels. For logistic regression this is exactly what binary cross-entropy does.
  • Convex problem → global optimum, lots of nice math, practical stability.
  • Regularization = MAP: regularized logistic regression is just Bayesian common sense in optimization clothing.
  • Optimization choices: SGD for big data, Newton/IRLS for small-medium. Watch scaling, balance, and numerical stability.

Now go forth: fit models, question assumptions, and when someone asks whether you used MLE, answer with confidence and a faintly theatrical bow.
