Supervised Machine Learning: Regression and Classification
Classification I: Logistic Regression and Probabilistic View

Model class probabilities with logistic regression and related probabilistic classifiers.


Maximum Likelihood Estimation (MLE) — Make the Data Say Itself

"Find the parameter values that make the data you actually saw the most believable." — Your future statistical self

You already know the link function and the logit: we use the sigmoid to turn linear scores into probabilities, and the Bernoulli/Binomial likelihood to express how likely labels are given those probabilities. Now let’s finish the story: how do we pick the parameters of that sigmoid-y machine so the observed labels look most likely? Enter: Maximum Likelihood Estimation (MLE).


What MLE is, in plain (and slightly dramatic) English

  • MLE chooses parameters that maximize the probability of the observed data under the model.
  • In logistic regression, the model gives us p(y=1 | x, w) = sigma(w^T x) where sigma(t) = 1 / (1 + e^{-t}). (This is the logit/link you've seen.)
  • The labels for individual samples are Bernoulli, so the likelihood of a single observation (x_i, y_i) is p(y_i | x_i, w) = sigma(w^T x_i)^{y_i} (1 - sigma(w^T x_i))^{1-y_i}. You saw the Bernoulli likelihood before; now we chain them together across the dataset.
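These pieces translate directly into a few lines of NumPy. A tiny sketch (the weights, features, and label below are made-up illustrative numbers):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# One observation: features x, label y, and candidate parameters w.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
y = 1

p = sigmoid(w @ x)                        # p(y=1 | x, w); here w^T x = 0.0
likelihood = p**y * (1 - p)**(1 - y)      # Bernoulli likelihood of (x, y)

print(p)  # sigmoid(0.0) = 0.5
```

With y = 1 the likelihood is just p itself; the exponent trick in the formula only selects p or 1 - p depending on the label.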

Write the likelihood, take the log, do the math (but not too much)

Dataset: (x_i, y_i) for i = 1..n.

Likelihood:

L(w) = Π_{i=1}^n p(y_i | x_i, w) = Π_{i=1}^n sigma(w^T x_i)^{y_i} (1 - sigma(w^T x_i))^{1-y_i}

Log-likelihood (because multiplication is rude):

ℓ(w) = log L(w) = Σ_{i=1}^n [ y_i log sigma(w^T x_i) + (1 - y_i) log (1 - sigma(w^T x_i)) ]

The negative of this log-likelihood, -ℓ(w), is exactly the binary cross-entropy loss commonly used in ML frameworks. So when you write loss = -sum(y * log(p) + (1-y) * log(1-p)), congratulations: you are doing MLE.
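A quick numeric sanity check of that claim, with made-up labels and predicted probabilities:

```python
import numpy as np

# Made-up labels and predicted probabilities for five samples.
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # log-likelihood given these p_i
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))      # framework-style cross-entropy

print(np.isclose(bce, -log_lik))  # True: minimizing BCE = maximizing likelihood
```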


Optimization: how to get from formula to numbers

We want w_hat = argmax_w ℓ(w) (or equivalently argmin_w -ℓ(w)). Important characteristics:

  • The negative log-likelihood for logistic regression is convex in w. Any local minimum is therefore global, and (barring perfectly separable data, where no finite minimizer exists) optimization is stable — breathe easy.
  • Common optimizers:
    • Gradient descent / SGD (works well for large datasets)
    • Newton-Raphson / IRLS (Iteratively Reweighted Least Squares) — faster convergence for moderate-size data because it uses curvature (Hessian)

Gradient (vector):

∇ℓ(w) = Σ_{i=1}^n (y_i - p_i) x_i  where p_i = sigma(w^T x_i)

Hessian (matrix):

H(w) = - Σ_{i=1}^n p_i (1 - p_i) x_i x_i^T

Newton update (one step, maximizing ℓ):

w_new = w_old - H(w_old)^{-1} ∇ℓ(w_old)

(Because H is negative definite, this step moves uphill on ℓ.)

IRLS uses this to reframe logistic regression as a sequence of weighted least-squares problems — very 19th-century-math-meets-modern-computing.
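Both displayed formulas drop straight into NumPy. A minimal Newton-Raphson fit on synthetic data (the data, seed, and iteration count are all illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Synthetic dataset drawn from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)

w = np.zeros(3)
for _ in range(10):                        # Newton-Raphson iterations
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)                   # grad ell(w) = sum (y_i - p_i) x_i
    H = -(X.T * (p * (1 - p))) @ X         # H(w) = -sum p_i (1-p_i) x_i x_i^T
    w = w - np.linalg.solve(H, grad)       # w_new = w - H^{-1} grad ell

print(w)  # estimates land near true_w, up to sampling noise
```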


MLE, regularization, and the MAP perspective — the bridge to Regression II

Remember when in Regression II we used ridge to keep coefficients from going wild? Regularization has a probabilistic interpretation.

  • Put a Gaussian prior on w: p(w) ∝ exp(-λ ||w||^2 / 2). Maximizing the posterior p(w | data) is then equivalent to maximizing the penalized log-likelihood ℓ(w) - λ ||w||^2 / 2: this is MAP, not MLE.

  • That means ridge regression = MAP estimate with Gaussian prior. In logistic regression, adding λ ||w||^2 to the loss is the same idea — keep parameters conservative, prevent overfitting, improve generalization.

Table: Likelihood vs Penalized Likelihood

Objective            | Equivalent to            | Effect
-ℓ(w)                | MLE                      | Fit the model to the data only
-ℓ(w) + λ ||w||^2    | MAP with Gaussian prior  | Fit the data while shrinking coefficients

So the previous section on regularization is not an unrelated laundry list — it’s the same probabilistic framework with a little prior spice.
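To see the shrinkage in action, here is a minimal sketch (the data, λ values, learning rate, and the fit helper are all made up for illustration) that minimizes the penalized negative log-likelihood by gradient descent and compares coefficient norms:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit(X, y, lam, steps=2000, lr=0.1):
    """Gradient descent on the average NLL plus lam * ||w||^2 (the MAP objective)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = -X.T @ (y - p) / n + 2 * lam * w   # gradient of NLL/n + lam * ||w||^2
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < sigmoid(X @ np.array([2.0, -1.0, 0.5, 0.0]))).astype(float)

w_mle = fit(X, y, lam=0.0)    # plain MLE
w_map = fit(X, y, lam=0.5)    # Gaussian prior: shrunken coefficients

print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # True: the prior shrinks w
```

The same idea is exposed in libraries such as scikit-learn, where LogisticRegression's C parameter plays the role of an inverse regularization strength (roughly 1/λ).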


Intuition, examples, and practical gotchas

  • Why MLE? Because we want parameters that make the observed labels most probable. It’s like tuning the knobs so the model could have plausibly produced what we actually saw.

  • Calibration: MLE-trained logistic models often give well-calibrated probabilities, because training matches predicted probabilities to observed label frequencies. But beware of class imbalance — the likelihood will prioritize fitting the majority class unless you weight or adjust.

  • Class imbalance fix: a weighted log-likelihood, with a per-sample weight c_i (e.g., larger for minority-class samples):

ℓ(w) = Σ_{i=1}^n c_i [ y_i log p_i + (1-y_i) log(1-p_i) ]
  • Overfitting fix: regularize. (See the bridge above.)

  • Numerical stability: compute log sigma and log(1 - sigma) carefully (use log-sum-exp tricks or stable sigmoid computations).
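On that last point, the identity log σ(t) = -log(1 + e^{-t}) lets you lean on a log-sum-exp primitive such as NumPy's logaddexp instead of ever forming e^{-t} on its own. A sketch:

```python
import numpy as np

def log_sigmoid(t):
    # log(sigma(t)) = -log(1 + exp(-t)), computed via log-sum-exp so it
    # never overflows; log(1 - sigma(t)) is just log_sigmoid(-t).
    return -np.logaddexp(0.0, -t)

t = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])

with np.errstate(over="ignore", divide="ignore"):
    naive = np.log(1.0 / (1.0 + np.exp(-t)))  # overflows: -inf at t = -1000

stable = log_sigmoid(t)
print(np.isfinite(stable).all())  # True: finite everywhere, even at t = -1000
```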


Asymptotics and Uncertainty

MLE is not just a point estimate. Under standard regularity conditions:

  • w_hat is consistent and asymptotically normal: w_hat ~ N(w_true, I(w_true)^{-1} / n), where I is the Fisher information.
  • The covariance of the estimate is approximately (-H(w_hat))^{-1}. That’s how you get standard errors and Wald tests for coefficients.

In short: MLE gives you parameters and a built-in way to quantify how sure you are about them.
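As a sketch of how that works in practice (toy data and a quick Newton fit, all values illustrative): invert the observed information -H at w_hat and read standard errors off its diagonal.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy data: an intercept column plus two standard-normal features.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = (rng.random(500) < sigmoid(X @ np.array([0.5, 1.0, -1.5]))).astype(float)

# Fit w_hat with a few Newton steps (converges quickly here).
w = np.zeros(3)
for _ in range(15):
    p = sigmoid(X @ w)
    H = -(X.T * (p * (1 - p))) @ X            # Hessian of ell(w)
    w -= np.linalg.solve(H, X.T @ (y - p))    # w - H^{-1} grad ell

# Asymptotic covariance: inverse of the observed information (-H) at w_hat.
p = sigmoid(X @ w)
cov = np.linalg.inv((X.T * (p * (1 - p))) @ X)
se = np.sqrt(np.diag(cov))                    # standard error per coefficient
z = w / se                                    # Wald z-statistics

print(np.round(se, 3))
```

A 95% Wald interval for each coefficient is then w_hat ± 1.96 · se.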


Algorithm sketch (IRLS / Newton for logistic)

  1. Initialize w = 0 (or small random)
  2. Repeat until convergence:
    • Compute p_i = sigma(w^T x_i)
    • Form weight matrix W = diag(p_i (1-p_i))
    • Compute working response z = X w + W^{-1} (y - p)
    • Solve weighted least squares w_new = (X^T W X)^{-1} X^T W z
  3. Output w

(If n is huge, use SGD instead. If X is poorly scaled, standardize first.)
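The sketch above, written out in NumPy (toy data; the solve-based weighted least squares is one of several equivalent implementations):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls(X, y, n_iter=15):
    """IRLS for logistic regression, following the sketch above."""
    w = np.zeros(X.shape[1])                  # step 1: initialize
    for _ in range(n_iter):                   # step 2: iterate
        p = sigmoid(X @ w)
        s = p * (1 - p)                       # diagonal of the weight matrix W
        z = X @ w + (y - p) / s               # working response X w + W^{-1}(y - p)
        # Weighted least squares: w = (X^T W X)^{-1} X^T W z
        w = np.linalg.solve((X.T * s) @ X, X.T @ (s * z))
    return w                                  # step 3: output w

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = (rng.random(400) < sigmoid(X @ np.array([1.0, -0.5, 2.0]))).astype(float)

print(irls(X, y))  # estimates near [1.0, -0.5, 2.0], up to sampling noise
```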


Quick checklist before you run MLE on your next dataset

  • Standardize features (helps optimization and priors make sense)
  • Consider class weights for imbalanced labels
  • Add regularization (λ) — tune by cross-validation
  • Use appropriate optimizer: SGD for scale, Newton/IRLS for speed on small-medium data
  • Compute standard errors via the Hessian or bootstrap if you care about inference

Final mic-drop takeaways

  • MLE = choose parameters that maximize the probability of the observed labels. For logistic regression this is exactly what binary cross-entropy does.
  • Convex problem → global optimum, lots of nice math, practical stability.
  • Regularization = MAP: regularized logistic regression is just Bayesian common sense in optimization clothing.
  • Optimization choices: SGD for big data, Newton/IRLS for small-medium. Watch scaling, balance, and numerical stability.

Now go forth: fit models, question assumptions, and when someone asks whether you used MLE, answer with confidence and a faintly theatrical bow.
