
Supervised Machine Learning: Regression and Classification
Classification I: Logistic Regression and Probabilistic View


Model class probabilities with logistic regression and related probabilistic classifiers.


One-vs-Rest and Multinomial Logistic — Where Binary Meets the Party

"If logistic regression is your sober friend who’s perfect at saying yes/no, One-vs-Rest is the friend who throws multiple solo opinions into the group chat. Multinomial softmax? That's the friend who mediates and forces everyone to agree on probabilities."

You're coming in hot, already familiar with linear decision boundaries and how regularization tames wild weight magnitudes (yes, we saw that in Decision Boundaries and Geometry and Regularized Logistic Regression). Now we escalate: how do we go from a polite binary classifier to handling many classes without turning your model into a drunken game of rock-paper-scissors? Welcome to the clash (and collaboration) of One-vs-Rest (OvR) and Multinomial (Softmax) Logistic Regression.


Quick reminder (so we don't re-teach the same thing)

  • In binary logistic regression we model P(y=1|x) = sigmoid(w^T x). Decision boundaries are linear (w^T x + b = 0). You already know how L2/L1 regularization keeps weights sane.
  • Multiclass needs either many binaries or one joint model. Both have geometry implications: linear boundaries, but the arrangement and calibration differ.
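The binary reminder above can be sanity-checked in a few lines of NumPy — the weights and input here are made up purely for illustration:

```python
import numpy as np

# Hypothetical weights and bias for a 2-feature binary classifier.
w = np.array([1.5, -2.0])
b = 0.5

def sigmoid(z):
    """Map a raw linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 1.0])   # one example
score = w @ x + b          # w^T x + b = 1.5*2 - 2.0*1 + 0.5 = 1.5
p = sigmoid(score)         # P(y=1 | x) ≈ 0.818
print(round(p, 3))
```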

The two main approaches (spoiler: pick your battles)

1) One-vs-Rest (OvR) — The DIY multiclass

  • Idea: For K classes, train K independent binary classifiers. For classifier k, relabel the data as y_k = 1 if class == k, else 0.
  • At prediction: compute each classifier's score (or probability), then choose argmax:
for k in 1..K: score_k = w_k^T x  (or p_k = sigmoid(score_k))
predict = argmax_k score_k  (or argmax_k p_k)

Pros:

  • Simple conceptually and implementation-easy.
  • Highly parallelizable — train each classifier independently.
  • Works well when classes are nearly linearly separable from the rest.

Cons:

  • Probabilities from independent sigmoids are not normalized: sum_k p_k != 1, so they can be poorly calibrated.
  • Overlapping decisions can cause ties/contradictions (two classifiers both strongly positive).
  • If classes are imbalanced or mutually exclusive in subtle ways, OvR can mislead.

Geometric intuition: Each classifier cuts the space with a hyperplane. The region predicted for class k is where its linear score outranks the rest (not necessarily where p_k > 0.5). Decision regions are intersections of multiple half-spaces; boundaries can be weird and non-transitive.
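A minimal DIY sketch of OvR with scikit-learn — K independent binary fits plus an argmax. The toy blobs dataset is invented for illustration; note the per-class sigmoid probabilities are independent and need not sum to 1:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Invented toy 3-class problem.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# DIY One-vs-Rest: one binary classifier per class, trained independently.
classifiers = []
for k in range(3):
    clf = LogisticRegression()
    clf.fit(X, (y == k).astype(int))   # class k vs. the rest
    classifiers.append(clf)

# Independent sigmoid probabilities for a single point, then argmax.
x0 = X[:1]
p = np.array([clf.predict_proba(x0)[0, 1] for clf in classifiers])
pred = int(np.argmax(p))
print(p, "sum =", p.sum(), "prediction =", pred)
```

(scikit-learn also ships `sklearn.multiclass.OneVsRestClassifier`, which wraps exactly this loop.)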


2) Multinomial Logistic Regression (Softmax) — Single model, socially normalized

  • Idea: Model a vector of scores s = W^T x (size K). Convert scores to probabilities with softmax:
P(y=k|x) = softmax(s)_k = e^{s_k} / \sum_{j=1}^{K} e^{s_j}
  • We train W jointly by minimizing the multiclass cross-entropy (negative log-likelihood):
L(W)= -\sum_{i} \sum_{k} y_{i,k} \log P(y=k|x_i)

(Where y_{i,k} is 1 if example i has class k else 0.)

Pros:

  • Probabilities are normalized and coherent (they sum to 1). Better calibration.
  • Trains jointly: model learns inter-class geometry. Often gives better performance when classes compete.
  • Gradient has a compact matrix form: grad_W = X^T (P - Y). Nice for vectorized implementations.

Cons:

  • Slightly more complex to implement from scratch (but modern libraries do it for you).
  • Training is a single optimization problem — less trivially parallel across classes.

Geometric intuition: Each class corresponds to a weight vector w_k. The decision boundaries are where two linear scores tie: w_i^T x = w_j^T x, which are hyperplanes between classes. Because probabilities are normalized, the model learns relative scores — pushing one class up implicitly pushes others down.


When to use which? (Practical cheat sheet)

| Situation | Use OvR | Use Softmax/Multinomial |
| --- | --- | --- |
| Many classes, quick baseline or highly parallel training | ✅ | ⚠️ (works, but a single joint model) |
| Need calibrated probabilities that sum to 1 | ❌ | ✅ |
| Classes are nicely separable individually vs. rest | ✅ | ✅ |
| Classes are mutually exclusive and you want competition modeled | ⚠️ | ✅ |
| Extremely imbalanced classes | ⚠️ (can suffer) | ✅ (with class weights) |

A tiny numeric intuition — probabilities that don't play nice

Imagine three classes whose OvR sigmoids report probabilities p = [0.9, 0.8, 0.7]. Those are three independent "I love my class" numbers; they don't sum to 1, and they can't all be right. Softmax, by contrast, transforms the score vector s = [2, 1, 0] into normalized probabilities:

softmax([2,1,0]) = [e^2/(e^2+e^1+e^0), e^1/..., e^0/...] ≈ [0.67, 0.24, 0.09]

Now that’s accountability: one winner, others get proportionally lowered.
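You can verify those numbers with a few lines of NumPy; subtracting the max before exponentiating is the standard numerical-stability trick and doesn't change the result:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.0]))
print(np.round(probs, 2))   # roughly [0.67 0.24 0.09], summing to 1
```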


Regularization, geometry, and what you learned previously

Remember how L2 shrinks weight norms and thus softens the probability transition around the decision boundary? Same deal here:

  • With OvR you regularize each binary classifier separately (e.g., L2 on each w_k). This controls overfitting per classifier but may not globally coordinate the class geometry.
  • With multinomial softmax you regularize the whole weight matrix W (often Frobenius norm, i.e., sum of squares across entries). That lets regularization influence inter-class relationships: preventing one class from dominating by huge weights that warp all decision regions.

If you used elastic net or specialized regularizers in Regression II to control sparsity/structure, you can do the same in multiclass — e.g., group lasso to encourage feature selection consistent across classes.


Training recipes & pseudocode

OvR (parallelizable):

for k in 1..K (parallel):
  train binary logistic classifier on labels (y==k) with chosen regularizer
predict: argmax_k w_k^T x

Multinomial (joint):

Initialize W (K x D)
repeat until convergence:
  compute scores S = X W^T (N x K)
  compute P = softmax(S) (row-wise)
  compute gradient G = (P - Y)^T X  # shape K x D, matching W
  update W <- W - eta*(G + lambda*W)
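Fleshed out in NumPy on invented toy data, the joint recipe looks like this (the bias is handled by appending a constant feature, and the learning rate, iteration count, and lambda are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy data: 3 Gaussian blobs of 50 points each in 2-D.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
X = np.hstack([X, np.ones((len(X), 1))])   # append bias feature -> D = 3
y = np.repeat(np.arange(3), 50)
Y = np.eye(3)[y]                           # one-hot labels, shape N x K

def softmax_rows(S):
    """Row-wise softmax with the usual max-subtraction for stability."""
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

W = np.zeros((3, 3))                       # K x D weight matrix
eta, lam = 0.5, 1e-3
for _ in range(500):
    P = softmax_rows(X @ W.T)              # N x K predicted probabilities
    G = (P - Y).T @ X / len(X)             # K x D gradient, matches W's shape
    W -= eta * (G + lam * W)               # gradient step with L2 shrinkage

acc = (softmax_rows(X @ W.T).argmax(axis=1) == y).mean()
print("training accuracy:", acc)
```

Note how the gradient `(P - Y)^T X` is the compact matrix form mentioned earlier (transposed here so it matches a K x D weight matrix), and how the L2 term just adds `lam * W` to it.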

Common gotchas and spicy questions

  • Q: "Is OvR broken because probabilities don't sum to 1?" A: Not broken — just uncalibrated. If you only need argmax, it can be fine. If you need reliable probabilities (e.g., for downstream decision-making), prefer softmax.
  • Q: "Can OvR and softmax give different predictions?" A: Absolutely. Because OvR classifiers are trained independently, their relative scaling can change argmax.
  • Q: "Which is faster?" A: OvR can be faster if you parallelize, but softmax tends to be more sample-efficient since it shares information across classes.

Wrap-up — TL;DR with a mic drop

  • One-vs-Rest: Simple, parallel, and useful as a quick baseline. Treat each class like its own little binary kingdom. But its probabilities are independent and can lie to you.
  • Multinomial (Softmax): One joint model that enforces consistency and calibrated class probabilities. Better when classes compete or you care about probabilities.

Final thought: If classification were a dinner party, OvR is everyone shouting their resume at the host. Multinomial is the host quietly weighing everybody’s merits and handing out one well-justified award.

Key takeaways:

  • OvR = K binary models; Multinomial = single vector-valued model with softmax.
  • Softmax yields normalized probabilities and joint training; OvR is simpler and more parallel.
  • Regularization and geometric intuition carry over — but the way weight norms affect inter-class geometry differs between the two.

Next up (if we were continuing the course): explore calibration techniques (Platt scaling, isotonic regression), and structured regularizers (group lasso) to control class-specific sparsity — because your model should win the contest and not just shout the loudest.
