Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
One-vs-Rest and Multinomial Logistic — Where Binary Meets the Party
"If logistic regression is your sober friend who’s perfect at saying yes/no, One-vs-Rest is the friend who throws multiple solo opinions into the group chat. Multinomial softmax? That's the friend who mediates and forces everyone to agree on probabilities."
You're coming in hot, already familiar with linear decision boundaries and how regularization tames wild weight magnitudes (yes, we saw that in Decision Boundaries and Geometry and Regularized Logistic Regression). Now we escalate: how do we go from a polite binary classifier to handling many classes without turning your model into a drunken game of rock-paper-scissors? Welcome to the clash (and collaboration) of One-vs-Rest (OvR) and Multinomial (Softmax) Logistic Regression.
Quick reminder (so we don't re-teach the same thing)
- In binary logistic regression we model P(y=1|x) = sigmoid(w^T x + b). Decision boundaries are linear (w^T x + b = 0). You already know how L2/L1 regularization keeps weights sane.
- Multiclass needs either many binaries or one joint model. Both have geometry implications: linear boundaries, but the arrangement and calibration differ.
The two main approaches (spoiler: pick your battles)
1) One-vs-Rest (OvR) — The DIY multiclass
- Idea: For K classes, train K independent binary classifiers. For class k train: y_k = 1 if class==k, else 0.
- At prediction: compute each classifier's score (or probability), then choose argmax:
for k in 1..K: score_k = w_k^T x (or p_k = sigmoid(score_k))
predict = argmax_k score_k (or argmax_k p_k)
Pros:
- Simple conceptually and implementation-easy.
- Highly parallelizable — train each classifier independently.
- Works well when classes are nearly linearly separable from the rest.
Cons:
- Probabilities from independent sigmoids are not normalized: sum_k p_k != 1, so they can be poorly calibrated.
- Overlapping decisions can cause ties/contradictions (two classifiers both strongly positive).
- If classes are imbalanced or mutually exclusive in subtle ways, OvR can mislead.
Geometric intuition: Each classifier cuts the space with a hyperplane. The region predicted for class k is where its linear score outranks every other classifier's score (not necessarily where p_k > 0.5). Decision regions are intersections of half-spaces; because the classifiers never coordinated, the boundaries can be weird and non-transitive.
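To make the OvR recipe concrete, here is a minimal NumPy sketch: K independent binary logistic classifiers trained by gradient descent, then argmax over scores at prediction time. The names `train_ovr` and `predict_ovr` are illustrative, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ovr(X, y, K, eta=0.1, lam=1e-3, epochs=500):
    """One-vs-rest: K independent binary logistic classifiers.
    X: (N, D) features, y: (N,) integer labels in 0..K-1."""
    N, D = X.shape
    W = np.zeros((K, D))
    for k in range(K):                          # each classifier is trained independently
        t = (y == k).astype(float)              # 1 for class k, 0 for "the rest"
        w = np.zeros(D)
        for _ in range(epochs):
            p = sigmoid(X @ w)                  # per-example probability of "class k"
            grad = X.T @ (p - t) / N + lam * w  # logistic loss gradient + L2 penalty
            w -= eta * grad
        W[k] = w
    return W

def predict_ovr(W, X):
    return np.argmax(X @ W.T, axis=1)           # winner is the highest linear score
```

There is no intercept column here; append a column of ones to X if you want a bias term.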
2) Multinomial Logistic Regression (Softmax) — Single model, socially normalized
- Idea: Model a vector of scores s = W^T x (size K). Convert scores to probabilities with softmax:
P(y=k|x) = softmax_k(s) = \frac{e^{s_k}}{\sum_{j=1}^K e^{s_j}}.
- We train W jointly by minimizing the multiclass cross-entropy (negative log-likelihood):
L(W)= -\sum_{i} \sum_{k} y_{i,k} \log P(y=k|x_i)
(Where y_{i,k} is 1 if example i has class k else 0.)
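The softmax and cross-entropy above translate directly into a few lines of NumPy. This is a hedged sketch (function names are illustrative); it stores W as D x K so that P = softmax(X W):

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)  # shift rows for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, X, Y):
    """Mean multiclass negative log-likelihood.
    W: (D, K) weights, X: (N, D) features, Y: (N, K) one-hot labels."""
    P = softmax(X @ W)
    return -np.sum(Y * np.log(P + 1e-12)) / len(X)  # small epsilon guards log(0)
```

Sanity check: with W = 0 every class gets probability 1/K, so the loss is exactly log K.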
Pros:
- Probabilities are normalized and coherent (they sum to 1). Better calibration.
- Trains jointly: model learns inter-class geometry. Often gives better performance when classes compete.
- Gradient has a compact matrix form: grad_W = X^T (P - Y) (with W stored as D x K). Nice for vectorized implementations.
Cons:
- Slightly more complex to implement from scratch (but modern libraries do it for you).
- Training is a single optimization problem — less trivially parallel across classes.
Geometric intuition: Each class corresponds to a weight vector w_k. The decision boundaries are where two linear scores tie: w_i^T x = w_j^T x, which are hyperplanes between classes. Because probabilities are normalized, the model learns relative scores — pushing one class up implicitly pushes others down.
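One consequence of "the model learns relative scores": softmax is shift-invariant. Adding the same constant to every class score changes nothing, because only score differences (the pairwise boundaries w_i^T x = w_j^T x) matter. A quick check with a plain NumPy softmax:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # subtracting the max is itself a shift, and is safe
    return e / e.sum()

s = np.array([2.0, 1.0, 0.0])
print(softmax(s))                              # same result either way
print(softmax(s + 5.0))                        # shifting all scores changes nothing
# the boundary between classes i and j is where (w_i - w_j)^T x = 0
```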
When to use which? (Practical cheat sheet)
| Situation | Use OvR | Use Softmax/Multinomial |
|---|---|---|
| Many classes, quick baseline or highly parallel training | ✅ | ⚠️ (works but single model) |
| Need calibrated probabilities that sum to 1 | ❌ | ✅ |
| Classes are nicely separable individually vs rest | ✅ | ✅ |
| Classes are mutually exclusive and you want competition modeled | ⚠️ | ✅ |
| Extremely imbalanced classes | ⚠️ (can suffer) | ✅ (with class weights) |
A tiny numeric intuition — probabilities that don't play nice
Imagine three classes whose OvR sigmoids output p = [0.9, 0.8, 0.7]. Those are three independent "I love my class" numbers; they sum to 2.4, not 1, and they can't all be right. Softmax transforms the score vector s = [2, 1, 0] into normalized probabilities:
softmax([2,1,0]) = [e^2/(e^2+e^1+e^0), e^1/..., e^0/...] ≈ [0.67, 0.24, 0.09]
Now that’s accountability: one winner, others get proportionally lowered.
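You can verify both halves of this intuition in a couple of lines (the specific score values are just illustrative):

```python
import numpy as np

# OvR: three independent sigmoids -- nothing forces them to share a budget
p_ovr = 1.0 / (1.0 + np.exp(-np.array([2.2, 1.4, 0.85])))
print(p_ovr, p_ovr.sum())   # roughly [0.90, 0.80, 0.70], total well above 1

# Softmax: one normalized distribution over the same kind of scores
s = np.array([2.0, 1.0, 0.0])
p = np.exp(s) / np.exp(s).sum()
print(p, p.sum())           # roughly [0.67, 0.24, 0.09], total exactly 1
```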
Regularization, geometry, and what you learned previously
Remember how L2 shrinks weight norms and thus softens how sharply probabilities change across decision boundaries? Same deal here:
- With OvR you regularize each binary classifier separately (e.g., L2 on each w_k). This controls overfitting per classifier but may not globally coordinate the class geometry.
- With multinomial softmax you regularize the whole weight matrix W (often Frobenius norm, i.e., sum of squares across entries). That lets regularization influence inter-class relationships: preventing one class from dominating by huge weights that warp all decision regions.
If you used elastic net or specialized regularizers in Regression II to control sparsity/structure, you can do the same in multiclass — e.g., group lasso to encourage feature selection consistent across classes.
Training recipes & pseudocode
OvR (parallelizable):
for k in 1..K (parallel):
train binary logistic classifier on labels (y==k) with chosen regularizer
predict: argmax_k w_k^T x
Multinomial (joint):
Initialize W (K x D)
repeat until convergence:
compute scores S = X W^T (N x K)
compute P = softmax(S) (row-wise)
    compute gradient G = (P - Y)^T X   # shape K x D, matching W
update W <- W - eta*(G + lambda*W)
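The joint recipe above, fleshed out as a toy NumPy gradient-descent loop (a sketch, not a production optimizer; names are illustrative). It keeps the pseudocode's K x D convention for W:

```python
import numpy as np

def train_multinomial(X, y, K, eta=0.5, lam=1e-3, epochs=500):
    """Joint softmax regression. X: (N, D) features, y: (N,) labels in 0..K-1."""
    N, D = X.shape
    Y = np.eye(K)[y]                        # one-hot targets, (N, K)
    W = np.zeros((K, D))
    for _ in range(epochs):
        S = X @ W.T                         # scores, (N, K)
        S -= S.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)   # row-wise softmax
        G = (P - Y).T @ X / N               # gradient, (K, D) -- matches W
        W -= eta * (G + lam * W)            # step, with L2 (Frobenius) penalty
    return W

def predict_multinomial(W, X):
    return np.argmax(X @ W.T, axis=1)
```

Note that one matrix update moves every class's weights at once: pushing P toward Y for the true class necessarily pulls probability away from the others.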
Common gotchas and spicy questions
- Q: "Is OvR broken because probabilities don't sum to 1?" A: Not broken — just uncalibrated. If you only need argmax, it can be fine. If you need reliable probabilities (e.g., for downstream decision-making), prefer softmax.
- Q: "Can OvR and softmax give different predictions?" A: Absolutely. Because OvR classifiers are trained independently, their relative scaling can change argmax.
- Q: "Which is faster?" A: OvR can be faster if you parallelize, but softmax tends to be more sample-efficient since it shares information across classes.
Wrap-up — TL;DR with a mic drop
- One-vs-Rest: Simple, parallel, and useful as a quick baseline. Treat each class like its own little binary kingdom. But its probabilities are independent and can lie to you.
- Multinomial (Softmax): One joint model that enforces consistency and calibrated class probabilities. Better when classes compete or you care about probabilities.
Final thought: If classification were a dinner party, OvR is everyone shouting their resume at the host. Multinomial is the host quietly weighing everybody’s merits and handing out one well-justified award.
Key takeaways:
- OvR = K binary models; Multinomial = single vector-valued model with softmax.
- Softmax yields normalized probabilities and joint training; OvR is simpler and more parallel.
- Regularization and geometric intuition carry over — but the way weight norms affect inter-class geometry differs between the two.
Next up (if we were continuing the course): explore calibration techniques (Platt scaling, isotonic regression), and structured regularizers (group lasso) to control class-specific sparsity — because your model should win the contest and not just shout the loudest.