Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
One-vs-Rest and Multinomial Logistic — Where Binary Meets the Party
"If logistic regression is your sober friend who’s perfect at saying yes/no, One-vs-Rest is the friend who throws multiple solo opinions into the group chat. Multinomial softmax? That's the friend who mediates and forces everyone to agree on probabilities."
You're coming in hot, already familiar with linear decision boundaries and how regularization tames wild weight magnitudes (yes, we saw that in Decision Boundaries and Geometry and Regularized Logistic Regression). Now we escalate: how do we go from a polite binary classifier to handling many classes without turning your model into a drunken game of rock-paper-scissors? Welcome to the clash (and collaboration) of One-vs-Rest (OvR) and Multinomial (Softmax) Logistic Regression.
Quick reminder (so we don't re-teach the same thing)
- In binary logistic regression we model P(y=1|x) = sigmoid(w^T x + b). Decision boundaries are linear (w^T x + b = 0). You already know how L2/L1 regularization keeps weights sane.
- Multiclass needs either many binaries or one joint model. Both have geometry implications: linear boundaries, but the arrangement and calibration differ.
The two main approaches (spoiler: pick your battles)
1) One-vs-Rest (OvR) — The DIY multiclass
- Idea: For K classes, train K independent binary classifiers. For class k train: y_k = 1 if class==k, else 0.
- At prediction: compute each classifier's score (or probability), then choose argmax:
for k in 1..K: score_k = w_k^T x (or p_k = sigmoid(score_k))
predict = argmax_k score_k (or argmax_k p_k)
Pros:
- Simple conceptually and implementation-easy.
- Highly parallelizable — train each classifier independently.
- Works well when classes are nearly linearly separable from the rest.
Cons:
- Probabilities from independent sigmoids are not normalized: sum_k p_k != 1, so they can be poorly calibrated.
- Overlapping decisions can cause ties/contradictions (two classifiers both strongly positive).
- If classes are imbalanced or mutually exclusive in subtle ways, OvR can mislead.
Geometric intuition: Each classifier cuts the space with a hyperplane. The region predicted for class k is where its linear score outranks every other classifier's score (not necessarily where p_k > 0.5). Decision regions are intersections of half-spaces; because the classifiers never coordinated, the boundaries can be weird and non-transitive.
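To make the OvR recipe concrete, here is a minimal NumPy sketch: K independent binary logistic classifiers trained by gradient descent, then argmax over scores at prediction time. The names `train_ovr` and `predict_ovr` are illustrative, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ovr(X, y, K, eta=0.1, lam=1e-3, epochs=500):
    """One-vs-rest: K independent binary logistic classifiers.
    X: (N, D) features, y: (N,) integer labels in 0..K-1."""
    N, D = X.shape
    W = np.zeros((K, D))
    for k in range(K):                          # each classifier is trained independently
        t = (y == k).astype(float)              # 1 for class k, 0 for "the rest"
        w = np.zeros(D)
        for _ in range(epochs):
            p = sigmoid(X @ w)                  # per-example probability of "class k"
            grad = X.T @ (p - t) / N + lam * w  # logistic loss gradient + L2 penalty
            w -= eta * grad
        W[k] = w
    return W

def predict_ovr(W, X):
    return np.argmax(X @ W.T, axis=1)           # winner is the highest linear score
```

There is no intercept column here; append a column of ones to X if you want a bias term.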
2) Multinomial Logistic Regression (Softmax) — Single model, socially normalized
- Idea: Model a vector of scores s = W^T x (size K). Convert scores to probabilities with softmax:
P(y=k|x) = softmax_k(s) = \frac{e^{s_k}}{\sum_{j=1}^K e^{s_j}}.
- We train W jointly by minimizing the multiclass cross-entropy (negative log-likelihood):
L(W)= -\sum_{i} \sum_{k} y_{i,k} \log P(y=k|x_i)
(Where y_{i,k} is 1 if example i has class k else 0.)
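The softmax and cross-entropy above translate directly into a few lines of NumPy. This is a hedged sketch (function names are illustrative); it stores W as D x K so that P = softmax(X W):

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)  # shift rows for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, X, Y):
    """Mean multiclass negative log-likelihood.
    W: (D, K) weights, X: (N, D) features, Y: (N, K) one-hot labels."""
    P = softmax(X @ W)
    return -np.sum(Y * np.log(P + 1e-12)) / len(X)  # small epsilon guards log(0)
```

Sanity check: with W = 0 every class gets probability 1/K, so the loss is exactly log K.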
Pros:
- Probabilities are normalized and coherent (they sum to 1). Better calibration.
- Trains jointly: model learns inter-class geometry. Often gives better performance when classes compete.
- Gradient has a compact matrix form: grad_W = X^T (P - Y) (with W stored as D x K). Nice for vectorized implementations.
Cons:
- Slightly more complex to implement from scratch (but modern libraries do it for you).
- Training is a single optimization problem — less trivially parallel across classes.
Geometric intuition: Each class corresponds to a weight vector w_k. The decision boundaries are where two linear scores tie: w_i^T x = w_j^T x, which are hyperplanes between classes. Because probabilities are normalized, the model learns relative scores — pushing one class up implicitly pushes others down.
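One consequence of "the model learns relative scores": softmax is shift-invariant. Adding the same constant to every class score changes nothing, because only score differences (the pairwise boundaries w_i^T x = w_j^T x) matter. A quick check with a plain NumPy softmax:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # subtracting the max is itself a shift, and is safe
    return e / e.sum()

s = np.array([2.0, 1.0, 0.0])
print(softmax(s))                              # same result either way
print(softmax(s + 5.0))                        # shifting all scores changes nothing
# the boundary between classes i and j is where (w_i - w_j)^T x = 0
```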
When to use which? (Practical cheat sheet)
| Situation | Use OvR | Use Softmax/Multinomial |
|---|---|---|
| Many classes, quick baseline or highly parallel training | ✅ | ⚠️ (works but single model) |
| Need calibrated probabilities that sum to 1 | ❌ | ✅ |
| Classes are nicely separable individually vs rest | ✅ | ✅ |
| Classes are mutually exclusive and you want competition modeled | ⚠️ | ✅ |
| Extremely imbalanced classes | ⚠️ (can suffer) | ✅ (with class weights) |
A tiny numeric intuition — probabilities that don't play nice
Imagine three classes whose OvR sigmoids output p = [0.9, 0.8, 0.7]. Those are three independent "I love my class" numbers; they sum to 2.4, not 1, and they can't all be right. Softmax transforms the score vector s = [2, 1, 0] into normalized probabilities:
softmax([2,1,0]) = [e^2/(e^2+e^1+e^0), e^1/..., e^0/...] ≈ [0.67, 0.24, 0.09]
Now that’s accountability: one winner, others get proportionally lowered.
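You can verify both halves of this intuition in a couple of lines (the specific score values are just illustrative):

```python
import numpy as np

# OvR: three independent sigmoids -- nothing forces them to share a budget
p_ovr = 1.0 / (1.0 + np.exp(-np.array([2.2, 1.4, 0.85])))
print(p_ovr, p_ovr.sum())   # roughly [0.90, 0.80, 0.70], total well above 1

# Softmax: one normalized distribution over the same kind of scores
s = np.array([2.0, 1.0, 0.0])
p = np.exp(s) / np.exp(s).sum()
print(p, p.sum())           # roughly [0.67, 0.24, 0.09], total exactly 1
```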
Regularization, geometry, and what you learned previously
Remember how L2 shrinks weight norms and thus softens how sharply probabilities change across decision boundaries? Same deal here:
- With OvR you regularize each binary classifier separately (e.g., L2 on each w_k). This controls overfitting per classifier but may not globally coordinate the class geometry.
- With multinomial softmax you regularize the whole weight matrix W (often Frobenius norm, i.e., sum of squares across entries). That lets regularization influence inter-class relationships: preventing one class from dominating by huge weights that warp all decision regions.
If you used elastic net or specialized regularizers in Regression II to control sparsity/structure, you can do the same in multiclass — e.g., group lasso to encourage feature selection consistent across classes.
Training recipes & pseudocode
OvR (parallelizable):
for k in 1..K (parallel):
train binary logistic classifier on labels (y==k) with chosen regularizer
predict: argmax_k w_k^T x
Multinomial (joint):
Initialize W (K x D)
repeat until convergence:
compute scores S = X W^T (N x K)
compute P = softmax(S) (row-wise)
    compute gradient G = (P - Y)^T X   # shape K x D, matching W
update W <- W - eta*(G + lambda*W)
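The joint recipe above, fleshed out as a toy NumPy gradient-descent loop (a sketch, not a production optimizer; names are illustrative). It keeps the pseudocode's K x D convention for W:

```python
import numpy as np

def train_multinomial(X, y, K, eta=0.5, lam=1e-3, epochs=500):
    """Joint softmax regression. X: (N, D) features, y: (N,) labels in 0..K-1."""
    N, D = X.shape
    Y = np.eye(K)[y]                        # one-hot targets, (N, K)
    W = np.zeros((K, D))
    for _ in range(epochs):
        S = X @ W.T                         # scores, (N, K)
        S -= S.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)   # row-wise softmax
        G = (P - Y).T @ X / N               # gradient, (K, D) -- matches W
        W -= eta * (G + lam * W)            # step, with L2 (Frobenius) penalty
    return W

def predict_multinomial(W, X):
    return np.argmax(X @ W.T, axis=1)
```

Note that one matrix update moves every class's weights at once: pushing P toward Y for the true class necessarily pulls probability away from the others.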
Common gotchas and spicy questions
- Q: "Is OvR broken because probabilities don't sum to 1?" A: Not broken — just uncalibrated. If you only need argmax, it can be fine. If you need reliable probabilities (e.g., for downstream decision-making), prefer softmax.
- Q: "Can OvR and softmax give different predictions?" A: Absolutely. Because OvR classifiers are trained independently, their relative scaling can change argmax.
- Q: "Which is faster?" A: OvR can be faster if you parallelize, but softmax tends to be more sample-efficient since it shares information across classes.
Wrap-up — TL;DR with a mic drop
- One-vs-Rest: Simple, parallel, and useful as a quick baseline. Treat each class like its own little binary kingdom. But its probabilities are independent and can lie to you.
- Multinomial (Softmax): One joint model that enforces consistency and calibrated class probabilities. Better when classes compete or you care about probabilities.
Final thought: If classification were a dinner party, OvR is everyone shouting their resume at the host. Multinomial is the host quietly weighing everybody’s merits and handing out one well-justified award.
Key takeaways:
- OvR = K binary models; Multinomial = single vector-valued model with softmax.
- Softmax yields normalized probabilities and joint training; OvR is simpler and more parallel.
- Regularization and geometric intuition carry over — but the way weight norms affect inter-class geometry differs between the two.
Next up (if we were continuing the course): explore calibration techniques (Platt scaling, isotonic regression), and structured regularizers (group lasso) to control class-specific sparsity — because your model should win the contest and not just shout the loudest.