Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Loss Functions Overview — Why your model cries when it’s wrong (and sometimes when it’s right)
"Pick your loss like you pick your battles: strategically, and with the memory of past trauma (outliers)."
You already know Empirical Risk Minimization (ERM): we choose a model that minimizes average loss on the training set. You also learned about underfitting and overfitting — the classic tug-of-war between bias and variance. Loss functions are the battleground. They determine what mistakes matter, how harshly we punish them, and how nicely optimization algorithms behave. This page takes you on a tour of the most common loss functions, why they exist, and what they mean for performance and generalization.
Quick map: what we want from a loss
- Signal: Loss must reflect modelling goals (e.g., penalize confident mistakes in classification).
- Optimizable: Prefer convex and differentiable losses for easier training.
- Robustness: Some losses shrug off outliers; others get dragged around by them.
- Probabilistic meaning: Some losses correspond to maximum likelihood under a noise model.
Think of loss as the referee’s whistle: loud and clear when a foul happens (large error), and consistent so players learn to play better.
Part A — Regression losses (real-valued targets)
1) Squared error / Mean Squared Error (MSE)
- Definition: L(y, ŷ) = (y − ŷ)^2 (or averaged over dataset → MSE)
- Intuition: Penalizes big errors heavily (quadratic). It’s the drama queen of losses.
- Properties: Convex, differentiable, corresponds to Gaussian noise assumption (MLE).
- When to use: Standard regression, when you want to penalize large deviations.
- Caveat: Sensitive to outliers.
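The outlier sensitivity is easy to see numerically. A minimal NumPy sketch (the toy arrays are invented for illustration):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of squared residuals."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0, 100.0])    # last point is a wild outlier
y_hat = np.array([1.1, 1.9, 3.2, 3.0])
# The outlier's residual of 97 contributes 97^2 = 9409 to the sum,
# dwarfing every other term and dominating the average.
loss = mse(y, y_hat)
```

One bad point drags the mean loss into the thousands even though three of four predictions are nearly perfect.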
2) Mean Absolute Error (MAE / L1)
- Definition: L(y, ŷ) = |y − ŷ|
- Intuition: Treats errors linearly. Less dramatic than MSE.
- Properties: Convex, but nondifferentiable at 0 (subgradients exist). Corresponds to Laplace noise (MLE).
- When to use: If you want robustness to outliers or care about median behavior.
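For contrast, the same illustrative toy data under MAE, where the outlier counts linearly rather than quadratically:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y - y_hat))

y = np.array([1.0, 2.0, 3.0, 100.0])    # last point is a wild outlier
y_hat = np.array([1.1, 1.9, 3.2, 3.0])
# The outlier contributes |97| = 97 linearly, not 9409 quadratically.
loss = mae(y, y_hat)
```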
3) Huber loss — best of both worlds
- Definition: With residual r = y − ŷ, L_δ(r) = ½r² if |r| ≤ δ, else δ(|r| − ½δ). Quadratic near zero error, linear past the threshold δ.
- Intuition: Be gentle on small errors (MSE-like), but stop letting huge errors dominate (MAE-like).
- When to use: You want differentiability and robustness.
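A sketch of the piecewise definition (quadratic inside δ, linear outside, with the two branches matched at |r| = δ so the loss stays smooth):

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond."""
    r = y - y_hat
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)   # matches quad at |r| = delta
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))
```

A small residual of 0.5 costs 0.5 · 0.5² = 0.125 (MSE-like); a residual of 3 with δ = 1 costs only 1 · (3 − 0.5) = 2.5 instead of the 4.5 that squared loss would charge.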
4) Quantile loss
- Definition: L_q(y, ŷ) = max(q(y − ŷ), (q − 1)(y − ŷ)) — the "pinball" loss, which penalizes under- and over-prediction asymmetrically.
- Used when you care about predicting a quantile (e.g., 90th-percentile demand). Useful for heteroskedastic noise.
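A minimal sketch of the pinball loss: for q = 0.9, under-predicting by one unit costs 0.9, while over-predicting by one unit costs only 0.1, which pushes the fitted value toward the 90th percentile:

```python
import numpy as np

def quantile_loss(y, y_hat, q=0.9):
    """Pinball loss: asymmetric penalty targeting the q-th quantile."""
    r = y - y_hat
    # q * r when under-predicting (r > 0), (q - 1) * r when over-predicting
    return np.mean(np.maximum(q * r, (q - 1) * r))
```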
Part B — Classification losses (discrete labels)
1) 0–1 loss (the ground truth, but impractical)
- Definition: L(y, ŷ) = 1 if misclassified, else 0.
- Intuition: Exactly what we evaluate at test time for accuracy, but it’s discontinuous and nonconvex → terrible for optimization.
- Use: Conceptual; not used for gradient-based training.
2) Logistic loss / Cross-Entropy
- Definition (binary): L = −[y log p + (1−y) log(1−p)] where p = σ(f(x))
- Intuition: Punishes confident but wrong predictions heavily.
- Properties: Convex in the linear predictor for binary logistic, differentiable, probabilistic (MLE under Bernoulli).
- When to use: Default for probabilistic classifiers and neural nets for classification.
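A sketch of the binary case, following the definition above (p = σ(f(x)); the clipping is a standard numerical guard against log(0)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, scores, eps=1e-12):
    """Binary cross-entropy on raw scores f(x); p = sigmoid(f)."""
    p = np.clip(sigmoid(scores), eps, 1 - eps)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

At a score of 0 (p = 0.5) the loss is log 2 ≈ 0.693; a confidently wrong score (say y = 1 but f = −5) costs far more, which is exactly the "punishes confident mistakes" behavior.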
3) Hinge loss (SVM family)
- Definition: L(y, f) = max(0, 1 − y f)
- Intuition: Wants not only correct classification but a margin of confidence.
- Properties: Convex, but not differentiable at the hinge; encourages large margins.
- When to use: Support vector machines and margin-based learning.
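The margin behavior in code (a minimal sketch; labels are the conventional ±1):

```python
import numpy as np

def hinge(y, f):
    """Hinge loss; labels y in {-1, +1}, f is the raw score."""
    return np.mean(np.maximum(0.0, 1.0 - y * f))
```

A correct prediction with margin y·f ≥ 1 costs exactly zero; a correct-but-timid prediction (0 < y·f < 1) still pays, which is how the loss demands confidence, not just correctness.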
4) Softmax + Categorical Cross-Entropy
- Definition: Multi-class generalization of logistic loss. Softmax converts scores to probabilities; cross-entropy compares to one-hot labels.
- When to use: Standard in multi-class classification.
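A sketch of the multi-class case (the max-subtraction is the standard trick to keep the exponentials from overflowing; it doesn't change the result):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(labels, logits):
    """labels: integer class indices; logits: raw scores per class."""
    p = softmax(logits)
    n = len(labels)
    # pick out the predicted probability of each true class
    return -np.mean(np.log(p[np.arange(n), labels]))
```

With K classes and all-zero logits (a maximally uninformed model), the loss is log K — a handy sanity check when debugging a classifier at initialization.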
5) Focal loss (for class imbalance)
- Why: Down-weights well-classified examples so the model focuses on hard, minority-class examples.
- When to use: Highly imbalanced datasets (e.g., rare object detection).
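A sketch of the binary form: the (1 − p_t)^γ factor multiplies plain cross-entropy, so easy examples (p_t near 1) are scaled toward zero while hard ones keep most of their weight (γ = 2 is the commonly used default):

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)        # probability of the true class
    return np.mean(-((1 - p_t) ** gamma) * np.log(p_t))
```

An easy example (p_t = 0.9) gets its cross-entropy multiplied by 0.01, while a hard one (p_t = 0.1) keeps 81% of it — the model's gradient budget shifts to the hard cases.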
Quick table: property cheat-sheet
| Loss | Convex? | Differentiable? | Robust to outliers? | Probabilistic meaning |
|---|---|---|---|---|
| MSE | Yes | Yes | No | Gaussian noise (MLE) |
| MAE | Yes | Subgradients | Yes | Laplace noise (MLE) |
| Huber | Yes | Yes | Moderately | — |
| 0–1 | No | No | — | — |
| Logistic / Cross-Entropy | Often (w.r.t. scores) | Yes | Not robust to label noise | Bernoulli / Categorical MLE |
| Hinge | Yes | Subgradient | No | Margin-based view |
How loss choice links to ERM and over/underfitting
Remember ERM: empirical risk = average loss on training data. When you switch losses, you change the objective landscape — that affects model capacity and the kind of errors the optimizer prioritizes.
- An outlier-sensitive loss (MSE) can drive the model to overfit to those outliers, increasing variance.
- A robust loss (MAE, Huber) can reduce sensitivity to noisy points, which may improve generalization in messy real-world data.
- Strong margin losses (hinge) implicitly regularize by demanding confident separation.
Choosing a loss is part of the modeler’s toolkit for balancing bias and variance. If you obsess only about model class (polynomial degree, tree depth) but ignore loss, you’re missing half the picture.
Optimization & practicality
- Convex + smooth losses → easier guarantees, convex solvers or stable gradient descent.
- Nonconvex losses (or nonconvex models like deep nets) rely on optimization heuristics; the shape of the loss still matters for convergence speed.
Pseudocode: one gradient step for parameter θ under loss L
# gradient descent step
θ ← θ − η * (1/N) * Σ_i ∇_θ L(y_i, f(x_i; θ))
If ∇_θ L is large from an outlier (e.g., squared loss), that single point can hijack updates.
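To make the hijacking concrete, here is the pseudocode above instantiated for a 1-D linear model f(x; θ) = θx under squared loss (a minimal sketch with invented toy numbers):

```python
import numpy as np

def grad_step_mse(theta, x, y, eta=0.01):
    """One gradient-descent step on mean squared error for f(x) = theta * x."""
    residuals = theta * x - y
    grad = np.mean(2 * residuals * x)   # d/d_theta of mean squared error
    return theta - eta * grad

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 100.0])    # one wild outlier
theta = grad_step_mse(0.0, x, y)
# The outlier's gradient term 2*(0 - 100)*4 = -800 dominates the average
# (-207), so this single point dictates the direction and size of the step.
```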
Practical heuristics & checklist
- If your dataset has clear outliers or heavy tails, try MAE or Huber.
- For classification problems where you want probabilities, use cross-entropy.
- If classes are heavily imbalanced, consider focal loss or class-weighted cross-entropy.
- Want margins and interpretability? Hinge loss/SVMs are useful.
- Always monitor not just training loss but validation performance related to your real metric (accuracy, F1, MAE, etc.).
Closing — TL;DR and a moral
- Loss functions encode priorities: they tell the model what mistakes are sins and what are misdemeanors.
- They interact with ERM and regularization: choosing a loss is as important as choosing model complexity for avoiding under/overfitting.
- Optimization reality: prefer differentiable, well-behaved losses when you rely on gradient-based training, but don’t shy from hybrid losses (Huber) when data is messy.
Parting thought: if your model were a student, the loss is the syllabus. Make sure it grades what actually matters. Messy syllabus → confused student → weird exam performance.
Next up: pick one of these losses and we’ll practice — compare training curves, inspect gradients, and watch how outliers either get bullied or coddled. Ready to pick favorites and fight over them like academic roommates?