Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Empirical Risk Minimization — The Empirical Hustle (But Make It Principled)
"Don't trust the training set — it's like trusting your ex: compelling, biased, and not the whole story."
You already met the troublemakers: underfitting and overfitting (we saw how too-simple models miss patterns and too-complex ones hallucinate them), and you already wrestled with the bias–variance trade-off (the eternal tug-of-war between systematic error and noisy sensitivity). Empirical Risk Minimization (ERM) is the core principle that sits under those phenomena. It’s both the simplest idea in supervised learning and the reason we need complexity control, regularization, and cross-validation — all the things that save us from beautiful but useless models.
What is ERM, in plain-ish English
- Risk (true risk): the expected loss of a predictor f over the unknown data distribution P(X, Y). Symbolically,
R(f) = E_{(X,Y)~P}[L(Y, f(X))]
- Empirical risk: what we can actually compute from our finite training set of n examples:
R_n(f) = (1/n) * sum_{i=1..n} L(y_i, f(x_i))
ERM says: pick the function f in your hypothesis class F that minimizes empirical risk R_n(f). That’s it. No ceremony. No occult rituals. Just minimize the average loss on the data you have.
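To make this concrete, here's a minimal sketch of computing R_n(f) with NumPy, assuming squared loss and a toy one-dimensional predictor (the data and the predictor are illustrative, not from any particular dataset):

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda y_true, y_pred: (y_true - y_pred) ** 2):
    """Average loss of predictor f over the n training examples."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

# Toy data: y = 2x, and a predictor that is off by a constant 0.1
X = np.array([0.0, 1.0, 2.0])
y = 2 * X
f = lambda x: 2 * x + 0.1

# Squared error is (0.1)^2 = 0.01 on every example, so the mean is ~0.01
print(empirical_risk(f, X, y))
```

ERM would then be the search over f in F for the function making this number smallest.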
Bold claim: ERM is the algorithmic manifestation of ‘learn from examples’. But it’s blunt — it will happily overfit if you let it.
Why ERM alone is both brilliant and dangerous
- Brilliant: It’s computationally straightforward (reduce to optimization), conceptually simple, and often works when your hypothesis class is appropriate.
- Dangerous: Minimizing R_n can chase noise. If F is too large/expressive, the minimizer of R_n may be a perfect fit to training labels but terrible out-of-sample. That’s exactly where overfitting comes from — remember Position 4 in Foundations: Underfitting and Overfitting.
Quick mental image: ERM is like memorizing your study notes word-for-word because they nail every practice question in front of you. But the exam (the true distribution) asks slightly different questions.
How ERM links to bias and variance (Position 3 revisit)
- If F is tiny (low capacity), ERM yields a high-bias but low-variance predictor. It can't fit complex patterns — underfitting.
- If F is huge (high capacity), ERM can reduce bias (can fit training data tightly) but variance explodes — predictions wobble wildly with new samples.
So ERM is the stage; bias and variance are the actors. We control the play by choosing F or adding regularization.
Practical fixes: Penalized ERM and Structural Risk Minimization
Two siblings of ERM that keep it honest:
- Penalized ERM (aka regularization):
min_f R_n(f) + λ · Ω(f)
Ω(f) is a complexity penalty (e.g., ||w||^2 for linear models). λ tunes the bias–variance trade-off: bigger λ → simpler model (more bias, less variance).
- Structural Risk Minimization (SRM): Arrange hypothesis classes F_1 ⊂ F_2 ⊂ ... and choose the class that minimizes a bound on true risk (Vapnik's idea). Equivalent intuition: constrain capacity first, then do ERM.
Both approaches inject prior preference for simpler functions — because simpler rules generalize better unless data says otherwise.
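As a sketch of penalized ERM for a linear model with squared loss and Ω(w) = ||w||² (i.e. ridge regression), the minimizer has a closed form, and a larger λ visibly shrinks the weights (the data here is synthetic, generated just for the demo):

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Penalized ERM for a linear model with squared loss and Ω(w) = ||w||^2:
    minimizes (1/n)||Xw - y||^2 + lam * ||w||^2; setting the gradient to zero
    gives (X^T X / n + lam * I) w = X^T y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w_plain = ridge_erm(X, y, lam=0.0)   # plain ERM (ordinary least squares)
w_ridge = ridge_erm(X, y, lam=10.0)  # heavy penalty -> smaller weights
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The shrunken weights are exactly the "more bias, less variance" end of the trade-off that λ tunes.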
When does ERM actually work? (A bit of theory — no PhD required)
ERM is consistent if the empirical risk converges uniformly to the true risk over the hypothesis class F. In plain terms:
- If sup_{f in F} |R_n(f) - R(f)| → 0 as n → ∞ (uniform convergence), then the ERM minimizer approaches the best-in-class true-risk minimizer.
Tools that quantify this convergence: VC dimension, Rademacher complexity, covering numbers. Short story: smaller complexity → faster convergence → ERM is safer.
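For a single fixed f, the law of large numbers already gives R_n(f) → R(f); uniform convergence strengthens this to hold over all of F simultaneously. A toy Monte Carlo illustration of the single-f case (not uniform convergence, just one predictor): take Y ~ N(0, 1), f ≡ 0, and squared loss, so the true risk is exactly E[Y²] = 1, and watch the gap shrink with n:

```python
import numpy as np

rng = np.random.default_rng(42)

def empirical_risk(n):
    """R_n for the constant predictor f(x) = 0 under squared loss,
    with labels drawn from N(0, 1); the true risk R(f) = E[Y^2] = 1."""
    y = rng.normal(size=n)
    return np.mean((y - 0.0) ** 2)

# Gap between empirical and true risk at increasing sample sizes
gaps = {n: abs(empirical_risk(n) - 1.0) for n in (10, 1_000, 100_000)}
print(gaps)
```

The gap at n = 100,000 is tiny; for small n the empirical risk is a noisy estimate, which is precisely why trusting R_n blindly on small samples is risky.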
Surrogate losses, classification, and convexity
For classification, the 0–1 loss is the true target but it's non-convex and hard to optimize. ERM with 0–1 loss is intractable, so we use surrogate losses (hinge loss for SVMs, logistic loss for logistic regression). This is still ERM — just with a different loss that is more optimization-friendly and often offers good generalization guarantees.
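A small sketch comparing the 0–1 loss with two common surrogates, using labels y ∈ {−1, +1} and the margin y·f(x). Note that the hinge loss upper-bounds the 0–1 loss pointwise, while the logistic loss is a smooth convex proxy for it:

```python
import numpy as np

def zero_one(y, score):
    """The true classification target: 1 if the sign is wrong, else 0. Non-convex."""
    return float(y * score <= 0)

def hinge(y, score):
    """SVM surrogate: convex, and >= zero_one at every point."""
    return max(0.0, 1.0 - y * score)

def logistic(y, score):
    """Logistic-regression surrogate: smooth and convex."""
    return np.log(1.0 + np.exp(-y * score))

# (label, score) pairs: confident correct, unconfident wrong, wrong sign
for y, s in [(+1, 2.0), (+1, -0.5), (-1, 0.3)]:
    print(y, s, zero_one(y, s), hinge(y, s), round(logistic(y, s), 3))
```

Minimizing the hinge or logistic loss over a convex class is a tractable convex optimization problem, which is why these surrogates are the workhorses in practice.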
A simple worked example (polynomial regression)
Imagine trying to fit y = sin(x) + noise with polynomials of degree d. ERM on the training set will produce:
- d small → underfit (high bias)
- d large → near-zero training error but terrible on test (high variance)
Regularized ERM with a penalty on polynomial coefficients (ridge) nudges the solution toward smoother polynomials and reduces variance.
Pseudocode sketch:
```
for degree in 0..D:
    fit polynomial of degree `degree` with regularization λ via ERM
    estimate validation error
select the model with the smallest validation error
```
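A runnable version of this sketch in NumPy, using a ridge-penalized polynomial fit (the synthetic data, D = 10, and λ = 1e-3 are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-np.pi, np.pi, 30)
y_train = np.sin(x_train) + 0.2 * rng.normal(size=30)
x_val = rng.uniform(-np.pi, np.pi, 100)
y_val = np.sin(x_val) + 0.2 * rng.normal(size=100)

def fit_ridge_poly(x, y, degree, lam):
    """Ridge-penalized ERM over polynomials of the given degree."""
    X = np.vander(x, degree + 1)  # columns x^degree, ..., x, 1
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def mse(w, x, y):
    """Empirical risk (squared loss) of coefficients w on (x, y)."""
    return np.mean((np.vander(x, len(w)) @ w - y) ** 2)

# Fit each degree on the training set, score on the validation set
results = {}
for degree in range(11):  # degrees 0..D with D = 10
    w = fit_ridge_poly(x_train, y_train, degree, lam=1e-3)
    results[degree] = mse(w, x_val, y_val)

best_degree = min(results, key=results.get)
print(best_degree, results[best_degree])
```

Low degrees underfit the sine and score poorly on validation; the penalty keeps the high-degree fits from exploding, and the validation error picks a degree in between.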
Cross-validation is the practical way to estimate generalization risk when you don’t have infinite data.
A tiny comparison table
| Quantity | Meaning | Danger / Use |
|---|---|---|
| True risk R(f) | Expected loss over P | Gold standard (unknown) |
| Empirical risk R_n(f) | Average loss on training set | What ERM minimizes — can mislead if n small or F huge |
| Generalization gap | R(f) − R_n(f) | We want this small; complexity control helps |
Quick checklist: Using ERM well
- Choose a hypothesis class with appropriate capacity (not too small, not monstrous).
- Use regularization (penalized ERM) to control variance.
- Use surrogate losses when optimization of the true loss is infeasible.
- Validate: cross-validation or holdout sets to estimate generalization.
- Watch learning curves: if training and validation errors both high → increase capacity; if training error low but validation high → add regularization or reduce capacity.
Closing rant/insight
ERM is the workhorse of supervised learning: simple, powerful, and blunt. It’s the rule that says ‘fit what you see’, but generalization is the art of knowing when not to trust what you see. The real craft is balancing the class capacity, the penalty, the loss, and the data. Think of ERM as the baseline recipe — you can cook a decent meal with it, but seasoning (regularization, validation) makes it edible for anyone other than the training set.
Key takeaways:
- ERM minimizes empirical risk; good in principle, risky in practice without capacity control.
- Regularization and SRM are ways to keep ERM from overfitting — this is directly tied to the bias–variance trade-off.
- Uniform convergence/complexity measures tell you when ERM is theoretically safe.
Final thought: If ERM were a person, it would be the friend who repeats what everyone at the party says. Useful for gossip, disastrous if you want original insight. Train it well.