Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Bias–Variance Trade-off
The Bias–Variance Trade-off: Why Your Model Is Either Too Boring or Too Dramatic
You already know about inputs, targets, and the hypothesis space — congratulations, you have the toolbox. Now let’s decide whether we build a sensible chair or a Rube Goldberg contraption of a chair that collapses three days later.
Hook: The Tale of Two Models
Imagine two models predicting house prices from the same inputs. Model A always predicts the mean price. Model B fits every speck of noise in the training data — outliers, typos, ghosts of agents past. Model A is boring but steady. Model B is impressively specific and catastrophically wrong on new houses.
This is the bias–variance trade-off in a nutshell: simplicity vs flexibility, stability vs adaptability. We balance them to minimize error on new, unseen data — which is the whole point of supervised learning.
What is the bias–variance trade-off? (Short answer)
- Bias measures errors from erroneous assumptions in the learning algorithm. High bias => underfitting.
- Variance measures how much the model fluctuates for different training sets. High variance => overfitting.
- Irreducible noise is the part of the target variability you simply cannot predict from inputs (measurement error, hidden variables).
Mathematically (for squared error):
E[(y − f̂(x))^2] = (Bias[f̂(x)])^2 + Var[f̂(x)] + Noise
This decomposition is your north star when selecting models, hyperparameters, or regularization.
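The decomposition can be checked by simulation. The sketch below (plain NumPy; the sin(3x) ground truth, noise level, and polynomial degrees are all illustrative assumptions) draws many training sets, fits a rigid and a flexible model to each, and estimates bias² and variance of the prediction at a single query point:

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(3 * x)   # assumed ground truth, for illustration only
x_query = 0.5                      # the point where we inspect predictions
noise_sd = 0.3

def fit_predict(degree, n=30):
    """Draw one fresh training set, fit a polynomial, predict at x_query."""
    x = rng.uniform(-1, 1, n)
    y = f_true(x) + rng.normal(0, noise_sd, n)
    return np.polyval(np.polyfit(x, y, degree), x_query)

def bias2_and_variance(degree, trials=500):
    """Monte Carlo estimate of Bias^2 and Var of the prediction at x_query."""
    preds = np.array([fit_predict(degree) for _ in range(trials)])
    bias2 = (preds.mean() - f_true(x_query)) ** 2
    return bias2, preds.var()

b_lin, v_lin = bias2_and_variance(degree=1)    # rigid model: underfits
b_flex, v_flex = bias2_and_variance(degree=9)  # flexible model
```

With this setup the degree-1 fit shows the larger squared bias and the degree-9 fit the larger variance, mirroring the three terms of the decomposition (the noise term is fixed at noise_sd²).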
Why this matters (connecting to what you already know)
You’ve seen the hypothesis space idea earlier: the family of functions your learning algorithm can pick from. A tiny hypothesis space (e.g., linear functions) tends to have high bias. A gigantic hypothesis space (e.g., very deep neural networks, high-degree polynomials) tends to have high variance unless tamed.
Also remember the difference between supervised, unsupervised, and reinforcement learning: in supervised learning we care about generalizing from labeled examples. Bias and variance are all about generalization error — exactly the metric that separates supervised learning from, say, clustering weirdness.
Visual metaphors and intuition (because pictures are cheating in a good way)
- Think of bias as a systematic error: a miscalibrated ruler that always subtracts 5 cm. No matter how many measurements you take, the error remains.
- Think of variance as the shakiness of your hand. Each time you measure, the reading hops around. Average many shaky measurements and you might be close — but any one measurement can be all over the place.
Imagine throwing darts at a target:
- High bias, low variance: all darts cluster tightly, but far from the bullseye.
- Low bias, high variance: darts scatter around the bullseye — some hit it, many miss.
- Low bias, low variance: a tight, centered cluster — the dream.
Concrete examples
- Polynomial regression on a nonlinear trend
- Degree 1 (linear): high bias, low variance — underfits.
- Degree 15: low training error, high variance — overfits.
- k-Nearest Neighbors
- k large: smoother predictions, higher bias, lower variance.
- k = 1: model memorizes training points, very low bias but huge variance.
- Decision trees
- Very deep tree: low bias on training set, super high variance.
- Pruned shallow tree: higher bias, lower variance.
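These knobs are easy to turn numerically. Here is a minimal k-NN regression sketch in plain NumPy (the synthetic sine data is an assumption), contrasting k = 1 with a moderate k:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic noisy samples of a smooth trend (assumed for illustration).
x = rng.uniform(-1, 1, 300)
y = np.sin(3 * x) + rng.normal(0, 0.3, 300)
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]

def knn_predict(k, x_eval):
    """Plain k-NN regression: average the labels of the k closest points."""
    dist = np.abs(x_eval[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def train_test_mse(k):
    train = np.mean((knn_predict(k, x_train) - y_train) ** 2)
    test = np.mean((knn_predict(k, x_test) - y_test) ** 2)
    return train, test

train_1, test_1 = train_test_mse(k=1)     # memorizes the training set
train_10, test_10 = train_test_mse(k=10)  # smoother, slightly more biased
```

k = 1 interpolates its own training points, so its training error is exactly zero, yet on held-out data it typically loses to a moderate k: textbook high variance.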
Table: quick cheat sheet
| Model complexity | Typical bias | Typical variance | Concrete example |
|---|---|---|---|
| Low complexity | High | Low | Linear regression on complex curvy data |
| Medium | Moderate | Moderate | Regularized regression, pruned tree |
| High complexity | Low | High | Deep tree, high-degree polynomial |
How to measure and act (practical recipes)
Plot learning curves (training vs validation error as function of training size or complexity). They tell you whether you’re underfitting or overfitting.
- If both training and validation error are high and close: increase model capacity (reduce bias).
- If training error is low but validation error is high: reduce variance via regularization, more data, or ensembling.
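A numeric stand-in for the learning-curve plot (NumPy only; the synthetic task and the degree-15 model are assumptions): fit a flexible model at growing training sizes and compare training and validation error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data; the last 100 points serve as a fixed validation set.
x = rng.uniform(-1, 1, 500)
y = np.sin(3 * x) + rng.normal(0, 0.3, 500)
x_val, y_val = x[400:], y[400:]

def learning_curve(degree, sizes=(20, 50, 100, 200, 400)):
    """Rows of (n, training MSE, validation MSE) for each training size."""
    rows = []
    for n in sizes:
        coeffs = np.polyfit(x[:n], y[:n], degree)
        train = np.mean((np.polyval(coeffs, x[:n]) - y[:n]) ** 2)
        val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        rows.append((n, train, val))
    return rows

curve = learning_curve(degree=15)
```

As n grows, training error creeps up while the train/validation gap shrinks: the signature of variance being tamed by sample size.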
Cross-validation: your empirical oracle for estimating generalization. Use it to tune complexity.
Pseudocode: simple grid search with CV
for each hyperparameter value h in grid:
    for each of the k folds:
        train model M_h on the remaining k−1 folds
        record M_h's error on the held-out fold
    average the k fold errors to get CV(h)
select h* with the smallest CV(h*)
retrain M_h* on the full training data
evaluate once on the untouched test set
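The recipe translates almost line for line into code. A sketch in plain NumPy, using polynomial degree as the hyperparameter (the synthetic task and the grid are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression task (an assumption for illustration).
x = rng.uniform(-1, 1, 120)
y = np.sin(3 * x) + rng.normal(0, 0.3, 120)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

def cv_error(degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    folds = np.array_split(np.arange(len(x_train)), k)
    errs = []
    for val_idx in folds:
        fit_idx = np.setdiff1d(np.arange(len(x_train)), val_idx)
        coeffs = np.polyfit(x_train[fit_idx], y_train[fit_idx], degree)
        pred = np.polyval(coeffs, x_train[val_idx])
        errs.append(np.mean((pred - y_train[val_idx]) ** 2))
    return float(np.mean(errs))

grid = [1, 2, 3, 5, 9, 15]
best_degree = min(grid, key=cv_error)              # h with smallest avg CV error
final = np.polyfit(x_train, y_train, best_degree)  # retrain on full training data
test_mse = np.mean((np.polyval(final, x_test) - y_test) ** 2)
```

Note that the test set is touched exactly once, at the very end; reusing it during tuning would quietly turn it into a second validation set.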
Ways to reduce bias or variance (and the trade-offs)
To reduce bias (combat underfitting):
- Increase model complexity (richer hypothesis space)
- Add more informative features or interactions
- Reduce regularization strength
To reduce variance (combat overfitting):
- Add regularization (Ridge, Lasso) — penalize large weights
- Gather more data (often the most reliable way to curb variance)
- Use ensembling (bagging reduces variance; boosting reduces bias)
- Simplify the model (prune trees, reduce degree)
Note: Some techniques help both sides in practice. Feature engineering can reduce bias and variance by making patterns more learnable.
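As one concrete regularization lever, here is closed-form ridge regression in NumPy (the many-features, few-samples design and the λ values are assumptions); the penalty shrinks the fitted weights, trading a little bias for less variance:

```python
import numpy as np

rng = np.random.default_rng(4)

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# A variance-prone setup: 30 features, 40 samples, only 3 true signals.
n, d = 40, 30
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + rng.normal(0, 0.5, n)

w_ols = ridge_fit(X, y, lam=1e-8)  # effectively unregularized
w_reg = ridge_fit(X, y, lam=10.0)  # penalized

# Larger lam pulls the weight vector toward zero, damping variance.
shrinkage = np.linalg.norm(w_reg) / np.linalg.norm(w_ols)
```

The shrinkage ratio falls below 1 for any positive λ, and grows stronger as λ increases; the art is choosing λ by cross-validation rather than by vibes.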
Cool nuance: ensembles, bias, and variance
- Bagging (bootstrap aggregating) reduces variance by averaging multiple high-variance models (e.g., many deep trees) — think of averaging many shaky hands to get steadier aim.
- Boosting sequentially reduces bias by focusing on mistakes — it can reduce bias dramatically but sometimes increases variance, so regularization or early stopping is needed.
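The bagging effect fits in a few lines of NumPy, with 1-nearest-neighbour regression as a deliberately high-variance base model (the synthetic data and ensemble size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(-1, 1, 250)
y = np.sin(3 * x) + rng.normal(0, 0.3, 250)
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

def one_nn(xs, ys, x_eval):
    """1-NN regression: copy the label of the closest training point."""
    idx = np.abs(x_eval[:, None] - xs[None, :]).argmin(axis=1)
    return ys[idx]

# A single high-variance model vs a bagged ensemble of 50 of them.
single = one_nn(x_train, y_train, x_test)
preds = []
for _ in range(50):
    idx = rng.integers(0, len(x_train), len(x_train))  # bootstrap resample
    preds.append(one_nn(x_train[idx], y_train[idx], x_test))
bagged = np.mean(preds, axis=0)

mse_single = np.mean((single - y_test) ** 2)
mse_bagged = np.mean((bagged - y_test) ** 2)
```

Averaging the shaky hands: the ensemble's test error typically comes out below the single model's, even though every member of the ensemble is just as jittery as the original.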
Common mistakes and misconceptions
- "More complex model is always better if I have enough data" — not true without regularization; complexity also increases the need for data and compute.
- "Low training error means success" — no. Training error says nothing about variance and hence generalization.
- Thinking of bias and variance as properties of the algorithm only — they depend on algorithm + hypothesis space + data distribution.
Quick diagnostic checklist (when your model misbehaves)
- Plot learning curves. Are training/validation errors converging or diverging?
- If underfitting: make model more expressive, add features, reduce regularization.
- If overfitting: add data, use regularization, prune, or ensemble.
- Use cross-validation to confirm your interventions actually reduce validation error.
Closing: the mindset you want
Bias–variance is less a formula and more an aesthetic decision in modeling. You are sculpting a function from finite data. Too rigid: you miss subtlety. Too flexible: you hallucinate patterns. The goal is not to annihilate bias or variance but to balance them for minimal expected error.
Powerful one-liner: Find the simplest model that is complex enough to capture the signal, and be suspicious of models that look like they could win a debating contest with noise.
Key takeaways:
- Decompose error into bias, variance, and noise to guide fixes.
- Tune complexity, regularization, data quantity, and ensembles as levers.
- Always validate with held-out data or cross-validation.
Next up: we’ve discussed hypothesis spaces before — now we’ll apply these insights to concrete algorithms (linear models, trees, SVMs) and practice picking hyperparameters with cross-validated learning curves. Bring snacks.