Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Underfitting and Overfitting — The Goldilocks Problem of Supervised Learning
"Models are like roommates: too simple, they do nothing; too complicated, they throw wild parties in your dataset. You want the one who cleans the dishes and respects your privacy." — Your slightly dramatic ML TA
Hook: Why this matters (and why your model might secretly be trash)
You trained a model; it got 99% accuracy on the training data and then tanked on new examples. Or maybe it can't even fit the training set well and performs poorly everywhere. Both are signs of poor generalization. In the previous sections we already met two important neighbors of this problem:
- Inputs, Targets, and Hypothesis Space — which taught us that the capacity of the hypothesis space determines what functions the model can represent.
- Bias–Variance Trade-off — which showed that expected error decomposes into bias, variance, and irreducible noise.
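For reference, that decomposition (with \sigma^2 denoting the irreducible noise) reads:

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\mathrm{Bias}[\hat{f}(x)]^2}_{\text{underfitting}} + \underbrace{\mathrm{Var}[\hat{f}(x)]}_{\text{overfitting}} + \sigma^2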
Underfitting and overfitting are the everyday faces of those theoretical ideas. Let's make them uncomfortably practical.
Definitions (crisp, like a scalpel)
Underfitting: The model is too simple for the underlying pattern in the data. It can't achieve low training error. This is high bias. Think: a linear model trying to fit a spiral.
Overfitting: The model is too flexible and learns noise or idiosyncrasies in the training set. Training error is low, but validation/test error is high. This is high variance. Think: a 50th-degree polynomial on 20 points.
Under- vs. over-fitting is basically a struggle between "I won't learn enough" and "I'll learn everything, including the weird stuff." Balance is the goal.
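To see both failure modes side by side, here's a minimal sketch using polynomial degree as the capacity knob; the cubic ground truth, noise level, and sample sizes are illustrative assumptions:

```python
# Minimal sketch: polynomial degree as the capacity knob (illustrative setup).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))                              # small training set
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=2.0, size=30)   # noisy cubic truth
X_val = rng.uniform(-3, 3, size=(200, 1))
y_val = X_val[:, 0] ** 3 - 2 * X_val[:, 0] + rng.normal(scale=2.0, size=200)

for degree in (1, 3, 15):   # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    print(f"degree {degree:2d}: "
          f"train MSE={mean_squared_error(y, model.predict(X)):6.2f}, "
          f"val MSE={mean_squared_error(y_val, model.predict(X_val)):6.2f}")
```

Typically, degree 1 is bad on both sets (underfit), degree 3 lands close on both, and degree 15 looks great on training while doing worse on validation (overfit).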
Diagnostics: How to tell what’s going wrong
Look at training vs validation/test error. Patterns tell stories (a tiny triage helper in code follows this list):
- Training error high, validation error ≈ training error → Underfitting.
- Training error low, validation error high → Overfitting.
- Both low → Success.
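That triage is simple enough to codify; in this tiny helper the thresholds are illustrative assumptions, not universal constants:

```python
def diagnose(train_err, val_err, high=0.10, gap=0.05):
    """Rough triage from train/validation error (thresholds are illustrative)."""
    if train_err > high:            # can't even fit the training set
        return "underfitting (high bias)"
    if val_err - train_err > gap:   # fits training, doesn't transfer
        return "overfitting (high variance)"
    return "looks healthy: both errors low and close"

print(diagnose(train_err=0.02, val_err=0.30))   # -> overfitting (high variance)
```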
Learning curves (visual rule-of-thumb)
Plot error vs number of training examples for both training and validation sets.
Typical shapes:
- Underfitting: both errors high and converge.
- Overfitting: training error low, validation error high; validation error often decreases as more data is added, because more data reduces variance.
ASCII sketch:

```
Error
  |
  |  Underfit:  validation ----\______
  |             training   ----/         <- both high, converge quickly
  |
  |  Overfit:   validation \
  |                         \_________   <- high, falls as data grows
  |             training   ___________   <- low from the start
  +------------------------------------> Training set size
```
Causes (the rogues' gallery)
- Hypothesis space capacity too small (e.g., linear model for nonlinear reality) → Underfit.
- Hypothesis space capacity too large without constraints (e.g., deep trees, high-degree polynomials) → Overfit.
- Too few training examples → makes complex models overfit easily.
- Noisy labels or features → amplifies overfitting risk.
- Poor feature engineering: irrelevant features add variance; missing important features adds bias.
Remember: capacity comes from model architecture, feature transformations, and hyperparameters (e.g., tree depth, number of neurons, polynomial degree).
Fixes: From blunt instruments to surgical strikes
For underfitting (increase flexibility / reduce bias)
- Use a richer hypothesis space: increase polynomial degree, add layers/neurons, or use a more expressive model.
- Add relevant features / do feature engineering.
- Reduce regularization (lower λ); see the sketch after this list.
- Train longer if optimization isn't converged.
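As a quick illustration of the regularization knob from the list (scikit-learn's alpha plays the role of λ; the data and values here are illustrative):

```python
# Minimal sketch: heavy regularization can cause underfitting (illustrative values).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
for alpha in (100.0, 1.0, 0.01):   # heavy -> light regularization
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>6}: mean CV R^2 = {score:.3f}")
```

On data like this, the heavily regularized model typically scores worst: it has been constrained into underfitting.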
For overfitting (decrease variance / add constraints)
- Regularization: L2 (Ridge), L1 (Lasso); both penalize large weights (sketched in code below).
- Ridge objective (MSE + L2 penalty):
  J(w) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \|w\|_2^2
- Lasso uses the \lambda \|w\|_1 penalty instead and can produce sparse solutions (feature selection).
- Get more data (if possible); often the cleanest solution.
- Reduce model capacity: prune trees, reduce polynomial degree, decrease network size.
- Early stopping on validation error during training (common in neural nets).
- Use dropout, batch normalization, or other model-specific techniques.
- Ensemble methods (bagging reduces variance; boosting trades bias/variance differently).
Pro tip: Regularization is basically telling the model: "Less drama, please." It trades a bit of training fit for more real-world sanity.
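Here's the Ridge/Lasso sketch promised in the list above; the synthetic data is an illustrative assumption, and scikit-learn's alpha stands in for λ:

```python
# Minimal sketch: Ridge shrinks weights, Lasso zeroes some out (illustrative data).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)         # only 2 informative features
y = X @ w_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                 # shrinks all weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)                 # drives some weights to exactly zero
print("Ridge coefs:", np.round(ridge.coef_, 2))
print("Lasso coefs:", np.round(lasso.coef_, 2))    # expect sparsity in the last 8
```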
Concrete examples (so it stops being abstract)
Linear regression vs polynomial regression: If the true relationship is quadratic and you use linear, you'll underfit. Use a high-degree polynomial and you'll overfit to noise.
Decision Trees: Small max_depth → underfit. Huge max_depth → overfit (memorizes training points in its leaves). Random Forest (bagging) reduces variance. See the sketch after these examples.
Neural Networks: Tiny network → underfit. Massive network with no regularization and little data → overfit.
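And the tree sketch referenced above; the dataset, depths, and forest size are illustrative choices:

```python
# Minimal sketch: tree depth as a capacity knob, plus bagging (illustrative setup).
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [
    ("stump (underfit)", DecisionTreeClassifier(max_depth=1)),
    ("deep tree (overfit)", DecisionTreeClassifier(max_depth=None, random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    print(f"{name:22s} train={model.score(X_tr, y_tr):.2f}  test={model.score(X_te, y_te):.2f}")
```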
Practical checklist for model debugging
- Plot train and validation errors. Which pattern matches?
- If underfitting: increase complexity or features; check optimization.
- If overfitting: add regularization, collect more data, or reduce complexity.
- Use cross-validation to pick hyperparameters (e.g., λ, tree depth); a sketch follows this checklist.
- Examine residuals: structured residuals point to bias (a missed pattern); unstructured residuals suggest the remaining error is mostly noise, whether label noise or irreducible error.
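Here's the cross-validation sketch promised in the checklist; the data and the alpha grid are illustrative assumptions:

```python
# Minimal sketch: pick a regularization strength by cross-validation (illustrative).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
param_grid = {"alpha": np.logspace(-3, 3, 13)}   # candidate lambda values
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])
```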
Quick scikit-learn snippet for diagnosing via learning curves (assumes model, X, y from your own pipeline)

```python
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, scoring="neg_mean_squared_error")
train_err, val_err = -train_scores.mean(axis=1), -val_scores.mean(axis=1)
# plot train_err and val_err vs train_sizes; compare with the sketch above
```
A short table: Underfit vs Overfit quick reference
| Diagnosis | Training Error | Validation Error | Typical Fixes |
|---|---|---|---|
| Underfitting | High | High (similar) | Increase model capacity, add features, reduce regularization |
| Overfitting | Low | High | More data, stronger regularization, reduce capacity, ensembling |
Closing: The philosophical touchdown
Balancing underfitting and overfitting is the practical side of the Bias–Variance Trade-off and a direct consequence of the hypothesis space you picked earlier. Your model should be just expressive enough to capture the signal, but not so flexible that it learns the dataset's mood swings.
Final thought: don't trust a single number. Look at learning curves, validation behavior, and remember Occam's Razor — simpler models often win in the real world. Now go forth and make something that generalizes, not something that performs a circus for your training set.
"Generalization: it's less about being right about past data and more about behaving well with strangers." — That one wise dataset
Summary of key takeaways
- Underfitting = high bias; overfitting = high variance.
- Use learning curves and train/validation error patterns to diagnose.
- Fix underfitting by increasing capacity or features; fix overfitting by regularization, more data, or simpler models.
- Always validate hyperparameters with cross-validation; never let the test set babysit your model selection.