Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Bias–Variance Trade-off
The Bias–Variance Trade-off: Why Your Model Is Either Too Boring or Too Dramatic
You already know about inputs, targets, and the hypothesis space — congratulations, you have the toolbox. Now let’s decide whether we build a sensible chair or a Rube Goldberg contraption of a chair that collapses three days later.
Hook: The Tale of Two Models
Imagine two models predicting house prices from the same inputs. Model A always predicts the mean price. Model B fits every speck of noise in the training data — outliers, typos, ghosts of agents past. Model A is boring but steady. Model B is impressively specific and catastrophically wrong on new houses.
This is the bias–variance trade-off in a nutshell: simplicity vs flexibility, stability vs adaptability. We balance them to minimize error on new, unseen data — which is the whole point of supervised learning.
What is the bias–variance trade-off? (Short answer)
- Bias measures errors from erroneous assumptions in the learning algorithm. High bias => underfitting.
- Variance measures how much the model fluctuates for different training sets. High variance => overfitting.
- Irreducible noise is the part of the target variability you simply cannot predict from inputs (measurement error, hidden variables).
Mathematically (for squared error):
E[(y − f̂(x))^2] = (Bias[f̂(x)])^2 + Var[f̂(x)] + Noise
This decomposition is your north star when selecting models, hyperparameters, or regularization.
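The decomposition can be checked by simulation. The sketch below (plain NumPy; the sin(3x) ground truth, noise level, and polynomial degrees are all illustrative assumptions) draws many training sets, fits a rigid and a flexible model to each, and estimates bias² and variance of the prediction at a single query point:

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(3 * x)   # assumed ground truth, for illustration only
x_query = 0.5                      # the point where we inspect predictions
noise_sd = 0.3

def fit_predict(degree, n=30):
    """Draw one fresh training set, fit a polynomial, predict at x_query."""
    x = rng.uniform(-1, 1, n)
    y = f_true(x) + rng.normal(0, noise_sd, n)
    return np.polyval(np.polyfit(x, y, degree), x_query)

def bias2_and_variance(degree, trials=500):
    """Monte Carlo estimate of Bias^2 and Var of the prediction at x_query."""
    preds = np.array([fit_predict(degree) for _ in range(trials)])
    bias2 = (preds.mean() - f_true(x_query)) ** 2
    return bias2, preds.var()

b_lin, v_lin = bias2_and_variance(degree=1)    # rigid model: underfits
b_flex, v_flex = bias2_and_variance(degree=9)  # flexible model
```

With this setup the degree-1 fit shows the larger squared bias and the degree-9 fit the larger variance, mirroring the three terms of the decomposition (the noise term is fixed at noise_sd²).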
Why this matters (connecting to what you already know)
You’ve seen the hypothesis space idea earlier: the family of functions your learning algorithm can pick from. A tiny hypothesis space (e.g., linear functions) tends to have high bias. A gigantic hypothesis space (e.g., very deep neural networks, high-degree polynomials) tends to have high variance unless tamed.
Also remember the difference between supervised, unsupervised, and reinforcement learning: in supervised learning we care about generalizing from labeled examples. Bias and variance are all about generalization error — exactly the metric that separates supervised learning from, say, clustering weirdness.
Visual metaphors and intuition (because pictures are cheating in a good way)
- Think of bias as a systematic error: a miscalibrated ruler that always subtracts 5 cm. No matter how many measurements you take, the error remains.
- Think of variance as the shakiness of your hand. Each time you measure, the reading hops around. Average many shaky measurements and you might be close — but any one measurement can be all over the place.
Imagine throwing darts at a target:
- High bias, low variance: all darts cluster tightly, but far from the bullseye.
- Low bias, high variance: darts scatter around the bullseye — some hit it, many miss.
- Low bias, low variance: a tight, centered cluster — the dream.
Concrete examples
- Polynomial regression on a nonlinear trend
- Degree 1 (linear): high bias, low variance — underfits.
- Degree 15: low training error, high variance — overfits.
- k-Nearest Neighbors
- k large: smoother predictions, higher bias, lower variance.
- k = 1: model memorizes training points, very low bias but huge variance.
- Decision trees
- Very deep tree: low bias on training set, super high variance.
- Pruned shallow tree: higher bias, lower variance.
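These knobs are easy to turn numerically. Here is a minimal k-NN regression sketch in plain NumPy (the synthetic sine data is an assumption), contrasting k = 1 with a moderate k:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic noisy samples of a smooth trend (assumed for illustration).
x = rng.uniform(-1, 1, 300)
y = np.sin(3 * x) + rng.normal(0, 0.3, 300)
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]

def knn_predict(k, x_eval):
    """Plain k-NN regression: average the labels of the k closest points."""
    dist = np.abs(x_eval[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def train_test_mse(k):
    train = np.mean((knn_predict(k, x_train) - y_train) ** 2)
    test = np.mean((knn_predict(k, x_test) - y_test) ** 2)
    return train, test

train_1, test_1 = train_test_mse(k=1)     # memorizes the training set
train_10, test_10 = train_test_mse(k=10)  # smoother, slightly more biased
```

k = 1 interpolates its own training points, so its training error is exactly zero, yet on held-out data it typically loses to a moderate k: textbook high variance.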
Table: quick cheat sheet
| Model complexity | Typical bias | Typical variance | Concrete example |
|---|---|---|---|
| Low complexity | High | Low | Linear regression on complex curvy data |
| Medium | Moderate | Moderate | Regularized regression, pruned tree |
| High complexity | Low | High | Deep tree, high-degree polynomial |
How to measure and act (practical recipes)
Plot learning curves (training vs validation error as function of training size or complexity). They tell you whether you’re underfitting or overfitting.
- If both training and validation error are high and close: increase model capacity (reduce bias).
- If training error is low but validation error is high: reduce variance via regularization, more data, or ensembling.
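A numeric stand-in for the learning-curve plot (NumPy only; the synthetic task and the degree-15 model are assumptions): fit a flexible model at growing training sizes and compare training and validation error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data; the last 100 points serve as a fixed validation set.
x = rng.uniform(-1, 1, 500)
y = np.sin(3 * x) + rng.normal(0, 0.3, 500)
x_val, y_val = x[400:], y[400:]

def learning_curve(degree, sizes=(20, 50, 100, 200, 400)):
    """Rows of (n, training MSE, validation MSE) for each training size."""
    rows = []
    for n in sizes:
        coeffs = np.polyfit(x[:n], y[:n], degree)
        train = np.mean((np.polyval(coeffs, x[:n]) - y[:n]) ** 2)
        val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        rows.append((n, train, val))
    return rows

curve = learning_curve(degree=15)
```

As n grows, training error creeps up while the train/validation gap shrinks: the signature of variance being tamed by sample size.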
Cross-validation: your empirical oracle for estimating generalization. Use it to tune complexity.
Pseudocode: simple grid search with CV
for each hyperparameter value h in grid:
    for each of the k folds:
        train model M_h on the remaining k−1 folds
        record M_h's error on the held-out fold
    average the k fold errors to get CV(h)
select h* with the smallest CV(h*)
retrain M_h* on the full training data
evaluate once on the untouched test set
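The recipe translates almost line for line into code. A sketch in plain NumPy, using polynomial degree as the hyperparameter (the synthetic task and the grid are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression task (an assumption for illustration).
x = rng.uniform(-1, 1, 120)
y = np.sin(3 * x) + rng.normal(0, 0.3, 120)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

def cv_error(degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    folds = np.array_split(np.arange(len(x_train)), k)
    errs = []
    for val_idx in folds:
        fit_idx = np.setdiff1d(np.arange(len(x_train)), val_idx)
        coeffs = np.polyfit(x_train[fit_idx], y_train[fit_idx], degree)
        pred = np.polyval(coeffs, x_train[val_idx])
        errs.append(np.mean((pred - y_train[val_idx]) ** 2))
    return float(np.mean(errs))

grid = [1, 2, 3, 5, 9, 15]
best_degree = min(grid, key=cv_error)              # h with smallest avg CV error
final = np.polyfit(x_train, y_train, best_degree)  # retrain on full training data
test_mse = np.mean((np.polyval(final, x_test) - y_test) ** 2)
```

Note that the test set is touched exactly once, at the very end; reusing it during tuning would quietly turn it into a second validation set.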
Ways to reduce bias or variance (and the trade-offs)
To reduce bias (combat underfitting):
- Increase model complexity (richer hypothesis space)
- Add more informative features or interactions
- Reduce regularization strength
To reduce variance (combat overfitting):
- Add regularization (Ridge, Lasso) — penalize large weights
- Gather more data (often the most reliable way to curb variance)
- Use ensembling (bagging reduces variance; boosting reduces bias)
- Simplify the model (prune trees, reduce degree)
Note: Some techniques help both sides in practice. Feature engineering can reduce bias and variance by making patterns more learnable.
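As one concrete regularization lever, here is closed-form ridge regression in NumPy (the many-features, few-samples design and the λ values are assumptions); the penalty shrinks the fitted weights, trading a little bias for less variance:

```python
import numpy as np

rng = np.random.default_rng(4)

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# A variance-prone setup: 30 features, 40 samples, only 3 true signals.
n, d = 40, 30
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + rng.normal(0, 0.5, n)

w_ols = ridge_fit(X, y, lam=1e-8)  # effectively unregularized
w_reg = ridge_fit(X, y, lam=10.0)  # penalized

# Larger lam pulls the weight vector toward zero, damping variance.
shrinkage = np.linalg.norm(w_reg) / np.linalg.norm(w_ols)
```

The shrinkage ratio falls below 1 for any positive λ, and grows stronger as λ increases; the art is choosing λ by cross-validation rather than by vibes.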
Cool nuance: ensembles, bias, and variance
- Bagging (bootstrap aggregating) reduces variance by averaging multiple high-variance models (e.g., many deep trees) — think of averaging many shaky hands to get steadier aim.
- Boosting sequentially reduces bias by focusing on mistakes — it can reduce bias dramatically but sometimes increases variance, so regularization or early stopping is needed.
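The bagging effect fits in a few lines of NumPy, with 1-nearest-neighbour regression as a deliberately high-variance base model (the synthetic data and ensemble size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(-1, 1, 250)
y = np.sin(3 * x) + rng.normal(0, 0.3, 250)
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

def one_nn(xs, ys, x_eval):
    """1-NN regression: copy the label of the closest training point."""
    idx = np.abs(x_eval[:, None] - xs[None, :]).argmin(axis=1)
    return ys[idx]

# A single high-variance model vs a bagged ensemble of 50 of them.
single = one_nn(x_train, y_train, x_test)
preds = []
for _ in range(50):
    idx = rng.integers(0, len(x_train), len(x_train))  # bootstrap resample
    preds.append(one_nn(x_train[idx], y_train[idx], x_test))
bagged = np.mean(preds, axis=0)

mse_single = np.mean((single - y_test) ** 2)
mse_bagged = np.mean((bagged - y_test) ** 2)
```

Averaging the shaky hands: the ensemble's test error typically comes out below the single model's, even though every member of the ensemble is just as jittery as the original.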
Common mistakes and misconceptions
- "More complex model is always better if I have enough data" — not true without regularization; complexity also increases the need for data and compute.
- "Low training error means success" — no. Training error says nothing about variance and hence generalization.
- Thinking of bias and variance as properties of the algorithm only — they depend on algorithm + hypothesis space + data distribution.
Quick diagnostic checklist (when your model misbehaves)
- Plot learning curves. Are training/validation errors converging or diverging?
- If underfitting: make model more expressive, add features, reduce regularization.
- If overfitting: add data, use regularization, prune, or ensemble.
- Use cross-validation to confirm your interventions actually reduce validation error.
Closing: the mindset you want
Bias–variance is less a formula and more an aesthetic decision in modeling. You are sculpting a function from finite data. Too rigid: you miss subtlety. Too flexible: you hallucinate patterns. The goal is not to annihilate bias or variance but to balance them for minimal expected error.
Powerful one-liner: Find the simplest model that is complex enough to capture the signal, and be suspicious of models that look like they could win a debating contest with noise.
Key takeaways:
- Decompose error into bias, variance, and noise to guide fixes.
- Tune complexity, regularization, data quantity, and ensembles as levers.
- Always validate with held-out data or cross-validation.
Next up: we’ve discussed hypothesis spaces before — now we’ll apply these insights to concrete algorithms (linear models, trees, SVMs) and practice picking hyperparameters with cross-validated learning curves. Bring snacks.