Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Empirical Risk Minimization — The Empirical Hustle (But Make It Principled)
"Don't trust the training set — it's like trusting your ex: compelling, biased, and not the whole story."
You already met the troublemakers: underfitting and overfitting (we saw how too-simple models miss patterns and too-complex ones hallucinate them), and you already wrestled with the bias–variance trade-off (the eternal tug-of-war between systematic error and noisy sensitivity). Empirical Risk Minimization (ERM) is the core principle that sits under those phenomena. It’s both the simplest idea in supervised learning and the reason we need complexity control, regularization, and cross-validation — all the things that save us from beautiful but useless models.
What is ERM, in plain-ish English
- Risk (true risk): the expected loss of a predictor f over the unknown data distribution P(X, Y). Symbolically,
R(f) = E_{(X,Y)~P}[L(Y, f(X))]
- Empirical risk: what we can actually compute from our finite training set of n examples:
R_n(f) = (1/n) * sum_{i=1..n} L(y_i, f(x_i))
ERM says: pick the function f in your hypothesis class F that minimizes empirical risk R_n(f). That’s it. No ceremony. No occult rituals. Just minimize the average loss on the data you have.
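To make this concrete, here's a minimal sketch of computing R_n(f) with NumPy, assuming squared loss and a toy one-dimensional predictor (the data and the predictor are illustrative, not from any particular dataset):

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda y_true, y_pred: (y_true - y_pred) ** 2):
    """Average loss of predictor f over the n training examples."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

# Toy data: y = 2x, and a predictor that is off by a constant 0.1
X = np.array([0.0, 1.0, 2.0])
y = 2 * X
f = lambda x: 2 * x + 0.1

# Squared error is (0.1)^2 = 0.01 on every example, so the mean is ~0.01
print(empirical_risk(f, X, y))
```

ERM would then be the search over f in F for the function making this number smallest.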
Bold claim: ERM is the algorithmic manifestation of ‘learn from examples’. But it’s blunt — it will happily overfit if you let it.
Why ERM alone is both brilliant and dangerous
- Brilliant: It’s computationally straightforward (reduce to optimization), conceptually simple, and often works when your hypothesis class is appropriate.
- Dangerous: Minimizing R_n can chase noise. If F is too large/expressive, the minimizer of R_n may be a perfect fit to training labels but terrible out-of-sample. That’s exactly where overfitting comes from — remember Position 4 in Foundations: Underfitting and Overfitting.
Quick mental image: ERM is like memorizing your study notes word-for-word because they nail every practice question in front of you. But the exam (the true distribution) asks slightly different questions.
How ERM links to bias and variance (Position 3 revisit)
- If F is tiny (low capacity), ERM yields a high-bias but low-variance predictor. It can't fit complex patterns — underfitting.
- If F is huge (high capacity), ERM can reduce bias (can fit training data tightly) but variance explodes — predictions wobble wildly with new samples.
So ERM is the stage; bias and variance are the actors. We control the play by choosing F or adding regularization.
Practical fixes: Penalized ERM and Structural Risk Minimization
Two siblings of ERM that keep it honest:
- Penalized ERM (aka regularization):
min_f R_n(f) + λ · Ω(f)
Ω(f) is a complexity penalty (e.g., ||w||^2 for linear models). λ tunes the bias–variance trade-off: bigger λ → simpler model (more bias, less variance).
- Structural Risk Minimization (SRM): Arrange hypothesis classes F_1 ⊂ F_2 ⊂ ... and choose the class that minimizes a bound on true risk (Vapnik's idea). Equivalent intuition: constrain capacity first, then do ERM.
Both approaches inject prior preference for simpler functions — because simpler rules generalize better unless data says otherwise.
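As a sketch of penalized ERM for a linear model with squared loss and Ω(w) = ||w||² (i.e. ridge regression), the minimizer has a closed form, and a larger λ visibly shrinks the weights (the data here is synthetic, generated just for the demo):

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Penalized ERM for a linear model with squared loss and Ω(w) = ||w||^2:
    minimizes (1/n)||Xw - y||^2 + lam * ||w||^2; setting the gradient to zero
    gives (X^T X / n + lam * I) w = X^T y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w_plain = ridge_erm(X, y, lam=0.0)   # plain ERM (ordinary least squares)
w_ridge = ridge_erm(X, y, lam=10.0)  # heavy penalty -> smaller weights
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The shrunken weights are exactly the "more bias, less variance" end of the trade-off that λ tunes.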
When does ERM actually work? (A bit of theory — no PhD required)
ERM is consistent if the empirical risk converges uniformly to the true risk over the hypothesis class F. In plain terms:
- If sup_{f in F} |R_n(f) - R(f)| → 0 as n → ∞ (uniform convergence), then the ERM minimizer approaches the best-in-class true-risk minimizer.
Tools that quantify this convergence: VC dimension, Rademacher complexity, covering numbers. Short story: smaller complexity → faster convergence → ERM is safer.
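For a single fixed f, the law of large numbers already gives R_n(f) → R(f); uniform convergence strengthens this to hold over all of F simultaneously. A toy Monte Carlo illustration of the single-f case (not uniform convergence, just one predictor): take Y ~ N(0, 1), f ≡ 0, and squared loss, so the true risk is exactly E[Y²] = 1, and watch the gap shrink with n:

```python
import numpy as np

rng = np.random.default_rng(42)

def empirical_risk(n):
    """R_n for the constant predictor f(x) = 0 under squared loss,
    with labels drawn from N(0, 1); the true risk R(f) = E[Y^2] = 1."""
    y = rng.normal(size=n)
    return np.mean((y - 0.0) ** 2)

# Gap between empirical and true risk at increasing sample sizes
gaps = {n: abs(empirical_risk(n) - 1.0) for n in (10, 1_000, 100_000)}
print(gaps)
```

The gap at n = 100,000 is tiny; for small n the empirical risk is a noisy estimate, which is precisely why trusting R_n blindly on small samples is risky.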
Surrogate losses, classification, and convexity
For classification, the 0–1 loss is the true target but it's non-convex and hard to optimize. ERM with 0–1 loss is intractable, so we use surrogate losses (hinge loss for SVMs, logistic loss for logistic regression). This is still ERM — just with a different loss that is more optimization-friendly and often offers good generalization guarantees.
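A small sketch comparing the 0–1 loss with two common surrogates, using labels y ∈ {−1, +1} and the margin y·f(x). Note that the hinge loss upper-bounds the 0–1 loss pointwise, while the logistic loss is a smooth convex proxy for it:

```python
import numpy as np

def zero_one(y, score):
    """The true classification target: 1 if the sign is wrong, else 0. Non-convex."""
    return float(y * score <= 0)

def hinge(y, score):
    """SVM surrogate: convex, and >= zero_one at every point."""
    return max(0.0, 1.0 - y * score)

def logistic(y, score):
    """Logistic-regression surrogate: smooth and convex."""
    return np.log(1.0 + np.exp(-y * score))

# (label, score) pairs: confident correct, unconfident wrong, wrong sign
for y, s in [(+1, 2.0), (+1, -0.5), (-1, 0.3)]:
    print(y, s, zero_one(y, s), hinge(y, s), round(logistic(y, s), 3))
```

Minimizing the hinge or logistic loss over a convex class is a tractable convex optimization problem, which is why these surrogates are the workhorses in practice.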
A simple worked example (polynomial regression)
Imagine trying to fit y = sin(x) + noise with polynomials of degree d. ERM on the training set will produce:
- d small → underfit (high bias)
- d large → near-zero training error but terrible on test (high variance)
Regularized ERM with a penalty on polynomial coefficients (ridge) nudges the solution toward smoother polynomials and reduces variance.
Pseudocode sketch:
```
for degree in 0..D:
    fit polynomial of degree `degree` with regularization λ via ERM
    estimate validation error
select the model with the smallest validation error
```
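A runnable version of this sketch in NumPy, using a ridge-penalized polynomial fit (the synthetic data, D = 10, and λ = 1e-3 are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-np.pi, np.pi, 30)
y_train = np.sin(x_train) + 0.2 * rng.normal(size=30)
x_val = rng.uniform(-np.pi, np.pi, 100)
y_val = np.sin(x_val) + 0.2 * rng.normal(size=100)

def fit_ridge_poly(x, y, degree, lam):
    """Ridge-penalized ERM over polynomials of the given degree."""
    X = np.vander(x, degree + 1)  # columns x^degree, ..., x, 1
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def mse(w, x, y):
    """Empirical risk (squared loss) of coefficients w on (x, y)."""
    return np.mean((np.vander(x, len(w)) @ w - y) ** 2)

# Fit each degree on the training set, score on the validation set
results = {}
for degree in range(11):  # degrees 0..D with D = 10
    w = fit_ridge_poly(x_train, y_train, degree, lam=1e-3)
    results[degree] = mse(w, x_val, y_val)

best_degree = min(results, key=results.get)
print(best_degree, results[best_degree])
```

Low degrees underfit the sine and score poorly on validation; the penalty keeps the high-degree fits from exploding, and the validation error picks a degree in between.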
Cross-validation is the practical way to estimate generalization risk when you don’t have infinite data.
A tiny comparison table
| Quantity | Meaning | Danger / Use |
|---|---|---|
| True risk R(f) | Expected loss over P | Gold standard (unknown) |
| Empirical risk R_n(f) | Average loss on training set | What ERM minimizes — can mislead if n small or F huge |
| Generalization gap | R(f) − R_n(f) | We want this small; complexity control helps |
Quick checklist: Using ERM well
- Choose a hypothesis class with appropriate capacity (not too small, not monstrous).
- Use regularization (penalized ERM) to control variance.
- Use surrogate losses when optimization of the true loss is infeasible.
- Validate: cross-validation or holdout sets to estimate generalization.
- Watch learning curves: if training and validation errors both high → increase capacity; if training error low but validation high → add regularization or reduce capacity.
Closing rant/insight
ERM is the workhorse of supervised learning: simple, powerful, and blunt. It’s the rule that says ‘fit what you see’, but generalization is the art of knowing when not to trust what you see. The real craft is balancing the class capacity, the penalty, the loss, and the data. Think of ERM as the baseline recipe — you can cook a decent meal with it, but seasoning (regularization, validation) makes it edible for anyone other than the training set.
Key takeaways:
- ERM minimizes empirical risk; good in principle, risky in practice without capacity control.
- Regularization and SRM are ways to keep ERM from overfitting — this is directly tied to the bias–variance trade-off.
- Uniform convergence/complexity measures tell you when ERM is theoretically safe.
Final thought: If ERM were a person, it would be the friend who repeats what everyone at the party says. Useful for gossip, disastrous if you want original insight. Train it well.