Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Loss Functions Overview — Why your model cries when it’s wrong (and sometimes when it’s right)
"Pick your loss like you pick your battles: strategically, and with the memory of past trauma (outliers)."
You already know Empirical Risk Minimization (ERM): we choose a model that minimizes average loss on the training set. You also learned about underfitting and overfitting — the classic tug-of-war between bias and variance. Loss functions are the battleground. They determine what mistakes matter, how harshly we punish them, and how nicely optimization algorithms behave. This page takes you on a tour of the most common loss functions, why they exist, and what they mean for performance and generalization.
Quick map: what we want from a loss
- Signal: Loss must reflect modelling goals (e.g., penalize confident mistakes in classification).
- Optimizable: Prefer convex and differentiable losses for easier training.
- Robustness: Some losses shrug off outliers; others get dragged around by them.
- Probabilistic meaning: Some losses correspond to maximum likelihood under a noise model.
Think of loss as the referee’s whistle: loud and clear when a foul happens (large error), and consistent so players learn to play better.
Part A — Regression losses (real-valued targets)
1) Squared error / Mean Squared Error (MSE)
- Definition: L(y, ŷ) = (y − ŷ)^2 (or averaged over dataset → MSE)
- Intuition: Penalizes big errors heavily (quadratic). It’s the drama queen of losses.
- Properties: Convex, differentiable, corresponds to Gaussian noise assumption (MLE).
- When to use: Standard regression, when you want to penalize large deviations.
- Caveat: Sensitive to outliers.
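The outlier sensitivity is easy to see numerically. A minimal NumPy sketch (the toy arrays are invented for illustration):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of squared residuals."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0, 100.0])    # last point is a wild outlier
y_hat = np.array([1.1, 1.9, 3.2, 3.0])
# The outlier's residual of 97 contributes 97^2 = 9409 to the sum,
# dwarfing every other term and dominating the average.
loss = mse(y, y_hat)
```

One bad point drags the mean loss into the thousands even though three of four predictions are nearly perfect.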
2) Mean Absolute Error (MAE / L1)
- Definition: L(y, ŷ) = |y − ŷ|
- Intuition: Treats errors linearly. Less dramatic than MSE.
- Properties: Convex, but nondifferentiable at 0 (subgradients exist). Corresponds to Laplace noise (MLE).
- When to use: If you want robustness to outliers or care about median behavior.
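For contrast, the same illustrative toy data under MAE, where the outlier counts linearly rather than quadratically:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y - y_hat))

y = np.array([1.0, 2.0, 3.0, 100.0])    # last point is a wild outlier
y_hat = np.array([1.1, 1.9, 3.2, 3.0])
# The outlier contributes |97| = 97 linearly, not 9409 quadratically.
loss = mae(y, y_hat)
```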
3) Huber loss — best of both worlds
- Definition: With residual r = y − ŷ, L_δ(r) = ½r² if |r| ≤ δ, else δ(|r| − ½δ). Quadratic near zero error, linear past the threshold δ.
- Intuition: Be gentle on small errors (MSE-like), but stop letting huge errors dominate (MAE-like).
- When to use: You want differentiability and robustness.
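A sketch of the piecewise definition (quadratic inside δ, linear outside, with the two branches matched at |r| = δ so the loss stays smooth):

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond."""
    r = y - y_hat
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)   # matches quad at |r| = delta
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))
```

A small residual of 0.5 costs 0.5 · 0.5² = 0.125 (MSE-like); a residual of 3 with δ = 1 costs only 1 · (3 − 0.5) = 2.5 instead of the 4.5 that squared loss would charge.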
4) Quantile loss
- Definition: L_q(y, ŷ) = max(q(y − ŷ), (q − 1)(y − ŷ)) — the "pinball" loss, which penalizes under- and over-prediction asymmetrically.
- Used when you care about predicting a quantile (e.g., 90th-percentile demand). Useful for heteroskedastic noise.
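A minimal sketch of the pinball loss: for q = 0.9, under-predicting by one unit costs 0.9, while over-predicting by one unit costs only 0.1, which pushes the fitted value toward the 90th percentile:

```python
import numpy as np

def quantile_loss(y, y_hat, q=0.9):
    """Pinball loss: asymmetric penalty targeting the q-th quantile."""
    r = y - y_hat
    # q * r when under-predicting (r > 0), (q - 1) * r when over-predicting
    return np.mean(np.maximum(q * r, (q - 1) * r))
```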
Part B — Classification losses (discrete labels)
1) 0–1 loss (the ground truth, but impractical)
- Definition: L(y, ŷ) = 1 if misclassified, else 0.
- Intuition: Exactly what we evaluate at test time for accuracy, but it’s discontinuous and nonconvex → terrible for optimization.
- Use: Conceptual; not used for gradient-based training.
2) Logistic loss / Cross-Entropy
- Definition (binary): L = −[y log p + (1−y) log(1−p)] where p = σ(f(x))
- Intuition: Punishes confident but wrong predictions heavily.
- Properties: Convex in the linear predictor for binary logistic, differentiable, probabilistic (MLE under Bernoulli).
- When to use: Default for probabilistic classifiers and neural nets for classification.
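A sketch of the binary case, following the definition above (p = σ(f(x)); the clipping is a standard numerical guard against log(0)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, scores, eps=1e-12):
    """Binary cross-entropy on raw scores f(x); p = sigmoid(f)."""
    p = np.clip(sigmoid(scores), eps, 1 - eps)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

At a score of 0 (p = 0.5) the loss is log 2 ≈ 0.693; a confidently wrong score (say y = 1 but f = −5) costs far more, which is exactly the "punishes confident mistakes" behavior.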
3) Hinge loss (SVM family)
- Definition: L(y, f) = max(0, 1 − y f)
- Intuition: Wants not only correct classification but a margin of confidence.
- Properties: Convex, but not differentiable at the hinge; encourages large margins.
- When to use: Support vector machines and margin-based learning.
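The margin behavior in code (a minimal sketch; labels are the conventional ±1):

```python
import numpy as np

def hinge(y, f):
    """Hinge loss; labels y in {-1, +1}, f is the raw score."""
    return np.mean(np.maximum(0.0, 1.0 - y * f))
```

A correct prediction with margin y·f ≥ 1 costs exactly zero; a correct-but-timid prediction (0 < y·f < 1) still pays, which is how the loss demands confidence, not just correctness.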
4) Softmax + Categorical Cross-Entropy
- Definition: Multi-class generalization of logistic loss. Softmax converts scores to probabilities; cross-entropy compares to one-hot labels.
- When to use: Standard in multi-class classification.
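A sketch of the multi-class case (the max-subtraction is the standard trick to keep the exponentials from overflowing; it doesn't change the result):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(labels, logits):
    """labels: integer class indices; logits: raw scores per class."""
    p = softmax(logits)
    n = len(labels)
    # pick out the predicted probability of each true class
    return -np.mean(np.log(p[np.arange(n), labels]))
```

With K classes and all-zero logits (a maximally uninformed model), the loss is log K — a handy sanity check when debugging a classifier at initialization.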
5) Focal loss (for class imbalance)
- Why: Down-weights well-classified examples so the model focuses on hard, minority-class examples.
- When to use: Highly imbalanced datasets (e.g., rare object detection).
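A sketch of the binary form: the (1 − p_t)^γ factor multiplies plain cross-entropy, so easy examples (p_t near 1) are scaled toward zero while hard ones keep most of their weight (γ = 2 is the commonly used default):

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)        # probability of the true class
    return np.mean(-((1 - p_t) ** gamma) * np.log(p_t))
```

An easy example (p_t = 0.9) gets its cross-entropy multiplied by 0.01, while a hard one (p_t = 0.1) keeps 81% of it — the model's gradient budget shifts to the hard cases.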
Quick table: property cheat-sheet
| Loss | Convex? | Differentiable? | Robust to outliers? | Probabilistic meaning |
|---|---|---|---|---|
| MSE | Yes | Yes | No | Gaussian noise (MLE) |
| MAE | Yes | Subgradients | Yes | Laplace noise (MLE) |
| Huber | Yes | Yes | Moderately | — |
| 0–1 | No | No | — | — |
| Logistic / Cross-Entropy | Often (w.r.t. scores) | Yes | Not robust to label noise | Bernoulli / Categorical MLE |
| Hinge | Yes | Subgradient | No | Margin-based view |
How loss choice links to ERM and over/underfitting
Remember ERM: empirical risk = average loss on training data. When you switch losses, you change the objective landscape — that affects model capacity and the kind of errors the optimizer prioritizes.
- An outlier-sensitive loss (MSE) can drive the model to overfit to those outliers, increasing variance.
- A robust loss (MAE, Huber) can reduce sensitivity to noisy points, which may improve generalization in messy real-world data.
- Strong margin losses (hinge) implicitly regularize by demanding confident separation.
Choosing a loss is part of the modeler’s toolkit for balancing bias and variance. If you obsess only about model class (polynomial degree, tree depth) but ignore loss, you’re missing half the picture.
Optimization & practicality
- Convex + smooth losses → easier guarantees, convex solvers or stable gradient descent.
- Nonconvex losses (or nonconvex models like deep nets) rely on optimization heuristics; the shape of the loss still matters for convergence speed.
Pseudocode: one gradient step for parameter θ under loss L
# gradient descent step
θ ← θ − η * (1/N) * Σ_i ∇_θ L(y_i, f(x_i; θ))
If ∇_θ L is large from an outlier (e.g., squared loss), that single point can hijack updates.
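To make the hijacking concrete, here is the pseudocode above instantiated for a 1-D linear model f(x; θ) = θx under squared loss (a minimal sketch with invented toy numbers):

```python
import numpy as np

def grad_step_mse(theta, x, y, eta=0.01):
    """One gradient-descent step on mean squared error for f(x) = theta * x."""
    residuals = theta * x - y
    grad = np.mean(2 * residuals * x)   # d/d_theta of mean squared error
    return theta - eta * grad

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 100.0])    # one wild outlier
theta = grad_step_mse(0.0, x, y)
# The outlier's gradient term 2*(0 - 100)*4 = -800 dominates the average
# (-207), so this single point dictates the direction and size of the step.
```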
Practical heuristics & checklist
- If your dataset has clear outliers or heavy tails, try MAE or Huber.
- For classification problems where you want probabilities, use cross-entropy.
- If classes are heavily imbalanced, consider focal loss or class-weighted cross-entropy.
- Want margins and interpretability? Hinge loss/SVMs are useful.
- Always monitor not just training loss but validation performance related to your real metric (accuracy, F1, MAE, etc.).
Closing — TL;DR and a moral
- Loss functions encode priorities: they tell the model what mistakes are sins and what are misdemeanors.
- They interact with ERM and regularization: choosing a loss is as important as choosing model complexity for avoiding under/overfitting.
- Optimization reality: prefer differentiable, well-behaved losses when you rely on gradient-based training, but don’t shy from hybrid losses (Huber) when data is messy.
Parting thought: if your model were a student, the loss is the syllabus. Make sure it grades what actually matters. Messy syllabus → confused student → weird exam performance.
Next up: pick one of these losses and we’ll practice — compare training curves, inspect gradients, and watch how outliers either get bullied or coddled. Ready to pick favorites and fight over them like academic roommates?