© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification

Foundations of Supervised Learning


Core concepts, goals, trade-offs, and terminology that underpin regression and classification.


Loss Functions Overview — Why your model cries when it’s wrong (and sometimes when it’s right)

"Pick your loss like you pick your battles: strategically, and with the memory of past trauma (outliers)."

You already know Empirical Risk Minimization (ERM): we choose a model that minimizes average loss on the training set. You also learned about underfitting and overfitting — the classic tug-of-war between bias and variance. Loss functions are the battleground. They determine what mistakes matter, how harshly we punish them, and how nicely optimization algorithms behave. This page takes you on a tour of the most common loss functions, why they exist, and what they mean for performance and generalization.


Quick map: what we want from a loss

  • Signal: Loss must reflect modeling goals (e.g., penalize confident mistakes in classification).
  • Optimizable: Prefer convex and differentiable losses for easier training.
  • Robustness: Some losses shrug off outliers; others get torqued by them.
  • Probabilistic meaning: Some losses correspond to maximum likelihood under a noise model.

Think of loss as the referee’s whistle: loud and clear when a foul happens (large error), and consistent so players learn to play better.


Part A — Regression losses (real-valued targets)

1) Squared error / Mean Squared Error (MSE)

  • Definition: L(y, ŷ) = (y − ŷ)^2 (averaged over the dataset gives MSE)
  • Intuition: Penalizes big errors heavily (quadratic). It’s the drama queen of losses.
  • Properties: Convex, differentiable, corresponds to Gaussian noise assumption (MLE).
  • When to use: Standard regression, when you want to penalize large deviations.
  • Caveat: Sensitive to outliers.

2) Mean Absolute Error (MAE / L1)

  • Definition: L(y, ŷ) = |y − ŷ|
  • Intuition: Treats errors linearly. Less dramatic than MSE.
  • Properties: Convex, but nondifferentiable at 0 (subgradients exist). Corresponds to Laplace noise (MLE).
  • When to use: If you want robustness to outliers or care about median behavior.

3) Huber loss — best of both worlds

  • Definition: Quadratic near zero error, linear past a threshold δ.
  • Intuition: Be gentle on small errors (MSE-like), but stop letting huge errors dominate (MAE-like).
  • When to use: You want differentiability and robustness.

4) Quantile loss

  • Definition: an asymmetric absolute error (also called the "pinball" loss) that penalizes over- and under-prediction differently.
  • When to use: when you care about predicting a quantile (e.g., 90th percentile demand). Useful for heteroskedastic noise.
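The regression losses above can be sketched in a few lines of NumPy. This is a minimal illustration (function names and the toy data are my own, not from any library), built to show how one outlier dominates MSE but barely moves MAE or Huber:

```python
import numpy as np

def mse(y, yhat):
    # quadratic penalty: large residuals dominate
    return np.mean((y - yhat) ** 2)

def mae(y, yhat):
    # linear penalty: robust to outliers
    return np.mean(np.abs(y - yhat))

def huber(y, yhat, delta=1.0):
    # quadratic for |residual| <= delta, linear beyond it
    r = np.abs(y - yhat)
    quad = 0.5 * r ** 2
    lin = delta * (r - 0.5 * delta)
    return np.mean(np.where(r <= delta, quad, lin))

def quantile_loss(y, yhat, q=0.9):
    # pinball loss: under-prediction costs q, over-prediction costs (1 - q)
    r = y - yhat
    return np.mean(np.maximum(q * r, (q - 1) * r))

y    = np.array([1.0, 2.0, 3.0, 100.0])   # last point is an outlier
yhat = np.array([1.1, 1.9, 3.2, 3.0])
```

On this data, the single outlier inflates MSE into the thousands, while MAE and Huber stay around 24: the robustness claims in the bullets, made concrete.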

Part B — Classification losses (discrete labels)

1) 0–1 loss (the ground truth, but impractical)

  • Definition: L(y, ŷ) = 1 if misclassified, else 0.
  • Intuition: Exactly what we evaluate at test time for accuracy, but it’s discontinuous and nonconvex → terrible for optimization.
  • Use: Conceptual; not used for gradient-based training.

2) Logistic loss / Cross-Entropy

  • Definition (binary): L = −[y log p + (1−y) log(1−p)] where p = σ(f(x))
  • Intuition: Punishes confident but wrong predictions heavily.
  • Properties: Convex in the linear predictor for binary logistic, differentiable, probabilistic (MLE under Bernoulli).
  • When to use: Default for probabilistic classifiers and neural nets for classification.
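A quick numeric check of the binary formula above (a hand-rolled sketch, not a library call) shows why "confident but wrong" is punished so hard:

```python
import math

def bce(y, p):
    # binary cross-entropy for true label y in {0, 1} and predicted probability p
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

bce(1, 0.9)   # ≈ 0.105  confident and right: tiny loss
bce(1, 0.5)   # ≈ 0.693  unsure: moderate loss
bce(1, 0.01)  # ≈ 4.605  confident and wrong: huge loss
```

The loss grows without bound as p → 0 for a positive example, which is exactly the pressure that keeps probabilistic classifiers honest.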

3) Hinge loss (SVM family)

  • Definition: L(y, f) = max(0, 1 − y f)
  • Intuition: Wants not only correct classification but a margin of confidence.
  • Properties: Convex, but not differentiable at the hinge; encourages large margins.
  • When to use: Support vector machines and margin-based learning.
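The margin behavior is easy to see numerically. A minimal sketch of the definition above (labels y ∈ {−1, +1}, f is the raw score):

```python
def hinge(y, f):
    # zero loss only when classified correctly WITH margin y*f >= 1
    return max(0.0, 1.0 - y * f)

hinge(+1, 2.0)   # 0.0: correct with margin, no penalty
hinge(+1, 0.5)   # 0.5: correct but inside the margin, still penalized
hinge(+1, -1.0)  # 2.0: wrong side, penalty grows linearly
```

Note the middle case: the prediction is correct, yet the loss is nonzero. That is the "margin of confidence" demand in action.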

4) Softmax + Categorical Cross-Entropy

  • Definition: Multi-class generalization of logistic loss. Softmax converts scores to probabilities; cross-entropy compares to one-hot labels.
  • When to use: Standard in multi-class classification.
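A small sketch of the softmax → cross-entropy pipeline (my own helper names, with the standard max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract max: avoids overflow, same result
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, label):
    # one-hot target picks out a single log-probability
    return -np.log(probs[label])

scores = np.array([2.0, 1.0, 0.1])   # raw class scores from the model
p = softmax(scores)                  # valid probability distribution
loss = cross_entropy(p, label=0)     # loss when true class is 0
```

Frameworks usually fuse softmax and cross-entropy into one op for stability, but conceptually it is exactly these two steps.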

5) Focal loss (for class imbalance)

  • Why: Down-weights well-classified examples so the model focuses on hard, minority-class examples.
  • When to use: Highly imbalanced datasets (e.g., rare object detection).
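The down-weighting is a single multiplicative factor on cross-entropy. A binary sketch (γ = 2 is a common default; the function name is my own):

```python
import math

def focal_loss(y, p, gamma=2.0):
    # (1 - p_t)^gamma scales plain cross-entropy:
    # well-classified examples (p_t near 1) are nearly zeroed out
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(1, 0.9)  # well-classified: loss shrunk ~100x vs cross-entropy
hard = focal_loss(1, 0.1)  # misclassified: keeps most of its cross-entropy weight
```

On a dataset dominated by easy majority-class examples, this factor is what lets the rare hard examples steer the gradient.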

Quick table: property cheat-sheet

| Loss | Convex? | Differentiable? | Robust to outliers? | Probabilistic meaning |
| --- | --- | --- | --- | --- |
| MSE | Yes | Yes | No | Gaussian noise (MLE) |
| MAE | Yes | Subgradients | Yes | Laplace noise (MLE) |
| Huber | Yes | Yes | Moderately | — |
| 0–1 | No | No | — | — |
| Logistic / Cross-Entropy | Often (w.r.t. scores) | Yes | Not robust to label noise | Bernoulli / Categorical MLE |
| Hinge | Yes | Subgradient | No | Margin-based view |

How loss choice links to ERM and over/underfitting

Remember ERM: empirical risk = average loss on training data. When you switch losses, you change the objective landscape — that affects model capacity and the kind of errors the optimizer prioritizes.

  • An outlier-sensitive loss (MSE) can drive the model to overfit to extreme points, increasing variance.
  • A robust loss (MAE, Huber) can reduce sensitivity to noisy points, which may improve generalization in messy real-world data.
  • Strong margin losses (hinge) implicitly regularize by demanding confident separation.

Choosing a loss is part of the modeler’s toolkit for balancing bias and variance. If you obsess only about model class (polynomial degree, tree depth) but ignore loss, you’re missing half the picture.


Optimization & practicality

  • Convex + smooth losses → easier guarantees, convex solvers or stable gradient descent.
  • Nonconvex losses (or nonconvex models like deep nets) rely on optimization heuristics; the shape of the loss still matters for convergence speed.

Pseudocode: one gradient step for parameter θ under loss L

# gradient descent step
θ ← θ − η * (1/N) * Σ_i ∇_θ L(y_i, f(x_i; θ))

If ∇_θ L is large from an outlier (e.g., squared loss), that single point can hijack updates.
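A runnable version of that step for a 1-D linear model f(x; θ) = θx under squared loss (the toy data is made up for illustration) makes the hijacking visible: the gradient of (y − θx)^2 is −2x(y − θx), so the outlier's residual enters the average linearly and swamps everything else.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
y = np.array([1.0, 1.1, 0.9, 50.0])   # last target is an outlier
theta, eta = 1.0, 0.1                  # start at theta = 1, step size 0.1

# averaged gradient of squared loss w.r.t. theta
grad = np.mean(-2 * x * (y - theta * x))

# one gradient descent step
theta_new = theta - eta * grad
```

Three of the four points already fit almost perfectly, yet the single outlier drags θ from 1.0 to roughly 3.45 in one step. Swap in a robust loss and its gradient contribution would be capped.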


Practical heuristics & checklist

  • If your dataset has clear outliers or heavy tails, try MAE or Huber.
  • For classification problems where you want probabilities, use cross-entropy.
  • If classes are heavily imbalanced, consider focal loss or class-weighted cross-entropy.
  • Want margins and interpretability? Hinge loss/SVMs are useful.
  • Always monitor not just training loss but validation performance related to your real metric (accuracy, F1, MAE, etc.).

Closing — TL;DR and a moral

  • Loss functions encode priorities: they tell the model what mistakes are sins and what are misdemeanors.
  • They interact with ERM and regularization: choosing a loss is as important as choosing model complexity for avoiding under/overfitting.
  • Optimization reality: prefer differentiable, well-behaved losses when you rely on gradient-based training, but don’t shy from hybrid losses (Huber) when data is messy.

Parting thought: if your model were a student, the loss is the syllabus. Make sure it grades what actually matters. Messy syllabus → confused student → weird exam performance.

Next up: pick one of these losses and we’ll practice — compare training curves, inspect gradients, and watch how outliers either get bullied or coddled. Ready to pick favorites and fight over them like academic roommates?
