
Supervised Machine Learning: Regression and Classification

Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.

Heteroscedasticity: Chaos, Diagnostics, and Robust Losses (Sassy Stats Edition)


Regression I: Linear Models — Heteroscedasticity and Robust Losses

"Your residuals look like confetti — fun at parties, terrible for inference."

You already know OLS like the back of your hand: the elegant closed-form beta-hat = (X^TX)^{-1}X^Ty, and how gradient descent can carry you to the same place when you feel like numerically suffering. You also just learned to split data carefully with cross‑validation so you don't leak future wisdom into your model. Now: what if the noise in your data is not behaving — if its variance changes with x, like a toddler on espresso? Welcome to heteroscedasticity, and the lovely world of robust losses that don't cry when outliers show up.


What is heteroscedasticity, and why should you care?

  • Homoscedasticity (the boring ideal): Var(ε_i | x_i) = σ^2 — same spread everywhere.
  • Heteroscedasticity (the reality): Var(ε_i | x_i) = σ_i^2 — spread depends on x.

Why it matters:

  • OLS coefficients remain unbiased, but they are no longer efficient (not BLUE).
  • Standard errors from the usual OLS formula become wrong, so p-values and confidence intervals lie to you.
  • If variance correlates with predictors, predictions will be heterogeneously unreliable — important if you care about uncertainty.

Quick diagnostic checklist you should run before declaring everything fine:

  1. Plot residuals vs fitted values — a fan or funnel shape is the canonical heteroscedasticity circus.
  2. Scale-Location plot (sqrt(|residuals|) vs fitted).
  3. Formal tests: Breusch–Pagan, White test.
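The checklist above is easy to automate. Below is a minimal numpy-only sketch (simulated data, made-up coefficients) that fits OLS to deliberately heteroscedastic data and compares residual spread between the low-x and high-x halves — a crude, code-level version of the scale-location plot:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data whose noise standard deviation grows with x.
n = 500
x = rng.uniform(1, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5 * x)  # Var(eps | x) grows with x^2

# Fit OLS and compute residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Crude scale check: residual spread in the low-x half vs the high-x half.
order = np.argsort(x)
low, high = resid[order[: n // 2]], resid[order[n // 2:]]
print(f"std(residuals | low x)  = {low.std():.2f}")
print(f"std(residuals | high x) = {high.std():.2f}")
```

For the formal tests, statsmodels ships implementations in `statsmodels.stats.diagnostic` (`het_breuschpagan`, `het_white`), which are preferable to rolling your own.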

Two pragmatic responses: Adjust inference, or change the estimator

1) Fix inference: heteroscedasticity-consistent (robust) standard errors

If your goal is correct inference (p-values, confidence intervals) but you still like the OLS point estimates, use heteroscedasticity-consistent covariance estimators (a.k.a. "robust SEs"). Famous variants: HC0, HC1, HC2, HC3.

  • HC0: sandwich estimator using squared residuals directly.
  • HC3: adjusts for leverage, good small-sample behavior (popular in applied econometrics).

These don't change beta-hat, they change your confidence about beta-hat. Very useful if you don't want to redesign your estimator.
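Here is one way the HC0/HC3 sandwich could be computed by hand — a numpy sketch on simulated data, not a replacement for a stats library:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * (1 + x))  # heteroscedastic noise

beta = np.linalg.solve(X.T @ X, X.T @ y)          # OLS point estimates (unchanged by robust SEs)
e = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)       # leverage values h_ii

def sandwich(u2):
    """HC covariance: (X'X)^-1 X' diag(u2) X (X'X)^-1."""
    meat = X.T @ (X * u2[:, None])
    return XtX_inv @ meat @ XtX_inv

se_hc0 = np.sqrt(np.diag(sandwich(e**2)))                  # HC0: raw squared residuals
se_hc3 = np.sqrt(np.diag(sandwich(e**2 / (1 - h) ** 2)))   # HC3: leverage-adjusted
print("HC0 SEs:", se_hc0)
print("HC3 SEs:", se_hc3)
```

Note that HC3 inflates each squared residual by 1/(1 - h_ii)^2, so its standard errors are never smaller than HC0's — the small-sample conservatism mentioned above.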

2) Change the estimator: WLS, IRLS, or robust losses

If you want to be efficient (or protect against outliers), change the loss. Options:

  • Weighted Least Squares (WLS): If you know σ_i^2 up to scale, minimize
min_beta sum_i (1/sigma_i^2) * (y_i - x_i^T beta)^2

This has a closed-form: beta_hat = (X^T W X)^{-1} X^T W y, where W = diag(1/sigma_i^2). If weights are correct, WLS is BLUE.

  • Feasible WLS (fitted weights): Often sigma_i^2 is unknown. Estimate it from a preliminary fit (e.g., regress squared residuals on x), form weights w_i = 1/hat{sigma_i^2}, and re-fit. Iterating this gives you IRLS (Iteratively Reweighted Least Squares).

  • Robust loss functions: If outliers are the real enemy, swap L2 for something more forgiving.

    • L1 (Least Absolute Deviations): minimize sum |residuals|. Robust to outliers in y. No closed form — solvable by linear programming or gradient methods.
    • Huber loss: quadratic near zero, linear in tails — the compromise you didn't know you needed.
    • Tukey's biweight and others: downweight big residuals more aggressively.

Important link to previous topics: many of these robust losses do not have a neat closed form like OLS does. So you either use IRLS (which reduces to weighted least squares each iteration and thus connects to your WLS algebra) or gradient descent — the same algorithmic muscles you practiced for OLS.
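To make the WLS algebra concrete, here is a numpy sketch of the closed form beta_hat = (X^T W X)^{-1} X^T W y with known noise scales (simulated data; the diagonal W is never materialized, only row-scaled copies of X):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(1, 10, n)
sigma = 0.2 * x                               # known noise scale, grows with x
y = 4.0 - 1.5 * x + rng.normal(0, sigma)

X = np.column_stack([np.ones(n), x])
w = 1.0 / sigma**2                            # W = diag(1 / sigma_i^2)

# Closed-form WLS without forming the n x n matrix W:
Xw = X * w[:, None]                           # each row of X scaled by its weight
beta_wls = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # solves (X'WX) beta = X'Wy
print("WLS estimate:", beta_wls)
```

Because the weights here are the true inverse variances, this estimator is the BLUE the bullet above promises; with wrong weights you lose that guarantee but keep unbiasedness.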


Short table: Loss functions at a glance

| Loss          | Sensitivity to outliers | Closed form? | Optimization                       |
|---------------|-------------------------|--------------|------------------------------------|
| L2 (OLS)      | High                    | Yes          | Closed form or GD                  |
| L1            | Low                     | No           | Linear program, subgradient, or GD |
| Huber (delta) | Medium                  | No           | IRLS or GD                         |

Practical recipes (a.k.a. How not to be surprised by noisy data)

  1. Diagnose first: visualize residuals and run BP or White tests.
  2. If heteroscedasticity is present but you only care about coefficients: compute robust SEs (HC3 if unsure).
  3. If heteroscedasticity is structural (predictable by x), try WLS:
    • If you know variances, plug them into W.
    • If not, estimate variances from a preliminary fit and run Feasible WLS/IRLS.
  4. If outliers are the problem (not just changing variance), use L1/Huber/Tukey.
  5. Always do model selection and evaluation with correct resampling: estimate weights or fit robust models within each training fold — do not peek at validation/test residuals to form weights (this is leakage!).

"If you estimate weights using the whole dataset and then CV — congratulations, you invented data leakage."
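A minimal sketch of the "weights inside the fold" discipline: feasible WLS re-estimated per training fold, with numpy-only K-fold splitting. (The `feasible_wls` helper is illustrative, not a library function; the log-squared-residual variance model is one common choice, not the only one.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 10, n)
y = 2.0 + 1.0 * x + rng.normal(0, 0.3 * x)
X = np.column_stack([np.ones(n), x])

def feasible_wls(X_tr, y_tr):
    """Preliminary OLS, variance model on log squared residuals, then WLS."""
    b_ols = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
    e2 = (y_tr - X_tr @ b_ols) ** 2
    g = np.linalg.lstsq(X_tr, np.log(e2 + 1e-8), rcond=None)[0]  # log-variance model
    w = 1.0 / np.exp(X_tr @ g)                                   # fitted inverse variances
    Xw = X_tr * w[:, None]
    return np.linalg.solve(Xw.T @ X_tr, Xw.T @ y_tr)

# 5-fold CV: all weight estimation happens inside the training fold only.
folds = np.array_split(rng.permutation(n), 5)
maes = []
for val_idx in folds:
    tr_idx = np.setdiff1d(np.arange(n), val_idx)
    beta = feasible_wls(X[tr_idx], y[tr_idx])
    maes.append(np.mean(np.abs(y[val_idx] - X[val_idx] @ beta)))
print("per-fold MAE:", np.round(maes, 3))
```

The validation rows never touch the preliminary fit or the variance model — that is the whole point.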

Practical pseudocode for IRLS (Huber-friendly):

initialize beta
for iter in 1..max_iter:
    residuals = y - X @ beta
    weights = huber_weights(residuals, delta)
    W = diag(weights)
    beta = inv(X.T @ W @ X) @ (X.T @ W @ y)
    if converged: break

Huber weights are basically 1 for small residuals and delta / |r| for large residuals.
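The pseudocode above turns into runnable numpy with only a few additions (assumed here: an OLS warm start, a convergence tolerance, and deliberately planted outliers at high x — all choices of this sketch, not part of the algorithm):

```python
import numpy as np

def huber_weights(r, delta):
    """IRLS weights for Huber loss: 1 inside delta, delta/|r| in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def irls_huber(X, y, delta=1.0, max_iter=50, tol=1e-8):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS warm start
    for _ in range(max_iter):
        w = huber_weights(y - X @ beta, delta)
        Xw = X * w[:, None]                            # weighted LS step
        beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Clean line y = 1 + 2x, then corrupt the last few targets with gross outliers.
x = np.linspace(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
y[-5:] += 50.0                                         # wild outliers at high x

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_huber = irls_huber(X, y, delta=1.0)
print("OLS slope:  ", round(b_ols[1], 3))              # dragged up by the outliers
print("Huber slope:", round(b_huber[1], 3))            # stays near the true 2
```

Each iteration is exactly a WLS solve — the connection to the WLS algebra promised earlier.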


Evaluation considerations & cross-validation

You already learned to avoid leakage in CV. Now add weight/variance modeling to the list of things to do inside the training fold.

  • If using Feasible WLS or any estimator that needs a preliminary fit to compute weights, do that inside each fold.
  • If your error variance is heteroscedastic, choose evaluation metrics that reflect your goals: mean absolute error may be preferable if you worry about outliers; weighted MSE if you care about relative errors across variance regimes.
  • Consider time-based splits carefully: if variance evolves over time (financial volatility!), weights estimated on older data may be invalid — treat variance modeling like any nonstationary feature.

Bonus: Modeling the variance — a slightly fancy option

You can model both mean and variance: assume y | x ~ N(mu(x), sigma^2(x)). Fit mu(x) (e.g., linear) and model log sigma^2(x) as another linear function. This gives you principled uncertainty estimates and can be fit via maximum likelihood or iterated methods. Great for heteroscedastic prediction tasks (pricing, risk, etc.).
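A two-step sketch of the mean/variance idea: fit the mean by OLS, then regress log squared residuals on x to estimate the slope of log sigma^2(x). (Simulated data; for Gaussian noise the intercept of this regression is biased by a known constant, so the slope is the usable part — a full ML fit would iterate between the two steps.)

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
# True model: mu(x) = 1 + 2x, log sigma^2(x) = -1 + 0.6x
y = 1.0 + 2.0 * x + rng.normal(0, np.exp(0.5 * (-1.0 + 0.6 * x)))

# Step 1: fit the mean with OLS.
b_mean = np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - X @ b_mean) ** 2

# Step 2: model log sigma^2(x) as linear in x via the log squared residuals.
g = np.linalg.lstsq(X, np.log(e2 + 1e-12), rcond=None)[0]
print("variance-model slope:", round(g[1], 2))  # estimates the true slope 0.6
```

With the fitted g you can emit per-point predictive standard deviations, or feed exp(-X @ g) back in as WLS weights for a more efficient mean fit.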


Takeaways (the pep talk)

  • Heteroscedasticity breaks your standard errors and efficiency, but your coefficients stay unbiased. Don't pretend it didn't happen.
  • Use robust SEs for correct inference without changing your point estimates.
  • Use WLS/IRLS/robust-loss estimators when variance or outliers affect prediction quality or efficiency.
  • Everything that requires estimating weights or variances must be inside the training fold during CV — otherwise you just invented data leakage and should be ashamed (lovingly).

Go forth: visualize, test, pick a loss that reflects your risk attitude, and always, always keep one eye on those residuals. They tell stories — sometimes romantic tragedies, sometimes blessed comedies — and you should listen.


Version info: this builds on the OLS closed-form and gradient-descent material, and extends your cross-validation discipline into the arena of heteroscedastic modeling and robust optimization.
