
Supervised Machine Learning: Regression and Classification

Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.

Heteroscedasticity: Chaos, Diagnostics, and Robust Losses (Sassy Stats Edition)


Regression I: Linear Models — Heteroscedasticity and Robust Losses

"Your residuals look like confetti — fun at parties, terrible for inference."

You already know OLS like the back of your hand: the elegant closed-form beta-hat = (X^TX)^{-1}X^Ty, and how gradient descent can carry you to the same place when you feel like numerically suffering. You also just learned to split data carefully with cross‑validation so you don't leak future wisdom into your model. Now: what if the noise in your data is not behaving — if its variance changes with x, like a toddler on espresso? Welcome to heteroscedasticity, and the lovely world of robust losses that don't cry when outliers show up.


What is heteroscedasticity, and why should you care?

  • Homoscedasticity (the boring ideal): Var(ε_i | x_i) = σ^2 — same spread everywhere.
  • Heteroscedasticity (the reality): Var(ε_i | x_i) = σ_i^2 — spread depends on x.

Why it matters:

  • OLS coefficients remain unbiased, but they are no longer efficient (not BLUE).
  • Standard errors from the usual OLS formula become wrong, so p-values and confidence intervals lie to you.
  • If variance correlates with predictors, predictions will be heterogeneously unreliable — important if you care about uncertainty.

Quick diagnostic checklist you should run before declaring everything fine:

  1. Plot residuals vs fitted values — a fan or funnel shape is the canonical heteroscedasticity circus.
  2. Scale-Location plot (sqrt(|residuals|) vs fitted).
  3. Formal tests: Breusch–Pagan, White test.
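The checklist above is easy to automate. Below is a minimal numpy-only sketch (simulated data, made-up coefficients) that fits OLS to deliberately heteroscedastic data and compares residual spread between the low-x and high-x halves — a crude, code-level version of the scale-location plot:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data whose noise standard deviation grows with x.
n = 500
x = rng.uniform(1, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5 * x)  # Var(eps | x) grows with x^2

# Fit OLS and compute residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Crude scale check: residual spread in the low-x half vs the high-x half.
order = np.argsort(x)
low, high = resid[order[: n // 2]], resid[order[n // 2:]]
print(f"std(residuals | low x)  = {low.std():.2f}")
print(f"std(residuals | high x) = {high.std():.2f}")
```

For the formal tests, statsmodels ships implementations in `statsmodels.stats.diagnostic` (`het_breuschpagan`, `het_white`), which are preferable to rolling your own.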

Two pragmatic responses: Adjust inference, or change the estimator

1) Fix inference: heteroscedasticity-consistent (robust) standard errors

If your goal is correct inference (p-values, confidence intervals) but you still like the OLS point estimates, use heteroscedasticity-consistent covariance estimators (a.k.a. "robust SEs"). Famous variants: HC0, HC1, HC2, HC3.

  • HC0: sandwich estimator using squared residuals directly.
  • HC3: adjusts for leverage, good small-sample behavior (popular in applied econometrics).

These don't change beta-hat, they change your confidence about beta-hat. Very useful if you don't want to redesign your estimator.
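Here is one way the HC0/HC3 sandwich could be computed by hand — a numpy sketch on simulated data, not a replacement for a stats library:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * (1 + x))  # heteroscedastic noise

beta = np.linalg.solve(X.T @ X, X.T @ y)          # OLS point estimates (unchanged by robust SEs)
e = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)       # leverage values h_ii

def sandwich(u2):
    """HC covariance: (X'X)^-1 X' diag(u2) X (X'X)^-1."""
    meat = X.T @ (X * u2[:, None])
    return XtX_inv @ meat @ XtX_inv

se_hc0 = np.sqrt(np.diag(sandwich(e**2)))                  # HC0: raw squared residuals
se_hc3 = np.sqrt(np.diag(sandwich(e**2 / (1 - h) ** 2)))   # HC3: leverage-adjusted
print("HC0 SEs:", se_hc0)
print("HC3 SEs:", se_hc3)
```

Note that HC3 inflates each squared residual by 1/(1 - h_ii)^2, so its standard errors are never smaller than HC0's — the small-sample conservatism mentioned above.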

2) Change the estimator: WLS, IRLS, or robust losses

If you want to be efficient (or protect against outliers), change the loss. Options:

  • Weighted Least Squares (WLS): If you know σ_i^2 up to scale, minimize
min_beta sum_i (1/sigma_i^2) * (y_i - x_i^T beta)^2

This has a closed-form: beta_hat = (X^T W X)^{-1} X^T W y, where W = diag(1/sigma_i^2). If weights are correct, WLS is BLUE.

  • Feasible WLS (fitted weights): Often sigma_i^2 is unknown. Estimate it from a preliminary fit (e.g., regress squared residuals on x), form weights w_i = 1/hat{sigma_i^2}, and re-fit. Iterating this gives you IRLS (Iteratively Reweighted Least Squares).

  • Robust loss functions: If outliers are the real enemy, swap L2 for something more forgiving.

    • L1 (Least Absolute Deviations): minimize sum |residuals|. Robust to outliers in y. No closed form — solvable by linear programming or gradient methods.
    • Huber loss: quadratic near zero, linear in tails — the compromise you didn't know you needed.
    • Tukey's biweight and others: downweight big residuals more aggressively.

Important link to previous topics: many of these robust losses do not have a neat closed form like OLS does. So you either use IRLS (which reduces to weighted least squares each iteration and thus connects to your WLS algebra) or gradient descent — the same algorithmic muscles you practiced for OLS.
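To make the WLS algebra concrete, here is a numpy sketch of the closed form beta_hat = (X^T W X)^{-1} X^T W y with known noise scales (simulated data; the diagonal W is never materialized, only row-scaled copies of X):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(1, 10, n)
sigma = 0.2 * x                               # known noise scale, grows with x
y = 4.0 - 1.5 * x + rng.normal(0, sigma)

X = np.column_stack([np.ones(n), x])
w = 1.0 / sigma**2                            # W = diag(1 / sigma_i^2)

# Closed-form WLS without forming the n x n matrix W:
Xw = X * w[:, None]                           # each row of X scaled by its weight
beta_wls = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # solves (X'WX) beta = X'Wy
print("WLS estimate:", beta_wls)
```

Because the weights here are the true inverse variances, this estimator is the BLUE the bullet above promises; with wrong weights you lose that guarantee but keep unbiasedness.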


Short table: Loss functions at a glance

| Loss          | Sensitivity to outliers | Closed form? | Optimization                       |
|---------------|-------------------------|--------------|------------------------------------|
| L2 (OLS)      | High                    | Yes          | Closed form or GD                  |
| L1            | Low                     | No           | Linear program, subgradient, or GD |
| Huber (delta) | Medium                  | No           | IRLS or GD                         |

Practical recipes (a.k.a. How not to be surprised by noisy data)

  1. Diagnose first: visualize residuals and run BP or White tests.
  2. If heteroscedasticity is present but you only care about coefficients: compute robust SEs (HC3 if unsure).
  3. If heteroscedasticity is structural (predictable by x), try WLS:
    • If you know variances, plug them into W.
    • If not, estimate variances from a preliminary fit and run Feasible WLS/IRLS.
  4. If outliers are the problem (not just changing variance), use L1/Huber/Tukey.
  5. Always do model selection and evaluation with correct resampling: estimate weights or fit robust models within each training fold — do not peek at validation/test residuals to form weights (this is leakage!).

"If you estimate weights using the whole dataset and then CV — congratulations, you invented data leakage."
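A minimal sketch of the "weights inside the fold" discipline: feasible WLS re-estimated per training fold, with numpy-only K-fold splitting. (The `feasible_wls` helper is illustrative, not a library function; the log-squared-residual variance model is one common choice, not the only one.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 10, n)
y = 2.0 + 1.0 * x + rng.normal(0, 0.3 * x)
X = np.column_stack([np.ones(n), x])

def feasible_wls(X_tr, y_tr):
    """Preliminary OLS, variance model on log squared residuals, then WLS."""
    b_ols = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
    e2 = (y_tr - X_tr @ b_ols) ** 2
    g = np.linalg.lstsq(X_tr, np.log(e2 + 1e-8), rcond=None)[0]  # log-variance model
    w = 1.0 / np.exp(X_tr @ g)                                   # fitted inverse variances
    Xw = X_tr * w[:, None]
    return np.linalg.solve(Xw.T @ X_tr, Xw.T @ y_tr)

# 5-fold CV: all weight estimation happens inside the training fold only.
folds = np.array_split(rng.permutation(n), 5)
maes = []
for val_idx in folds:
    tr_idx = np.setdiff1d(np.arange(n), val_idx)
    beta = feasible_wls(X[tr_idx], y[tr_idx])
    maes.append(np.mean(np.abs(y[val_idx] - X[val_idx] @ beta)))
print("per-fold MAE:", np.round(maes, 3))
```

The validation rows never touch the preliminary fit or the variance model — that is the whole point.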

Practical pseudocode for IRLS (Huber-friendly):

initialize beta
for iter in 1..max_iter:
    residuals = y - X @ beta
    weights = huber_weights(residuals, delta)
    W = diag(weights)
    beta = inv(X.T @ W @ X) @ (X.T @ W @ y)
    if converged: break

Huber weights are basically 1 for small residuals and delta / |r| for large residuals.
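The pseudocode above turns into runnable numpy with only a few additions (assumed here: an OLS warm start, a convergence tolerance, and deliberately planted outliers at high x — all choices of this sketch, not part of the algorithm):

```python
import numpy as np

def huber_weights(r, delta):
    """IRLS weights for Huber loss: 1 inside delta, delta/|r| in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def irls_huber(X, y, delta=1.0, max_iter=50, tol=1e-8):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS warm start
    for _ in range(max_iter):
        w = huber_weights(y - X @ beta, delta)
        Xw = X * w[:, None]                            # weighted LS step
        beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Clean line y = 1 + 2x, then corrupt the last few targets with gross outliers.
x = np.linspace(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
y[-5:] += 50.0                                         # wild outliers at high x

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_huber = irls_huber(X, y, delta=1.0)
print("OLS slope:  ", round(b_ols[1], 3))              # dragged up by the outliers
print("Huber slope:", round(b_huber[1], 3))            # stays near the true 2
```

Each iteration is exactly a WLS solve — the connection to the WLS algebra promised earlier.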


Evaluation considerations & cross-validation

You already learned to avoid leakage in CV. Now add weight/variance modeling to the list of things to do inside the training fold.

  • If using Feasible WLS or any estimator that needs a preliminary fit to compute weights, do that inside each fold.
  • If your error variance is heteroscedastic, choose evaluation metrics that reflect your goals: mean absolute error may be preferable if you worry about outliers; weighted MSE if you care about relative errors across variance regimes.
  • Consider time-based splits carefully: if variance evolves over time (financial volatility!), weights estimated on older data may be invalid — treat variance modeling like any nonstationary feature.

Bonus: Modeling the variance — a slightly fancy option

You can model both mean and variance: assume y | x ~ N(mu(x), sigma^2(x)). Fit mu(x) (e.g., linear) and model log sigma^2(x) as another linear function. This gives you principled uncertainty estimates and can be fit via maximum likelihood or iterated methods. Great for heteroscedastic prediction tasks (pricing, risk, etc.).
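A two-step sketch of the mean/variance idea: fit the mean by OLS, then regress log squared residuals on x to estimate the slope of log sigma^2(x). (Simulated data; for Gaussian noise the intercept of this regression is biased by a known constant, so the slope is the usable part — a full ML fit would iterate between the two steps.)

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
# True model: mu(x) = 1 + 2x, log sigma^2(x) = -1 + 0.6x
y = 1.0 + 2.0 * x + rng.normal(0, np.exp(0.5 * (-1.0 + 0.6 * x)))

# Step 1: fit the mean with OLS.
b_mean = np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - X @ b_mean) ** 2

# Step 2: model log sigma^2(x) as linear in x via the log squared residuals.
g = np.linalg.lstsq(X, np.log(e2 + 1e-12), rcond=None)[0]
print("variance-model slope:", round(g[1], 2))  # estimates the true slope 0.6
```

With the fitted g you can emit per-point predictive standard deviations, or feed exp(-X @ g) back in as WLS weights for a more efficient mean fit.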


Takeaways (the pep talk)

  • Heteroscedasticity breaks your standard errors and efficiency, but your coefficients stay unbiased. Don't pretend it didn't happen.
  • Use robust SEs for correct inference without changing your point estimates.
  • Use WLS/IRLS/robust-loss estimators when variance or outliers affect prediction quality or efficiency.
  • Everything that requires estimating weights or variances must be inside the training fold during CV — otherwise you just invented data leakage and should be ashamed (lovingly).

Go forth: visualize, test, pick a loss that reflects your risk attitude, and always, always keep one eye on those residuals. They tell stories — sometimes romantic tragedies, sometimes blessed comedies — and you should listen.


Version info: this builds on the OLS closed-form and gradient-descent material, and extends your cross-validation discipline into the arena of heteroscedastic modeling and robust optimization.
