Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Heteroscedasticity and Robust Losses
"Your residuals look like confetti — fun at parties, terrible for inference."
You already know OLS like the back of your hand: the elegant closed-form beta-hat = (X^TX)^{-1}X^Ty, and how gradient descent can carry you to the same place when you feel like numerically suffering. You also just learned to split data carefully with cross‑validation so you don't leak future wisdom into your model. Now: what if the noise in your data is not behaving — if its variance changes with x, like a toddler on espresso? Welcome to heteroscedasticity, and the lovely world of robust losses that don't cry when outliers show up.
What is heteroscedasticity, and why should you care?
- Homoscedasticity (the boring ideal): Var(ε_i | x_i) = σ^2 — same spread everywhere.
- Heteroscedasticity (the reality): Var(ε_i | x_i) = σ_i^2 — spread depends on x.
Why it matters:
- OLS coefficients remain unbiased, but they are no longer efficient (not BLUE).
- Standard errors from the usual OLS formula become wrong, so p-values and confidence intervals lie to you.
- If variance correlates with predictors, predictions will be heterogeneously unreliable — important if you care about uncertainty.
Quick diagnostic checklist you should run before declaring everything fine:
- Plot residuals vs fitted values — a funnel or megaphone shape is the classic giveaway. (The canonical heteroscedasticity circus.)
- Scale-Location plot (sqrt(|residuals|) vs fitted).
- Formal tests: Breusch–Pagan, White test.
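The Breusch–Pagan idea is simple enough to sketch by hand: fit OLS, then regress the squared residuals on the predictors and check whether they explain anything. A minimal NumPy sketch (the function name `breusch_pagan_lm` is ours, and `X` is assumed to already contain an intercept column):

```python
import numpy as np

def breusch_pagan_lm(X, y):
    # OLS fit, then auxiliary regression of squared residuals on X
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    fitted = X @ gamma
    ss_res = np.sum((e2 - fitted) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # LM statistic: n * R^2, ~ chi^2 with (k - 1) df under homoscedasticity
    return len(y) * r2
```

Large values of the statistic (relative to the chi-squared reference) mean the squared residuals are predictable from x — i.e., heteroscedasticity.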
Two pragmatic responses: Adjust inference, or change the estimator
1) Fix inference: heteroscedasticity-consistent (robust) standard errors
If your goal is correct inference (p-values, CI) but you still like OLS point estimates, use heteroscedasticity-consistent covariance estimators (a.k.a. "robust SEs"). Famous variants: HC0, HC1, HC2, HC3.
- HC0: sandwich estimator using squared residuals directly.
- HC3: adjusts for leverage, good small-sample behavior (popular in applied econometrics).
These don't change beta-hat, they change your confidence about beta-hat. Very useful if you don't want to redesign your estimator.
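The sandwich construction is short enough to write out. A sketch (our own helper `robust_se`; assumes `X` has full column rank — not production code):

```python
import numpy as np

def robust_se(X, y, kind="HC3"):
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    # leverages h_i = diag(X (X'X)^{-1} X')
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    if kind == "HC0":
        omega = e ** 2
    elif kind == "HC1":
        omega = e ** 2 * n / (n - k)
    else:  # HC3: inflate high-leverage residuals
        omega = e ** 2 / (1.0 - h) ** 2
    meat = X.T @ (omega[:, None] * X)
    V = XtX_inv @ meat @ XtX_inv   # the "sandwich"
    return beta, np.sqrt(np.diag(V))
```

Note that the point estimate `beta` is plain OLS regardless of `kind` — only the covariance (and hence the SEs) changes.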
2) Change the estimator: WLS, IRLS, or robust losses
If you want to be efficient (or protect against outliers), change the loss. Options:
- Weighted Least Squares (WLS): If you know σ_i^2 up to scale, minimize
min_beta sum_i (1/sigma_i^2) * (y_i - x_i^T beta)^2
This has a closed-form: beta_hat = (X^T W X)^{-1} X^T W y, where W = diag(1/sigma_i^2). If weights are correct, WLS is BLUE.
Feasible WLS (the diagonal special case of feasible GLS): often sigma_i^2 is unknown. Estimate it from a preliminary fit (e.g., regress squared residuals on x), form weights w_i = 1/hat{sigma}_i^2, and re-fit. Iterating this gives IRLS (Iteratively Reweighted Least Squares).
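The WLS closed form above is a one-liner once you notice that multiplying the rows of X by the weights implements diag(w) @ X. A sketch (our helper name `wls`; `w` holds per-observation weights, ideally 1/sigma_i^2):

```python
import numpy as np

def wls(X, y, w):
    # Closed form (X' W X)^{-1} X' W y with W = diag(w),
    # without ever materializing the n-by-n matrix W.
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)
```

With all weights equal, this reduces exactly to OLS — a handy sanity check.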
- Robust loss functions: If outliers are the real enemy, swap L2 for something more forgiving.
- L1 (Least Absolute Deviations): minimize sum |residuals|. Robust to outliers in y. No closed form — solvable by linear programming or gradient methods.
- Huber loss: quadratic near zero, linear in tails — the compromise you didn't know you needed.
- Tukey's biweight and others: downweight big residuals more aggressively.
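The Huber loss in the list above is just a piecewise definition, easy to write down explicitly (a sketch; `delta` is the transition point between the quadratic and linear regimes):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    # quadratic for |r| <= delta, linear (slope delta) beyond
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))
```

The `delta * (a - 0.5 * delta)` branch is chosen so the two pieces meet with matching value and slope at |r| = delta — that continuity is what makes Huber a smooth compromise between L2 and L1.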
Important link to previous topics: many of these robust losses do not have a neat closed form like OLS does. So you either use IRLS (which reduces to weighted least squares each iteration and thus connects to your WLS algebra) or gradient descent — the same algorithmic muscles you practiced for OLS.
Short table: Loss functions at a glance
| Loss | Sensitivity to outliers | Closed form? | Optimization |
|---|---|---|---|
| L2 (OLS) | High | Yes | Closed form or GD |
| L1 | Low | No | Linear program, subgradient or GD |
| Huber (delta) | Medium | No | IRLS or GD |
Practical recipes (a.k.a. How not to be surprised by noisy data)
- Diagnose first: visualize residuals and run BP or White tests.
- If heteroscedasticity is present but you only care about coefficients: compute robust SEs (HC3 if unsure).
- If heteroscedasticity is structural (predictable by x), try WLS:
- If you know variances, plug them into W.
- If not, estimate variances from a preliminary fit and run Feasible WLS/IRLS.
- If outliers are the problem (not just changing variance), use L1/Huber/Tukey.
- Always do model selection and evaluation with correct resampling: estimate weights or fit robust models within each training fold — do not peek at validation/test residuals to form weights (this is leakage!).
"If you estimate weights using the whole dataset and then CV — congratulations, you invented data leakage."
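To make the no-leakage rule concrete, here is a cross-validation loop where the variance model (and therefore the weights) is estimated inside each training fold only. Everything here is a sketch under our own names (`cv_fwls_mse`); the variance model is a crude regression of squared residuals on X, clipped to stay positive:

```python
import numpy as np

def cv_fwls_mse(X, y, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errs = []
    for k in range(n_folds):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # preliminary OLS on the training fold only
        b0, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        e2 = (y[tr] - X[tr] @ b0) ** 2
        # variance model fit on the training fold only (clipped positive)
        g, *_ = np.linalg.lstsq(X[tr], e2, rcond=None)
        w = 1.0 / np.clip(X[tr] @ g, 1e-6, None)
        # WLS refit with the fold-local weights
        Xw = X[tr] * w[:, None]
        beta = np.linalg.solve(Xw.T @ X[tr], Xw.T @ y[tr])
        errs.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return float(np.mean(errs))
```

The validation rows never touch the weight estimation — that is the whole point.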
A runnable IRLS sketch for the Huber loss (NumPy; `X`, `y`, `delta`, `max_iter`, and `tol` are assumed to be defined):

```python
import numpy as np

def huber_weights(r, delta):
    # weight 1 for |r| <= delta, delta/|r| beyond (guard against r = 0)
    a = np.maximum(np.abs(r), 1e-12)
    return np.where(a <= delta, 1.0, delta / a)

beta = np.zeros(X.shape[1])
for _ in range(max_iter):
    w = huber_weights(y - X @ beta, delta)
    Xw = X * w[:, None]                     # diag(w) @ X without forming W
    beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    if np.linalg.norm(beta_new - beta) < tol:
        beta = beta_new
        break
    beta = beta_new
```

Huber weights are 1 for small residuals (|r| <= delta) and delta/|r| for large ones — which is exactly how big residuals get downweighted.
Evaluation considerations & cross-validation
You already learned to avoid leakage in CV. Now add weight/variance modeling to the list of things to do inside the training fold.
- If using Feasible WLS or any estimator that needs a preliminary fit to compute weights, do that inside each fold.
- If your error variance is heteroscedastic, choose evaluation metrics that reflect your goals: mean absolute error may be preferable if you worry about outliers; weighted MSE if you care about relative errors across variance regimes.
- Consider time-based splits carefully: if variance evolves over time (financial volatility!), weights estimated on older data may be invalid — treat variance modeling like any nonstationary feature.
Bonus: Modeling the variance — a slightly fancy option
You can model both mean and variance: assume y | x ~ N(mu(x), sigma^2(x)). Fit mu(x) (e.g., linear) and model log sigma^2(x) as another linear function. This gives you principled uncertainty estimates and can be fit via maximum likelihood or iterated methods. Great for heteroscedastic prediction tasks (pricing, risk, etc.).
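One simple way to fit such a model is to alternate between the two pieces: fit the mean by WLS given the current variance model, then refit the log-variance from the squared residuals. A rough sketch under our own naming (`fit_mean_variance`); note that regressing log e^2 on x is a crude moment-style fit, not full maximum likelihood, so the log-variance intercept is biased downward, but the mean estimate is still a sensible WLS fit:

```python
import numpy as np

def fit_mean_variance(X, y, n_iter=20):
    # y | x ~ N(X @ b, exp(X @ g)): alternate b- and g-updates
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    g = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # log-variance model from current squared residuals (floored)
        e2 = (y - X @ b) ** 2 + 1e-8
        g, *_ = np.linalg.lstsq(X, np.log(e2), rcond=None)
        # WLS step for the mean with weights 1 / sigma^2(x)
        w = np.exp(-X @ g)
        Xw = X * w[:, None]
        b = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return b, g
```

The returned `g` gives you per-point predictive variances exp(x^T g) — the "principled uncertainty estimates" mentioned above, up to the bias caveat.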
Takeaways (the pep talk)
- Heteroscedasticity breaks your standard errors and efficiency, but your coefficients stay unbiased. Don't pretend it didn't happen.
- Use robust SEs for correct inference without changing your point estimates.
- Use WLS/IRLS/robust-loss estimators when variance or outliers affect prediction quality or efficiency.
- Everything that requires estimating weights or variances must be inside the training fold during CV — otherwise you just invented data leakage and should be ashamed (lovingly).
Go forth: visualize, test, pick a loss that reflects your risk attitude, and always, always keep one eye on those residuals. They tell stories — sometimes romantic tragedies, sometimes blessed comedies — and you should listen.
Version info: this builds on the OLS closed-form and gradient-descent material, and extends your cross-validation discipline into the arena of heteroscedastic modeling and robust optimization.