Exploratory Data Analysis for Predictive Modeling
EDA methods tailored to supervised tasks to reveal signal, distribution shifts, and modeling risks.
Detecting Nonlinearity and Heteroscedasticity
Detecting Nonlinearity and Heteroscedasticity — the plot twist your model didn't see coming
"If your residual plot looks like a hairball, your model is lying to you." — probably me, loudly, in a data lab
You already explored the target distribution visually and checked for class imbalance and target weirdness (see: Visualization for Regression Targets; Visualization for Class Imbalance). You also did the sensible stuff in Data Wrangling and Feature Engineering: cleaned, encoded, scaled, and guarded against leakage. Nice. Now it's time for the emotional heart-to-heart with your model: ask whether the relationship you're modeling is linear enough to justify a plain old linear model, and whether the model's errors behave themselves.
Why this matters
- Nonlinearity means your predictor and target have a relationship that isn't a straight line. If your model pretends it is linear, you'll get biased predictions. That feeling when your friend says they 'only drink water' and then chugs an espresso shot — same betrayal.
- Heteroscedasticity means the variance of errors changes across levels of a predictor. If you ignore it, your uncertainty estimates and hypothesis tests will be wrong; confidence intervals will be lying little confidence liars.
Quick checklist (what we will do)
- Visual checks: residual vs fitted, grouped variance plots, scale-location plot.
- Formal tests: Breusch-Pagan, White, Goldfeld-Quandt.
- Remedial actions: transforms, polynomials/splines, GAMs, weighted methods, heteroscedasticity-robust inference.
- Special note: classification models have their own nonlinearity issues (link function, calibration).
Visual detective work (start here)
Why start visually? Because numbers lie and plots tell the truth. Visual checks are quick and often decisive.
1) Residuals vs Fitted
- Plot: residuals on y-axis, fitted values on x-axis.
- What to look for: a funnel shape (widening or narrowing) indicates heteroscedasticity. A systematic curve pattern indicates nonlinearity.
Code sketch (assumes you already have a fitted regression `model` plus arrays `X` and `y`):

```python
import matplotlib.pyplot as plt

fitted = model.predict(X)          # predictions from your fitted model
resid = y - fitted                 # raw residuals
plt.scatter(fitted, resid, alpha=0.6)
plt.axhline(0, color='k', linestyle='--')  # reference line at zero
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```
Ask: Is the cloud centered around zero and uniform? If not, you found a problem.
2) Scale-Location (Spread) Plot
- Plot sqrt(|residuals|) against fitted values.
- This makes heteroscedasticity patterns more visible.
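A minimal sketch of the scale-location idea, on synthetic data where the noise deliberately grows with the predictor (the data-generating process here is invented for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = 2 * x + rng.normal(scale=0.5 * x, size=300)  # noise grows with x
fitted = 2 * x                                   # stand-in for model.predict(X)
resid = y - fitted
spread = np.sqrt(np.abs(resid))                  # sqrt(|residuals|)

plt.scatter(fitted, spread, alpha=0.6)
plt.xlabel("Fitted values")
plt.ylabel("sqrt(|residuals|)")
# call plt.show() in an interactive session
```

An upward drift in this cloud is the scale-location signature of heteroscedasticity.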
3) Residuals vs Predictor
Plot residuals against each important predictor. If you see curvature, your linear terms are missing the beat.
4) Binned variance plot
Group data by quantiles of a predictor, compute variance of residuals per bin, and plot. This clarifies trends when scatter is noisy.
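A quick sketch of the binned-variance idea with pandas, again on made-up heteroscedastic residuals (the variance function here is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 1000)
resid = rng.normal(scale=0.3 * x, size=1000)  # residual spread grows with x

df = pd.DataFrame({"x": x, "resid": resid})
df["bin"] = pd.qcut(df["x"], q=10)            # deciles of the predictor
var_by_bin = df.groupby("bin", observed=True)["resid"].var()
print(var_by_bin)
```

A roughly flat sequence of per-bin variances is what homoscedasticity looks like; a steady climb (as here) is the smoking gun.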
Formal statistical tests (they won't replace plots)
- Breusch-Pagan test: tests whether residual variance can be explained by predictors. Good general-purpose test.
- White test: allows for nonlinearity in variance, tests more general specifications.
- Goldfeld-Quandt test: compares variance across two subsamples; useful if you suspect variance increases with a predictor.
Remember: tests can be sensitive to non-normality of errors and outliers. Use them as complements to plots, not scripture.
Detecting nonlinearity more formally
- Partial dependence plots (PDPs) and individual conditional expectation (ICE) plots: great for black-box models, but useful even with linear models to check whether the relationship actually looks straight.
- Component plus residual (partial residual) plots: reveal whether adding polynomial terms might help.
- Correlation + scatter + lowess smoother: fit a lowess curve; if it bulges, you need nonlinear features.
Quick code idea for lowess (note that `lowess` returns (x, y) pairs sorted by x, so plot both columns of its output rather than your original `x`):

```python
from statsmodels.nonparametric.smoothers_lowess import lowess
import matplotlib.pyplot as plt

smoothed = lowess(y, x, frac=0.3)          # columns: sorted x, smoothed y
plt.plot(smoothed[:, 0], smoothed[:, 1])   # the lowess curve
```
Remedies and when to use them
Table: Problem -> Quick Fix -> When it's best
| Problem | Quick Fix | When to use |
|---|---|---|
| Nonlinearity (mild) | Add polynomial terms (x^2, x^3) | When shape is simple curve, few features |
| Nonlinearity (complex) | Splines, regression trees, GAMs | When curve is wiggly or you want interpretable smoothness |
| Heteroscedasticity | Transform target (log, Box-Cox) or Weighted Least Squares | When variance grows with level; transform can stabilize |
| Heteroscedasticity (inference) | Robust SEs (HC0-HC3), bootstrap | When you only need correct CIs/p-values |
Notes:
- Transforming the target can fix both nonlinearity and heteroscedasticity at once (log often tames multiplicative error patterns). But remember interpretability changes.
- Weighted Least Squares gives more weight to observations with lower variance; requires estimating a weight function (often via modeling residual variance).
- Generalized Additive Models (GAMs) are elegant: they model nonlinearity with smooth functions and can also model variance if extended (e.g., mgcv in R can fit location-scale models).
Classification models: the twist
You still care about nonlinearity: if the logit link doesn't fit, predicted probabilities can be systematically off (miscalibrated). Diagnostics:
- Calibration plot: bin predicted probabilities and compare observed frequency.
- Residual-like checks: deviance residuals vs predictors.
Remedies: add nonlinear terms, use tree-based models, or recalibrate probabilities (isotonic regression, Platt scaling).
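A calibration plot is a few lines with scikit-learn. A sketch on a synthetic classification task (dataset and model choice are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Bin predicted probabilities and compare with observed positive frequency.
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```

Points hugging the diagonal mean the model's probabilities can be taken at face value; systematic bowing is your cue for isotonic regression or Platt scaling.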
Practical workflow (do this in order)
- Fit your baseline model (after proper splitting/cross-validation!).
- Plot residuals vs fitted and residuals vs key predictors. Ask: curve? funnel? both?
- Fit a lowess smoother or partial residual plot to confirm nonlinearity.
- Run Breusch-Pagan to test heteroscedasticity if visual signs exist.
- Try a simple transform (log or Box-Cox). Re-evaluate.
- If transform insufficient, try polynomial/spline or a flexible model like GAM or tree ensembles.
- For inference, switch to robust SEs or WLS as needed.
Closing mic drop
Nonlinearity and heteroscedasticity are not bugs in the data, they're features of reality refusing to be simplified. Your job is to listen: plot, test, and adapt. Start with visual empathy, then apply formal tools, and only then choose a remedy that balances accuracy and interpretability.
Key takeaways:
- Always look at residuals; they will whisper the truth long before your metrics scream it.
- Use transforms, splines, GAMs, or robust methods depending on severity and your goals.
- For classification, pay special attention to calibration and link function adequacy.
Final thought: models are like friends — they work best when you accept their quirks and tailor your expectations. Fit the relationship, not the ego.