Exploratory Data Analysis for Predictive Modeling
EDA methods tailored to supervised tasks to reveal signal, distribution shifts, and modeling risks.
Visualization for Regression Targets
Visualization for Regression Targets — The Fun, Scientific Part of Staring at Charts
You're past the low-level reconnaissance: you've looked at univariate distributions (Position 1), peeked at pairwise correlations (Position 2), and wrestled with messy data during Data Wrangling and Feature Engineering. Now it's showtime: visualize how predictors actually relate to your continuous target so your model doesn't learn nonsense.
Why this chapter matters (quick reminder)
You're not just making pretty plots. Good visual exploration:
- Reveals nonlinearities you should model (or transform).
- Exposes heteroscedasticity (variance changing with X) and outliers that distort MSE-based fits.
- Suggests useful feature transformations or binnings without leaking the test set.
Think of this as the tactical reconnaissance before you deploy predictive artillery.
Core plots and when to use them
Below: the high-utility toolkit for regression-target visualization, with quick notes on the what, why, and how.
1) Scatter plot + smoother (LOESS / LOWESS)
- Use when predictor is continuous. Shows shape: linear, quadratic, plateau, threshold.
- Plot dogma: raw points (alpha < 0.2 for big datasets) + smooth trend line + linear fit for comparison.
Code sketch (seaborn; `lowess=True` requires statsmodels installed):
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x='sqft', y='price', data=df, alpha=0.3)  # raw points, faded to fight overplotting
sns.regplot(x='sqft', y='price', data=df, lowess=True, scatter=False, color='red')  # LOWESS trend
plt.show()
2) Hexbin / 2D density for overplotting
- For huge datasets: scatter becomes soup. Use hexbin (matplotlib) or sns.kdeplot(… , fill=True).
- Colors convey density; still overlay smoothing if useful.
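A minimal hexbin sketch on synthetic data (the `sqft`/`price` columns and the size of 50,000 points are illustrative assumptions, not from a real dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed for a reproducible synthetic example
sqft = rng.uniform(500, 4000, 50_000)
price = 100 * sqft + rng.normal(0, 50_000, sqft.size)  # noisy linear relation

fig, ax = plt.subplots()
hb = ax.hexbin(sqft, price, gridsize=40, cmap="viridis", mincnt=1)  # hide empty hexes
fig.colorbar(hb, ax=ax, label="count per hex")
ax.set_xlabel("sqft")
ax.set_ylabel("price")
```

With `mincnt=1`, every plotted point lands in exactly one displayed hex, so the color scale reads directly as point density.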
3) Binned-aggregates (bin numeric X -> mean Y + CI)
- When you want an interpretable summary: bin X (quantiles or equal-width), plot mean(Y) with error bars or boxplots.
- Great for revealing monotonic trends hidden by noise.
Code sketch (pandas):
df['x_bin'] = pd.qcut(df['x'], 10, duplicates='drop')  # decile bins
agg = df.groupby('x_bin', observed=True)['y'].agg(['mean', 'std', 'count'])
agg['se'] = agg['std'] / np.sqrt(agg['count'])  # standard error of each bin mean
4) Residual vs Fitted (early model-based EDA)
- Fit a quick simple model (linear, tree) and plot residuals vs fitted values.
- Use to catch heteroscedasticity, nonlinearity, and clusters of bad fits.
Code sketch (model-based):
fitted = model.predict(X)  # any quick fit, e.g. sklearn LinearRegression or a shallow tree
resid = y - fitted
sns.scatterplot(x=fitted, y=resid, alpha=0.3)
plt.axhline(0, color='k', linestyle='--')  # residuals should scatter evenly around zero
Note: model-based plots are allowed in EDA — but be explicit about what you fit and why.
5) Categorical X vs Continuous Y: boxplots, violin + swarm
- For categorical predictors: use boxplots to see medians and spread; violins to see distribution shape; add swarm/jittered points when not too many observations.
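A short sketch of boxplot-plus-points for a categorical predictor, on synthetic data (the `neighborhood`/`price` names are hypothetical):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)  # reproducible synthetic example
df = pd.DataFrame({
    "neighborhood": rng.choice(["A", "B", "C"], 300),
    "price": rng.lognormal(mean=12, sigma=0.4, size=300),
})

ax = sns.boxplot(x="neighborhood", y="price", data=df)
# overlay jittered raw points -- only sensible when counts per category are modest
sns.stripplot(x="neighborhood", y="price", data=df, color="k", alpha=0.3, ax=ax)
counts = df["neighborhood"].value_counts()  # useful for annotating n per category
```

Annotating `counts` on the plot guards against over-reading a box drawn from a handful of observations.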
6) Interaction plots / Facets
- Facet by a categorical variable to visualize conditional relationships (e.g., sqft vs price by neighborhood).
- Use sns.FacetGrid to make clean multi-panel comparisons.
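A faceting sketch with `sns.FacetGrid`, again on synthetic data with hypothetical column names; the per-neighborhood slope difference is built in so the panels have something to show:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, n),
    "neighborhood": rng.choice(["A", "B", "C"], n),
})
# neighborhood "A" gets a steeper price/sqft slope -> a visible interaction
df["price"] = np.where(df["neighborhood"] == "A", 200, 120) * df["sqft"] \
    + rng.normal(0, 40_000, n)

g = sns.FacetGrid(df, col="neighborhood", col_order=["A", "B", "C"], sharey=True)
g.map_dataframe(sns.scatterplot, x="sqft", y="price", alpha=0.3)
g.set_axis_labels("sqft", "price")
```

Shared y-axes (`sharey=True`) keep the panels directly comparable, which is the whole point of faceting for interactions.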
7) Transformations: before-and-after plots
- If target or predictor is skewed, visualize relationship before and after log / Box–Cox / Yeo–Johnson transforms.
- Plot both panels side-by-side: often linearity improves and variance stabilizes.
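A before-and-after sketch for a skewed target, using a log transform on synthetic lognormal prices (parameters chosen for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=12, sigma=0.5, size=5_000))  # right-skewed

skew_raw = price.skew()           # sample skewness before transform
skew_log = np.log(price).skew()   # and after log -- roughly symmetric now

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(price, bins=50)
ax1.set_title(f"raw price (skew={skew_raw:.2f})")
ax2.hist(np.log(price), bins=50)
ax2.set_title(f"log(price) (skew={skew_log:.2f})")
```

Printing the skewness in the panel titles makes the "did the transform help?" question answerable at a glance.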
Practical recipe: A step-by-step checklist (what to plot, in order)
- Reconfirm target distribution from Position 1 (histogram, skewness). If heavy skew, consider log transform and re-plot.
- For each continuous predictor X_i:
- Scatter + LOESS vs target (subsample if >100k rows).
- Hexbin/2D density if overplot.
- Binned mean ± SE to get a smoothed, interpretable signal.
- For categorical predictors C_j:
- Boxplot and violin of target by category. Add count annotation.
- If many categories, sort by median target and consider grouping rare levels into "other".
- Quick linear model fit per predictor (univariate) → store slope, R², and residual plot. This helps prioritize features.
- Multi-panel faceting for plausible interactions (e.g., sqft * neighborhood).
- Check heteroscedasticity: residuals vs fitted (from a quick multivariate model).
- Inspect high-leverage/outlier points (scatter + label suspicious ids). Decide: fix, transform, or keep.
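The univariate-fit-per-predictor step above can be sketched as follows; the data and column names are synthetic, and one deliberately uninformative feature (`age`) is included so the ranking has something to separate:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, n),
    "age": rng.uniform(0, 100, n),  # carries no signal by construction
})
df["price"] = 150 * df["sqft"] + rng.normal(0, 30_000, n)

rows = []
for col in ["sqft", "age"]:
    slope, intercept = np.polyfit(df[col], df["price"], 1)  # univariate linear fit
    r2 = np.corrcoef(df[col], df["price"])[0, 1] ** 2       # univariate R^2
    rows.append({"feature": col, "slope": slope, "r2": r2})

ranking = pd.DataFrame(rows).sort_values("r2", ascending=False)
```

A table like `ranking` is a crude but fast prioritizer: features with near-zero univariate R² still deserve a residual-plot look (they may matter only in interactions), but the top of the list tells you where to spend plotting time first.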
Table: Which plot to pick (cheat sheet)
| Goal | Predictor type | Plot(s) |
|---|---|---|
| See shape of relationship | continuous | Scatter + LOESS; Hexbin if dense |
| Quick summary of trend | continuous | Binned mean ± CI |
| Categorical effect | categorical | Boxplot / violin + swarm |
| Heteroscedasticity / nonlinearity | any (model-based) | Residual vs fitted |
| Interactions | mix | FacetGrid / interaction plots |
Little heuristics and gotchas (because data punishes the unwise)
- Use alpha and point size to combat overplotting; sample for exploratory speed but keep a reproducible sample seed.
- Label your axes and include units. "x" and "y" mean nothing to the humans reading your plots.
- Be conservative with transformations — always back-transform interpretation for stakeholders.
- When you bin, test different bin widths. Binning can create false plateau illusions.
- Watch leakage: do not create features that use future knowledge of the target. Feature engineering should be reproducible in deployment (see previous Data Wrangling notes).
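The reproducible-sample heuristic above is one line of pandas; a fixed `random_state` guarantees you explore the same subsample every run:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000_000)})  # stand-in for a large dataset

SEED = 42  # any fixed value; the point is that it is pinned
sub = df.sample(n=100_000, random_state=SEED)
sub_again = df.sample(n=100_000, random_state=SEED)  # identical subsample
```

Pinning the seed means a colleague (or future you) sees exactly the plots you saw, instead of a different random slice.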
Example — Quick walkthrough (housing prices)
- You've already seen price distribution (skewed right). You log-transform price.
- Plot sqft vs log(price): LOESS shows diminishing returns after ~3000 sqft.
- Facet by neighborhood: the slope differs — interaction suspected.
- Bin sqft into deciles and plot mean log(price) ± SE: nicer for a report.
- Fit a quick tree; residual vs fitted shows pockets of large residuals in high-priced neighborhoods → maybe missing a prestige variable.
Final takeaways (bite-sized and motivational)
- Visualization is hypothesis generation: use plots to suggest transformations, interactions, and missing features — not just to confirm your biases.
- Combine raw-point plots with summarized plots (bins, means) and model-based checks (residuals) for a 3D perspective.
- Always keep deployability in mind: any transformation you plan to use must be reproducible and not leak future info.
Visualization doesn't make models for you, but good visual work saves you from building models that lie.
Next steps:
- After these visuals, you should be ready to: build candidate features (informed by observed shapes), select models that can capture the observed nonlinearity (splines, trees, GAMs), and design cross-validation that respects the structures you discovered (groups, time, neighborhoods).
"Go make one scatterplot that changes your model’s life. Preferably two."