Exploratory Data Analysis for Predictive Modeling
EDA methods tailored to supervised tasks to reveal signal, distribution shifts, and modeling risks.
Pairwise Relationships and Correlations — The Romantic (and Sometimes Toxic) Lives of Features
"If a model had a group chat, pairwise relationships would be the gossip: who’s BFFs with whom, who’s secretly copying homework, and who should definitely be blocked."
You already know the solo acts — univariate distributions and summary stats. You also know how to wrestle messy features into submission (encoding, scaling, hashing, and avoiding leakage). Now we go to the group therapy session: how features behave in pairs. This is where you discover collusion, redundancy, interaction, and the occasional soulmate pair that lifts predictive power.
Why pairwise relationships matter for predictive modeling
- Redundancy: Two features that say the same thing in different words can bloat your model, cause multicollinearity, and make coefficients noisy. (Think: `total_spend` and `avg_spend_per_txn * n_txns`.)
- Signal discovery: A hidden relationship between A and B might explain the target better than either alone.
- Feature engineering cues: Strong nonlinear pairwise patterns scream for transformations or interaction terms.
- Model choice: Linear model vs tree-based — patterns in pair plots help decide which will work better.
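To make the redundancy point concrete, here is a minimal sketch using the `total_spend` example above, with invented column names and synthetic data: a column built as the product of two others is a perfect duplicate, and correlation flags it instantly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_txns = rng.integers(1, 50, size=500)
avg_spend = rng.gamma(shape=2.0, scale=20.0, size=500)

df = pd.DataFrame({
    'n_txns': n_txns,
    'avg_spend_per_txn': avg_spend,
    'total_spend': n_txns * avg_spend,  # pure derivative of the other two
})

# total_spend IS the product of its parents, so the correlation is exactly 1:
# a duplicate feature wearing a different column name
r = df['total_spend'].corr(df['n_txns'] * df['avg_spend_per_txn'])
print(round(r, 3))  # 1.0
```

In real data the match is rarely this exact (rounding, missing values), but anything with |r| above roughly 0.95 deserves the same suspicion.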
Visual first: Exploratory plots you should actually use
1) Scatterplot (continuous vs continuous)
- Use scatterplots with a smooth line (LOESS) and color by the target.
- Watch for heteroscedasticity, clusters, and nonlinear trends.
Code snippet (pandas / seaborn):
import pandas as pd
import seaborn as sns
# Scatter colored by the target, with a LOWESS smoother overlaid
# (lowess=True requires statsmodels to be installed)
sns.scatterplot(x='feature_a', y='feature_b', hue='target', data=df, alpha=0.6)
sns.regplot(x='feature_a', y='feature_b', data=df, scatter=False, lowess=True, color='k')
2) Pairplot / scatter matrix
- Great for small feature sets (<= 10). Shows marginal distributions and pairwise scatter.
- For bigger sets: sample rows or plot a correlation-sorted subset.
3) Heatmap of correlation matrix
- Easy global view. Beware: Pearson-only view can lie if nonlinearity exists.
4) Categorical vs continuous: boxplots / violin plots
- Boxplots give median and spread differences across categories; violin shows density.
5) Categorical vs categorical: mosaic plots / contingency tables
- Look for dependencies; augment with chi-square or Cramér’s V.
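A sketch of the Cramér's V computation, built on `scipy.stats.chi2_contingency` with hypothetical category data (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V from the contingency table of two categorical series."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Hypothetical example: plan tier perfectly determines support channel
plan = pd.Series(['basic', 'basic', 'pro', 'pro', 'ent', 'ent'] * 100)
channel = pd.Series(['email', 'email', 'chat', 'chat', 'phone', 'phone'] * 100)
print(round(cramers_v(plan, channel), 3))  # 1.0 — perfect association
```

V ranges from 0 (independent) to 1 (one variable fully determines the other), which makes it easier to compare across tables of different sizes than the raw chi-square statistic.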
Numbers speak: correlation measures and when to use them
| Relationship | Best measure(s) | When to use | Notes |
|---|---|---|---|
| Continuous — Continuous | Pearson | Linear association | Sensitive to outliers and nonlinearity |
| Continuous — Continuous | Spearman / Kendall | Monotonic but not linear | Robust to monotonic nonlinearity |
| Continuous — Binary target | Point-biserial (Pearson variant) | Quick check for classification | Equivalent to Pearson when target is 0/1 |
| Categorical — Categorical | Cramér’s V | Association strength | Based on chi-square; handles >2 categories |
| Mixed types | Mutual Information | Nonlinear / complex relations | Nonparametric; continuous features are typically discretized or estimated via nearest neighbors |
Quick tip: If Pearson says 0 but scatter looks curved, Pearson is ghosting you — check Spearman or mutual information.
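A quick synthetic demonstration of the ghosting: on a symmetric U-shape, Pearson lands near zero, and so does Spearman (the relation is not even monotonic), while mutual information clearly flags it. Data here is invented for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=2000)
y = x ** 2 + rng.normal(0, 0.1, size=2000)  # strong but symmetric U-shape

r_pearson = pearsonr(x, y)[0]
r_spearman = spearmanr(x, y)[0]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson  {r_pearson:+.3f}")   # near zero: blind to the curve
print(f"Spearman {r_spearman:+.3f}")  # also near zero: not monotonic either
print(f"MI       {mi:.3f}")           # clearly positive: relation detected
```

Moral: Spearman rescues you from monotonic nonlinearity, but for non-monotonic shapes (U, V, sinusoid) you need mutual information or, better, a plot.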
Practical workflow: from plots to actionable steps
- Start with a correlation heatmap for continuous features and the continuous target (if regression). Use Spearman if you suspect monotonic nonlinearity.
- For pairs with high absolute correlation (|r| > 0.8), inspect scatterplots and marginal distributions. Ask: are they duplicates/derivatives? (If yes: consider dropping or combining.)
- For classification targets, compute point-biserial correlations or mutual information between each feature and the target. Follow up with boxplots / violin plots.
- For categorical features, compute Cramér’s V and show contingency tables for the most dependent pairs.
- Compute VIF (Variance Inflation Factor) if you plan a linear model. VIF > 5 (or >10) signals multicollinearity.
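The VIF for feature j is 1 / (1 - R²), where R² comes from regressing feature j on all the other features. A minimal sketch using only NumPy (`statsmodels.stats.outliers_influence.variance_inflation_factor` does the same job; the demo data below is invented):

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) of regressing that column on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical demo: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)   # near-duplicate
x3 = rng.normal(size=500)                    # independent
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))  # x1 and x2 blow far past 5; x3 stays near 1
```

Note that VIF catches *multivariate* redundancy (one feature predicted by a combination of several others), which a pairwise correlation heatmap can miss entirely.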
Code snippet: correlation + mutual info (sklearn)
import numpy as np
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif
# Pearson / Spearman over the numeric columns of the DataFrame
pearson = df.corr(method='pearson', numeric_only=True)
spearman = df.corr(method='spearman', numeric_only=True)
# Mutual information between each column of the feature matrix X and the target
mi_reg = mutual_info_regression(X, y_reg)  # continuous target
mi_clf = mutual_info_classif(X, y_clf)     # discrete target
Pitfalls (the ones you'll definitely hit on your first project)
- Spurious correlations: Two features correlate because of a confounder (time, seasonality) or pure chance. Scatter + domain sense = reality check.
- Leaked features: A variable that looks predictive only because it was generated after the target (or uses target info). You already know to avoid leakage — apply the same vigilance here.
- Ignoring nonlinearity: Pearson = 0 doesn’t mean no relation. Plot it.
- Outliers driving correlation: A single influential point can inflate r. Use robust stats or visualize.
- High-cardinality categorical features: Don’t attempt an all-pairs chi-square matrix with thousands of levels. Use target-encoding or hashing (remember feature hashing from earlier) before pairwise checks, or sample levels.
From insight to feature engineering
- If features are highly correlated and semantically redundant: combine them (sum, ratio, PCA) or drop the weaker one.
- If you see a nonlinear but consistent relationship: transform (log, Box-Cox) or add polynomial / spline terms.
- If two weak features together explain the target: add an interaction term (product, difference, ratio).
- If multicollinearity hurts interpretation (coefficients jumping all over): prefer regularization (Ridge/Lasso) or dimensionality reduction.
Example: You spot weight vs BMI vs height. Instead of keeping all three, keep the most interpretable one (BMI already encodes weight relative to height) or run PCA on the trio.
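A small sketch of turning that insight into columns, using synthetic height/weight data (the column names and distributions are invented): a ratio feature that collapses a correlated pair, and an interaction term built as a product.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    'height_m': rng.normal(1.7, 0.1, size=300),
    'weight_kg': rng.normal(70, 12, size=300),
})

# Ratio feature: BMI folds the correlated pair into one interpretable number
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2

# Interaction feature: the product of two features, in case they matter jointly
df['wh_interaction'] = df['weight_kg'] * df['height_m']

print(df[['height_m', 'weight_kg', 'bmi']].corr().round(2))
```

After adding engineered columns, rerun the same pairwise checks: a new feature that correlates at 0.99 with one of its parents hasn't earned its place.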
A couple of advanced moves (because you’re getting greedy)
- Partial correlation: Measures correlation between A and B controlling for C. Useful for teasing apart direct vs mediated relationships.
- Hierarchical clustering of features (correlation distance): Cluster features by 1-|r|, cut tree to pick representative features from each cluster.
- Mutual information with permutation importance: Check whether the pairwise signal genuinely adds predictive power by seeing how shuffling a feature hurts a model.
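The feature-clustering move can be sketched with SciPy: build a 1 - |r| distance matrix, run hierarchical clustering, and cut the tree so that redundant features land in the same cluster. The data below is synthetic, constructed so two pairs of features are near-duplicates and one is a loner.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
base = rng.normal(size=(400, 2))
# Two redundant feature pairs plus one independent loner (synthetic)
X = np.column_stack([
    base[:, 0], base[:, 0] + rng.normal(scale=0.1, size=400),  # pair A
    base[:, 1], base[:, 1] + rng.normal(scale=0.1, size=400),  # pair B
    rng.normal(size=400),                                      # loner
])

corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)          # correlation distance: 1 - |r|
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method='average')
labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)  # features 0-1 share a label, 2-3 share another, 4 stands alone
```

From each cluster you then keep one representative (the most interpretable, or the one with the strongest target relationship) and drop the rest.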
Closing: the emotional arc of pairwise EDA
- Start curious. Plot generously. Conclude cautiously.
- Replace fear of correlation with healthy skepticism: visualize everything, compute the right statistic, and always ask if the relation makes domain sense.
Key takeaways:
- Use the right correlation metric for the data types and suspected shape of relationship.
- Visual inspection + summary statistics beats blind thresholds.
- Pairwise analysis guides feature pruning, combination, transformation, and model choice — but it’s only the beginning (higher-order interactions exist).
Final commandment: Do not let a shiny high correlation seduce you into adding leaked or redundant features. Resist the temptation, and your stakeholders will call you a wizard instead of a magician who pulled a rabbit out of the target.