Exploratory Data Analysis for Predictive Modeling
EDA methods tailored to supervised tasks to reveal signal, distribution shifts, and modeling risks.
Visualization for Class Imbalance — The Little Class That Could (and Often Can't)
"If your positive class is rarer than a unicorn sighting, you're not doing EDA — you're performing archeology."
You're arriving at the party having already seen how to visualize regression targets and pairwise relationships. Great — you've got context. You also just finished wrangling and feature-engineering your dataset, so features are clean, encoded, and not leaking like a sieve. Now the last obnoxious guest: class imbalance. This chapter teaches you how to visualize it so you can decide whether to resample, reweight, engineer new features, or simply be wiser about metrics.
Why this matters (quick refresher)
- Class imbalance biases learning algorithms, evaluation metrics, and even your intuition.
- Visualizations help you see how imbalanced things are, how imbalance interacts with features, and whether minority-class patterns exist or are just noise.
- Builds on pairwise relationships: instead of looking at correlations across the whole dataset, visualize them by class.
Ask yourself: "Does the minority class live in a different part of feature space, or is it completely mixed with the majority?" The answer determines strategy.
Core plots and what they tell you
1) Simple class-count bar chart (start here)
Why: the most honest picture of imbalance. Use absolute counts and percentages together.
- What to show: bars for absolute counts, with text labels overlaid giving each class's percentage.
- Pitfalls: log scales hide absolute scarcity — show both linear and log if counts vary hugely.
Code snippet (seaborn/matplotlib):
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes a DataFrame `df` with a 'target' column
ax = sns.countplot(x='target', data=df)
for p in ax.patches:
    # Label each bar with its absolute count and its share of the dataset
    ax.annotate(f"{int(p.get_height())}\n({p.get_height() / len(df):.1%})",
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom')
ax.set_title('Class counts (absolute and percent)')
2) Class vs. feature distribution (numerical)
- Plot KDEs, histograms, or violin plots by class to see whether the minority class has a different distribution.
- Use transparency and same x-axis limits for fair comparison.
Why: If minority-class density overlaps heavily with majority, resampling alone may not help much — you may need feature engineering.
3) Class vs. feature distribution (categorical)
- Use stacked bar charts showing counts or proportions of categories per class.
- Mosaic plots are great when you want to see joint proportions at a glance.
Important: use proportions (within-class) and absolute counts side-by-side — a rare class may have a strong proportion in a category but still be few in absolute terms.
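The counts-plus-proportions advice can be sketched with `pd.crosstab` on synthetic data (the `channel` column name is hypothetical). One table holds absolute counts, the other within-category proportions via `normalize="index"`:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: 'branch' has a high positive *rate* but few rows
df = pd.DataFrame({
    "channel": ["web"] * 600 + ["phone"] * 300 + ["branch"] * 100,
    "target":  [0] * 580 + [1] * 20      # web: 20/600 positive
             + [0] * 290 + [1] * 10      # phone: 10/300 positive
             + [0] * 70  + [1] * 30,     # branch: 30/100 positive
})

counts = pd.crosstab(df["channel"], df["target"])                     # absolute counts
props = pd.crosstab(df["channel"], df["target"], normalize="index")   # within-category shares

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", stacked=True, ax=axes[0], title="Counts per category")
props.plot(kind="bar", stacked=True, ax=axes[1], title="Class proportion per category")
```

Here `branch` is 30% positive, the strongest rate by far, yet contributes only 30 positive rows in absolute terms; seeing both panels side by side is what keeps that distinction honest.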
4) Pairwise plots with stratified sampling
- Pairplots colored by class are excellent, but the number of panels grows quadratically with the number of features, and they can get overwhelmed by many points.
- Strategy: subsample majority class to match minority count for visibility, or use alpha and hex/bin plots.
- This is where you build on the previous Pairwise Relationships topic: look at pairwise separation conditioned on class.
5) Dimensionality reduction visualizations (PCA / t-SNE / UMAP)
- Run PCA/t-SNE/UMAP on features and color points by class.
- Use this to explore separability: are minority points clustered or randomly sprinkled?
- Caveats: these techniques can distort distances and invent apparent structure — don’t over-interpret cluster shapes or gaps.
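A minimal PCA sketch on synthetic data (all array names and the 2-feature shift are made up for illustration). Scaling before PCA matters so that one feature's units don't dominate the components:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical 5-feature data; the minority class is shifted in two features
X_maj = rng.normal(0, 1, size=(950, 5))
X_min = rng.normal(0, 1, size=(50, 5)) + np.array([3, 3, 0, 0, 0])
X = np.vstack([X_maj, X_min])
y = np.array([0] * 950 + [1] * 50)

# Standardize, then project to the first two principal components
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(X2[y == 0, 0], X2[y == 0, 1], alpha=0.3, label="majority")
plt.scatter(X2[y == 1, 0], X2[y == 1, 1], alpha=0.9, label="minority")
plt.legend()
plt.title("PCA projection colored by class")
```

If the orange minority points form their own blob, you have evidence of separability; if they're sprinkled uniformly through the blue cloud, no amount of resampling will conjure a boundary.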
6) Feature importance / class-conditional feature ranking
- Train a quick tree-based model (with cross-validation) and plot feature importances, evaluating with a minority-sensitive metric such as average precision.
- This is borderline modeling, but it’s useful as a diagnostic during EDA.
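A sketch of this diagnostic using scikit-learn on a synthetic imbalanced problem (all parameters here are illustrative, not a recipe): fit a class-weighted random forest, check cross-validated average precision so the score reflects the minority class, then rank features by importance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic problem with ~10% positives and 3 informative features
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
# Average precision summarizes precision-recall, which is what matters
# for the rare class (accuracy would look great even for a useless model)
cv_ap = cross_val_score(clf, X, y, cv=5, scoring="average_precision").mean()

clf.fit(X, y)
ranking = np.argsort(clf.feature_importances_)[::-1]  # most important first
print(f"CV average precision: {cv_ap:.3f}")
print("Features ranked by importance:", ranking)
```

Treat the importances as a pointer for which features deserve the per-class distribution plots above, not as a final model.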
7) Correlation and contingency heatmaps per class
- Compute correlation matrices for majority and minority separately and visualize differences.
- For categorical pairs, use Cramér’s V heatmaps by class to see structural differences.
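The per-class correlation comparison can be sketched as a single difference heatmap on synthetic data (column names `f1`/`f2` are hypothetical). Here two features are constructed to correlate only inside the minority class:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n_maj, n_min = 900, 100
# f1 and f2 are independent in the majority, tightly linked in the minority
f1_min = rng.normal(0, 1, n_min)
df = pd.DataFrame({
    "f1": np.concatenate([rng.normal(0, 1, n_maj), f1_min]),
    "f2": np.concatenate([rng.normal(0, 1, n_maj),
                          f1_min + rng.normal(0, 0.3, n_min)]),
    "target": [0] * n_maj + [1] * n_min,
})

# Minority correlation matrix minus majority correlation matrix
corr_diff = (df[df.target == 1].drop(columns="target").corr()
             - df[df.target == 0].drop(columns="target").corr())
sns.heatmap(corr_diff, annot=True, cmap="coolwarm", center=0)
plt.title("Minority minus majority correlation")
```

A strongly colored off-diagonal cell is a flag that the relationship between those two features is class-dependent — often a hint for an interaction feature.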
Practical examples & metaphors
Imagine the minority class is a hidden speakeasy in a city: stacked bar charts tell you which neighborhoods (categories) it prefers; KDEs tell you whether it sneaks into similar price brackets as the majority; PCA shows whether it exists in one small cluster or is a bunch of people scattered across the metropolis.
If the minority is clustered in PCA space, synthetic oversampling (SMOTE variants) might work. If it's scattered and indistinguishable, oversampling could make your model hallucinate.
Visualizing sampling strategies (Before / After)
Always visualize the effect of resampling (undersample/oversample/SMOTE) on class counts and feature distributions.
- Plot counts before and after.
- Overlay feature distributions before and after — does SMOTE create unrealistic synthetic examples? If a new synthetic density looks unnaturally smooth or extends into feature regions with zero real points, be suspicious.
Code sketch:
# Assumes `df` (original) and `df_resampled` (after SMOTE/oversampling)
ax = sns.kdeplot(data=df, x='feature1', hue='target', common_norm=False)
# Dashed curves overlay the post-resampling densities on the same axes
sns.kdeplot(data=df_resampled, x='feature1', hue='target',
            common_norm=False, linestyle='--', ax=ax)
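For a self-contained before/after view, here is a sketch using naive random oversampling with plain pandas (synthetic data; SMOTE itself lives in the separate `imbalanced-learn` package, not shown here):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
# Hypothetical imbalanced frame: 950 majority rows, 50 minority rows
df = pd.DataFrame({
    "feature1": np.concatenate([rng.normal(0, 1, 950), rng.normal(2, 1, 50)]),
    "target": [0] * 950 + [1] * 50,
})

# Naive random oversampling: sample each class (with replacement)
# up to the majority count
n_max = df["target"].value_counts().max()
df_resampled = df.groupby("target").sample(n=n_max, replace=True,
                                           random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["target"].value_counts().plot(kind="bar", ax=axes[0], title="Before")
df_resampled["target"].value_counts().plot(kind="bar", ax=axes[1],
                                           title="After")
```

Random oversampling only duplicates existing minority rows, so its feature distributions stay inside real support; SMOTE interpolates new points, which is exactly why the overlay check above is worth doing.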
Pitfalls & how to avoid lying with plots
- Never plot percentages alone when absolute scale matters. A 1% minority sounds negligible, yet in a 10-million-row dataset that's still 100,000 examples; conversely, 10% of a 500-row dataset is only 50 examples. Always show absolute counts alongside percentages.
- Avoid plotting tiny minority class with same alpha/marker size as majority — it disappears.
- Log scale can be useful but show linear version too so stakeholders understand absolute impact.
- Be careful with overplotting. Hexbin/contour or subsampling for pairplots keeps visual noise down.
Quick decision map (visualization → action)
- Minority concentrated in distinct cluster(s): consider oversampling (SMOTE variants), class-weighted loss, or targeted feature engineering for that cluster.
- Minority overlaps heavily with majority: focus on richer features, better feature transforms, or domain-specific signals rather than naive resampling.
- Minority concentrated in certain categories: create interaction features (category x numeric) or target-specific encodings.
- Resampling creates unrealistic feature support: don’t use the synthetic data blindly — consider cost-sensitive learning instead.
Checklist: What to plot during your EDA for imbalance
- Class count bar chart (absolute + percent) — always.
- KDE/violin/boxplot of top numeric features by class.
- Stacked bar / mosaic plots for important categorical features.
- Pairwise scatter (subsampled) or hex/bin plots colored by class.
- PCA / UMAP / t-SNE colored by class (with caution).
- Before/After plots for any sampling strategy you might try.
- Correlation/contingency differences between classes.
Closing mic-drop
If you only remember two things:
- Visualize counts in absolute terms, then explore conditional feature distributions by class.
- Use dimensionality reduction and pairwise plots to answer this simple but decisive question: "Do the minority cases live in a different place in feature space, or are they just fewer noisy copies of the majority?"
Do that, and you’ll skip a ton of bad modeling decisions. Go forth, plot ferociously, and never let an imbalanced dataset surprise you at the validation step.