Exploratory Data Analysis for Predictive Modeling
EDA methods tailored to supervised tasks to reveal signal, distribution shifts, and modeling risks.
Univariate Distributions & Summary Stats — The Sexy First Date of Your Features
"If you skip univariate EDA, your model will judge you in subtle, career-limiting ways." — Probably me, but also your model
You're coming off a sprint through Data Wrangling and Feature Engineering — you tamed high-cardinality beasts with feature hashing, debated sparse vs dense like a caffeinated philosopher, and built features that actually mean something without leaking the answers. Now, before you hand your lovingly engineered features to a hungry algorithm, we need the unglamorous but essential ritual: Univariate Exploratory Data Analysis (EDA).
This is where each feature gets a solo performance. We ask: who are you? How do you behave? Are you lying to me? Will you explode my model if I standardize you? Let's find out.
Why univariate EDA matters (and why your future self will send you a thank-you meme)
- Catch garbage early: Skew, heavy tails, or a pile of zeros can torpedo assumptions behind linear models, distance metrics, and many preprocessing steps. Remember when you hashed high-cardinality categories and got a bunch of sparse columns? Those sparsity patterns deserve a univariate check too.
- Guide transformations: Log, sqrt, Box–Cox? You won’t know until you inspect the distribution.
- Robust scaling choices: Mean/SD vs median/IQR — pick your fighter based on the distribution.
- Feature importance sanity check: A constant or near-constant feature is noise. A highly skewed feature might dominate a distance-based model.
The toolkit: What to compute and why
1) Core summary statistics (the essentials)
- Count / n: How many non-missing observations? (Don’t forget missingness — it’s information)
- Mean: Average. Sensitive to outliers.
- Median: Middle value. Robust.
- Std (σ): Spread around the mean. Use cautiously when skewed.
- IQR (Q3 − Q1): Spread of the middle 50%. Robust.
- Min / Max: Show the range and potential data-entry errors.
- Percentiles (e.g., 1st, 5th, 95th, 99th): Help detect heavy tails.
- Skewness: Direction and degree of asymmetry.
- Kurtosis: Tail heaviness (not just “peakedness”).
Why both mean and median? Because if mean ≫ median, you’ve got a right tail stretching like a bad plotline.
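A tiny sketch of that mean-vs-median tell, using a made-up right-skewed sample (the numbers are illustrative, not from any real dataset):

```python
import numpy as np

# Mostly modest values plus a few extreme ones on the right
values = np.array([30, 35, 40, 45, 50, 55, 60, 250, 400, 800], dtype=float)

mean = values.mean()        # pulled upward by the tail
median = np.median(values)  # stays with the bulk of the data

print(f"mean={mean:.1f}, median={median:.1f}")  # mean=176.5, median=52.5
```

Here mean ≈ 176.5 vs median ≈ 52.5: mean ≫ median, so the right tail is doing a lot of talking.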
2) Robust measures and outlier detectors
- MAD (Median Absolute Deviation): Robust analog of standard deviation.
- IQR-based rule: Outlier if x < Q1 − 1.5·IQR or x > Q3 + 1.5·IQR.
- Robust z-score: (x − median) / MAD.
3) Visuals (your eyes are powerful validators)
- Histogram + KDE: Shape, modality, tails.
- Boxplot (with notches): Quick outlier view; medians & IQR.
- Violin plot: If you like drama and density.
- ECDF (Empirical CDF): Great for comparing distributions.
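If you want to see what an ECDF actually computes (rather than just calling a plotting helper), here's a minimal hand-rolled version; the `ecdf` function name and sample values are ours, not from any library:

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and the fraction of observations <= each x."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf([3, 1, 4, 1, 5])
# x = [1, 1, 3, 4, 5]; y = [0.2, 0.4, 0.6, 0.8, 1.0]
```

Plot `y` against `x` as a step function and you can overlay several features (or train vs validation splits) on one set of axes, which is where ECDFs shine.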
Quick Python cheatsheet (pandas + seaborn vibes)
```python
import numpy as np
import pandas as pd
import seaborn as sns

# pandas summary with extra tail percentiles
df['age'].describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])

# skewness and (excess) kurtosis
df['age'].skew(), df['age'].kurtosis()

# IQR and IQR-based outliers
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['age'] < Q1 - 1.5 * IQR) | (df['age'] > Q3 + 1.5 * IQR)]

# MAD and robust z-score (1.4826 rescales MAD to match sigma for Gaussian data)
mad = (df['age'] - df['age'].median()).abs().median()
robust_z = (df['age'] - df['age'].median()) / (1.4826 * mad)

# quick plot: histogram with KDE overlay
sns.histplot(df['age'], kde=True)
```
Examples & interpretation (read like a drama script)
Scenario A — Income is wildly right-skewed
- Mean ($120k) ≫ median ($45k), heavy right tail.
- Model implication: A linear model might be pulled toward the wealthy outliers.
- Fixes: Log transform, winsorize the top 1%, or use tree-based models that are less sensitive to monotonic transformations.
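Both fixes from Scenario A in a few lines, on a toy income array (the values and the 99th-percentile cap are illustrative choices, not a universal rule):

```python
import numpy as np

income = np.array([30_000, 42_000, 45_000, 48_000, 55_000, 1_200_000], dtype=float)

# log1p compresses the right tail and handles zeros safely
log_income = np.log1p(income)

# Winsorize: clip everything above the 99th percentile
cap = np.quantile(income, 0.99)
winsorized = np.clip(income, None, cap)
```

After either treatment, re-run the summary stats: the mean and median should sit much closer together.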
Scenario B — A predictor is almost always zero
- 95% zeros, 5% positive values (sparse)
- If you created this via feature hashing or one-hot expansion, this is expected. Still: drop near-constant features, store genuinely sparse ones in a sparse matrix format, or collapse them to a simple binary present/absent encoding.
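A quick detector for Scenario B; the `near_constant` helper and the 95% threshold are our own conventions, so tune the cutoff to your data:

```python
import pandas as pd

def near_constant(series, threshold=0.95):
    """Flag a column whose most frequent value covers >= threshold of rows."""
    top_share = series.value_counts(normalize=True, dropna=False).iloc[0]
    return top_share >= threshold

s = pd.Series([0] * 95 + [1, 2, 3, 4, 5])
near_constant(s)  # True: 95% of the column is zero
```

Run this over every column before modeling and you get a cheap shortlist of candidates to drop or re-encode.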
Scenario C — Numeric column with two peaks (bimodal)
- Could represent two distinct populations (e.g., novice vs expert users).
- Consider: splitting into two features, adding an interaction, or binning into categories.
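One way to act on Scenario C is to bin at the valley between the two modes. The session counts and the cutoff of 20 below are hypothetical; in practice you'd pick the split point by inspecting the histogram:

```python
import pandas as pd

# Hypothetical bimodal feature: session counts for novice vs expert users
sessions = pd.Series([2, 3, 4, 5, 3, 40, 45, 50, 42, 48])

# Bin at the valley between the two modes
group = pd.cut(sessions, bins=[0, 20, 100], labels=["novice", "expert"])
```

The resulting categorical can become a feature in its own right, or the basis for an interaction term.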
Rules of thumb (do not ignore these)
- If |skewness| > 1: consider a transformation.
- If kurtosis ≫ 3 (note: pandas' `kurtosis()` reports *excess* kurtosis, which is 0 for a normal distribution): inspect tail percentiles (95/99) before trusting mean/SD.
- If > 90% identical values: drop or re-encode — it won't help supervised learning.
- If missingness correlates with target: create a missing indicator — missingness can be predictive.
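The missing-indicator rule is two lines of pandas. The toy `age` column is made up; the pattern (flag first, impute second) is the point:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 31]})

# Keep the missingness as its own binary feature BEFORE imputing,
# otherwise the signal is destroyed by the fill
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())
```

If `age_missing` turns out to correlate with the target, you've converted a data-quality problem into a feature.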
Table: Choosing central tendency & spread — quick lookup
| Situation | Use central tendency | Use spread measure | Why |
|---|---|---|---|
| Symmetric, light tails | Mean | Std | Efficient for Gaussian-like data |
| Skewed | Median | IQR / MAD | Robust to outliers |
| Heavy tails | Median | MAD / Percentiles | Captures extreme behavior without distortion |
| Sparse with zeros | Median or rate | Proportion non-zero + IQR | Zero inflation needs special handling |
Practical checklist before modeling (your pre-flight inspection)
- For each numeric feature: compute count, missing%, mean, median, std, IQR, skew, kurtosis, 1/99 percentiles.
- Visualize with histogram + boxplot (or violin). Spot-check distributions across target classes.
- If extreme skew or heavy tails: try log/Box–Cox/Yeo–Johnson; re-evaluate.
- Mark near-constant features for removal or special encoding.
- For sparse features (e.g., after hashing or one-hot): consider sparse matrices and check density; aggregate rare levels.
- Create a small transformation pipeline (fit on train only) and test its effect on validation performance.
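The last checklist item might look like the sketch below, assuming scikit-learn is available; the synthetic lognormal feature stands in for any right-skewed column:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))  # right-skewed feature

X_train, X_val = train_test_split(X, test_size=0.25, random_state=0)

# Yeo-Johnson handles zero/negative values; fit on train ONLY to avoid leakage
pt = PowerTransformer(method="yeo-johnson")
X_train_t = pt.fit_transform(X_train)
X_val_t = pt.transform(X_val)  # reuse the train-fitted parameters
```

Calling `fit_transform` on the training split and plain `transform` on validation is the whole leakage-avoidance trick: the validation set never influences the learned transformation parameters.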
Final pep talk + key takeaways
Univariate EDA isn’t decorative — it’s the stabilizer that keeps your predictive models from lurching into nonsensical behavior. It's the difference between a model that generalizes and one that memorizes weird artifacts (or screams at your validation set). You've already learned to wrestle high-cardinality monsters and choose between sparse and dense representations; now look at the face of every feature and ask it the important questions:
- Are you skewed? Transformable?
- Are you a tiny but predictive minority (sparsity)?
- Do you hide missingness that’s actually a signal?
Do these checks early and document your decisions. Your future reproducible-self (and whoever inherits your notebook) will thank you — possibly with a GIF.
TL;DR: Summarize, visualize, decide. Mean vs median is not an aesthetic choice — it's a battle plan.