Dimensionality Reduction and Feature Selection
Reduce redundancy and highlight signal with supervised and unsupervised techniques.
Correlation-Based Feature Pruning — The Lazy-but-Effective Feature Diet
"If two features are whispering the same secret, one of them can go nap and the model won't notice. But watch out — sometimes the whisper hides a plot twist."
You already met mutual information (position 4) — the scrappy detective that sniffed out non-linear signal between features and the target. And you remember embedded regularization (position 3) — the gladiator that punished irrelevant coefficients during training. Now meet the middle sibling: Correlation-Based Feature Pruning. It's fast, interpretable, and a little blunt. Perfect when your dataset is messy from our previous discussion on handling noise, drift, and imbalance.
Why correlation pruning matters (and when to use it)
- You have lots of features and limited compute.
- You're fighting multicollinearity that wrecks coefficient interpretability (hello, linear models!).
- You want a quick, deterministic preprocessing step before mutual-information checks or regularized modeling.
Correlation pruning is not a magic cure. It's a pragmatic filter: cheap, explainable, and often surprisingly effective at cleaning obvious redundancy. But if relationships are non-linear or subtle, use it as a first pass, not the final judge.
The basic idea (duh)
- Compute pairwise correlations among features.
- When two features are strongly correlated, prune one (or combine them).
- Optionally, also check each retained feature's correlation with the target and drop the target-irrelevant ones.
Important correlation flavors
- Pearson correlation: linear relationships between continuous variables. Use for continuous-continuous pairs.
- Spearman correlation: rank-based; catches monotonic but non-linear relationships.
- Point-biserial / phi coefficient: for continuous vs binary, and binary vs binary respectively.
Choose the measurement to match variable types — mixing them blindly is a common rookie sin.
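To make the Pearson-vs-Spearman distinction concrete, here's a minimal sketch on synthetic data with a monotonic but non-linear relationship (variable names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = x ** 3                          # monotonic but strongly non-linear in x

r_pearson, _ = pearsonr(x, y)       # measures linear association only
rho_spearman, _ = spearmanr(x, y)   # rank-based: sees the monotonic tie
# rho_spearman is 1.0 here; r_pearson lands noticeably lower
```

A pure Pearson threshold would under-report this pair's redundancy; Spearman flags it as perfectly redundant.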
A crisp step-by-step algorithm (what to actually run)
- Prepare: Impute missing values, encode categoricals sensibly (target encoding can leak — don't!), and scale if you care about distances.
- Compute correlation matrix using the appropriate methods for the variable types. For mixed data, consider a hybrid approach or rank correlations.
- Threshold: Choose a correlation cutoff (e.g., |r| > 0.8). Pairs above the cutoff are candidates for pruning.
- Choose which to drop using heuristics: lower mutual information with the target, higher missing rate, worse predictive power in univariate models, or lower domain importance.
- Validate: Train a simple model before/after pruning. Monitor performance, stability, and coefficient changes.
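The validate step above can be as simple as comparing holdout R² before and after pruning. A minimal numpy-only sketch (`r2_holdout` is a hypothetical helper; a real pipeline would use cross-validation and your actual model):

```python
import numpy as np

def r2_holdout(X, y, test_frac=0.25, seed=0):
    """Fit ordinary least squares on a train split, return R^2 on the holdout."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(len(y) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    Xb = np.column_stack([X, np.ones(len(y))])        # add an intercept column
    beta, *_ = np.linalg.lstsq(Xb[tr], y[tr], rcond=None)
    pred = Xb[te] @ beta
    ss_res = ((y[te] - pred) ** 2).sum()
    ss_tot = ((y[te] - y[te].mean()) ** 2).sum()
    return 1 - ss_res / ss_tot
```

Compare the score on the full feature matrix against the pruned one; a large drop means the pruned feature carried unique signal.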
Heuristics for picking who stays
- Keep the feature with higher mutual information with the target (you already have that tool — use it!).
- Prefer the feature with lower missingness.
- Prefer the feature with lower measurement noise (from your analytics on data quality—remember Handling Real-World Data Issues).
- Prefer a feature that is easier to explain to stakeholders.
Tip: If two features are equally good, prefer the one you can explain to your product manager. Fewer follow-up emails.
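These heuristics are easy to encode. A small illustrative helper (`pick_keeper` is a hypothetical name; the MI scores are assumed to be computed elsewhere, e.g. with scikit-learn's `mutual_info_regression`):

```python
import pandas as pd

def pick_keeper(df, a, b, mi_scores, mi_tol=0.01):
    """Choose which of two highly correlated features to keep.

    mi_scores maps feature name -> mutual information with the target,
    computed upstream on the training split.
    """
    # 1) Higher MI with the target wins, if the gap is meaningful
    if abs(mi_scores[a] - mi_scores[b]) > mi_tol:
        return a if mi_scores[a] > mi_scores[b] else b
    # 2) Otherwise tie-break on missingness: keep the more complete column
    return a if df[a].isna().mean() <= df[b].isna().mean() else b
```

Extend the tie-break chain with noise estimates or domain importance as needed.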
Advanced twist: clustering correlated features
Instead of greedy pairwise dropping, build a correlation distance matrix (1 - |r|), run hierarchical clustering, and cut the dendrogram at a desired height. This groups features into clusters of redundancy; then pick a representative from each cluster (e.g., highest mutual info, lowest missingness).
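A minimal sketch of this clustering approach with scipy (`cluster_features` is a hypothetical helper; assumes a numeric DataFrame):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_features(X, cut_height=0.3):
    """Map each column of a numeric DataFrame to a redundancy-cluster id.

    Features whose correlation distance (1 - |rho|) stays below cut_height
    land in the same cluster; pick one representative per cluster downstream.
    """
    corr = X.corr(method='spearman').abs()
    dist = (1.0 - corr).to_numpy()
    np.fill_diagonal(dist, 0.0)            # guard against float noise
    Z = linkage(squareform(dist, checks=False), method='average')
    labels = fcluster(Z, t=cut_height, criterion='distance')
    return dict(zip(X.columns, labels))
```

The cut height is the hyperparameter the comparison table below mentions: lower cuts make smaller, tighter clusters.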
Table: Quick method comparison
| Method | Pros | Cons |
|---|---|---|
| Pairwise thresholding | Fast, simple | Sensitive to which one you drop first |
| Clustering + representative | More stable, group-wise | Slightly more compute, hyperparameter (cut height) |
| VIF-based removal | Targets multicollinearity for linear models | Assumes linearity; removal is iterative and order-sensitive |
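For the VIF row above, a dependency-light sketch that computes variance inflation factors from first principles (the `vif` helper name is illustrative; statsmodels offers `variance_inflation_factor` if you prefer a library call):

```python
import numpy as np
import pandas as pd

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (plus an intercept)."""
    out = {}
    for col in X.columns:
        y = X[col].to_numpy()
        others = np.column_stack([X.drop(columns=col).to_numpy(),
                                  np.ones(len(X))])       # intercept term
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / max(1.0 - r2, 1e-12)             # cap near-perfect fits
    return pd.Series(out)
```

A common rule of thumb flags VIF above 5 or 10; drop the worst offender, recompute, and repeat.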
Watch-outs & practical gotchas
- Non-linear redundancy: Pearson misses it. Use Spearman or mutual information when you suspect monotonic or non-linear ties.
- Target leakage: If you encode categorical features using target data before splitting, correlation measures leak. Compute correlations only on the training set.
- Time-series / concept drift: Correlations can change over time. Recompute periodically in production (you covered drift earlier — now apply it here).
- Categorical variables: One-hot expands columns; correlated dummies can be everywhere. Consider grouping or using embeddings.
- Interactions & derived features: Removing one feature might kill an interaction term’s usefulness. If downstream models use interactions heavily, be conservative.
Quick pseudo-Python recipe
```python
# Sketch (pandas + numpy); assumes train_features is a numeric training-set DataFrame
import numpy as np

X = train_features.copy()
# 1) Impute/encode (fit on the training split only -- avoid leakage)
# 2) Compute correlation matrix (Spearman for rank robustness)
corr = X.corr(method='spearman').abs()
# 3) Mask the lower triangle so each pair is inspected exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.8
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
# 4) Use mutual information to choose within each pair (or drop to_drop wholesale)
X_pruned = X.drop(columns=to_drop)
# ... then retrain a simple model to validate
```

(Real code should handle mixed data types, compute MI scores, and avoid leakage.)
Example: House prices and the sneaky sqft twins
You have: total_area, living_area, num_rooms, bedrooms. The pair total_area and living_area correlates at |r| = 0.92. Mutual information with the target: total_area (0.45), living_area (0.44). But living_area has many missing entries. Decision: prune living_area, keep total_area. Result: model coefficients stabilize, training time drops, and interpretability improves.
Ask yourself: "If I remove this feature, does predictive skill drop?" If yes — pause. If no — prune with a smug smile.
Closing — When to prune and when to chill
- Use correlation-based pruning as a fast, interpretable first pass after cleaning and before heavier methods (mutual information checks, regularized embedded methods).
- Pair it with mutual information: correlation tells you redundancy; MI tells you predictive value. Use both.
- Re-evaluate in production: drift may resurrect dropped features or bury kept ones.
Key takeaways:
- Correlation pruning = speed + simplicity, not omniscience.
- Match correlation metric to data types (Pearson vs Spearman vs phi).
- Pick drop candidates by predictive usefulness and data quality, not just raw correlation.
Final mic-drop: Pruning features is like pruning a bonsai — don't chop on impulse. Remove slowly, validate often, and keep the shape elegant.