Dimensionality Reduction and Feature Selection
Reduce redundancy and highlight signal with supervised and unsupervised techniques.
Mutual Information for Supervised Tasks
Mutual Information for Supervised Tasks — The Sexy Math of "How Much Does This Help?"
"Mutual information is the amount of uncertainty one variable takes away from another — like how knowing someone's coffee order predicts whether they are sleep-deprived."
You're coming in hot from Wrapper methods/RFE and Embedded methods with regularization, so you already know about searching feature subsets and letting models decide weights. Mutual Information (MI) is the elegant, slightly smug cousin in the filter family: model-agnostic, nonparametric, and great for a first pass through a fridge full of candidate features before you summon RFE or L1-regularized gladiators.
What is Mutual Information (in plain English)?
- Mutual Information (I(X; Y)) measures how much knowing X reduces uncertainty about Y. If X tells you nothing about Y, MI = 0. If X tells you everything about Y (rare), MI is maximal.
- Formally:
I(X; Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y)
where H() is entropy. For supervised tasks: X = feature, Y = target. For classification, Y is discrete; for regression, Y is continuous (so MI estimation methods differ).
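The identity I(X; Y) = H(X) + H(Y) - H(X, Y) can be checked numerically for discrete variables. A minimal sketch using empirical counts (the helper names `entropy` and `mutual_info_discrete` are illustrative, not a library API):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in nats of a probability vector (ignores zero cells)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_info_discrete(x, y):
    # Empirical I(X;Y) = H(X) + H(Y) - H(X,Y) from the joint count table
    xv, xi = np.unique(x, return_inverse=True)
    yv, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xv), len(yv)))
    np.add.at(joint, (xi, yi), 1)
    joint /= joint.sum()
    return entropy(joint.sum(1)) + entropy(joint.sum(0)) - entropy(joint.ravel())

# A perfectly informative feature: MI equals H(Y) = log 2
x = np.array([0, 0, 1, 1, 0, 1])
print(mutual_info_discrete(x, x.copy()))  # ≈ 0.693 nats (log 2)
```

When X determines Y completely, H(Y | X) = 0, so MI hits its ceiling H(Y); an independent X gives MI = 0.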
Why use MI in supervised feature selection?
- Captures nonlinear relationships that correlation misses.
- Model-agnostic: no need to fit a classifier/regressor to rank features first.
- Works for mixed variable types (with appropriate estimators).
- Fast enough for an initial screening in high-dimensional data.
But like any sassy tool, it has caveats (coming up).
How to compute it (practical summary)
- Discrete Y (classification): you can discretize continuous X or use discrete estimators. Scikit-learn provides `mutual_info_classif`, which uses a KNN-based estimator.
- Continuous Y (regression): `mutual_info_regression` (also KNN-based) estimates MI via nearest-neighbor statistics (Kraskov-style estimators).
- Gaussian special case: if (X, Y) are jointly Gaussian, MI relates to the Pearson correlation ρ:
I(X;Y) = -0.5 * log(1 - ρ^2)
Useful sanity check: if correlation is tiny but MI is large, there's nonlinearity; if both are small, the feature is likely useless.
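The Gaussian closed form is handy for building intuition about scale: MI grows slowly at first and then explodes as |ρ| approaches 1. A quick numerical sketch (the helper name `gaussian_mi` is illustrative):

```python
import numpy as np

def gaussian_mi(rho):
    # Analytic I(X;Y) in nats for jointly Gaussian (X, Y) with correlation rho
    return -0.5 * np.log(1.0 - rho ** 2)

# Zero correlation means zero MI; MI diverges as |rho| -> 1
for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho = {rho:4}: I = {gaussian_mi(rho):.3f} nats")
```

Note the units: like scikit-learn's estimators, this returns MI in nats (natural log), not bits.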
Code snippet (scikit-learn):
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
mi = mutual_info_classif(X, y, discrete_features='auto', n_neighbors=3, random_state=0)
# or
mi_reg = mutual_info_regression(X, y_continuous, n_neighbors=3, random_state=0)
Parameters to watch:
- n_neighbors: controls the bias/variance tradeoff of the estimator; values in the 3–10 range are a reasonable starting point.
- discrete_features: mark categorical features so the estimator handles them properly.
- random_state: the estimator uses randomness, so fix the seed for reproducible rankings.
Intuition + tiny examples
- If feature X is age group and Y is disease (binary), MI tells you how much knowing the age group reduces the uncertainty of disease status — beyond linear odds.
- If X is a sine of time and Y is power usage, MI can be high even if Pearson correlation ~0 (nonlinear).
Ask yourself: "If I were handed X, how surprised would I still be about Y?" MI = surprise reduction.
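This "surprise reduction" view is easy to demonstrate: a quadratic relationship has near-zero Pearson correlation but clearly positive MI. A synthetic sketch using scikit-learn's estimator:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = x ** 2 + rng.normal(0, 0.05, 2000)  # purely nonlinear dependence on x

# Pearson misses the symmetric relationship; MI does not
corr = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
print(f"Pearson r = {corr:.3f}, MI = {mi:.3f} nats")
```

Because y depends on x symmetrically around zero, positive and negative slopes cancel in the correlation, while knowing x still sharply narrows down y.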
Pitfalls, practical gotchas, and how they relate to prior topics
- Sample size matters: KNN-based MI estimators are biased with small n. If your dataset is tiny or heavily imbalanced (recall our "Handling Real-World Data Issues" talk), MI may understate usefulness. Use bootstrapping or permutation tests to calibrate.
- Noise & drift: noisy features reduce MI (obvious). Under concept drift, MI ranking can change — monitor MI over time or compute conditional MI with recent windows.
- Redundancy: MI(X_i; Y) doesn't account for overlap between features. Two features each with high MI may be redundant. This is where mRMR (minimum Redundancy, Maximum Relevance) helps: combine MI with redundancy penalties.
- Conditional dependencies: Sometimes a feature is only informative when combined with another. Pairwise MI misses interactions — wrapper methods or conditional mutual information are needed.
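The permutation-test calibration mentioned above can be sketched as follows: shuffle the target to build a null distribution of MI scores, then ask how often chance alone matches the observed score (the helper name `mi_permutation_pvalue` is illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_permutation_pvalue(x, y, n_perm=100, seed=0):
    # Compare observed MI against a null built by shuffling y
    rng = np.random.default_rng(seed)
    x2d = np.asarray(x).reshape(-1, 1)
    observed = mutual_info_classif(x2d, y, random_state=seed)[0]
    null = [mutual_info_classif(x2d, rng.permutation(y), random_state=seed)[0]
            for _ in range(n_perm)]
    # p-value: fraction of shuffled runs with MI at least as large as observed
    return observed, float(np.mean([m >= observed for m in null]))

# Example: a genuinely informative feature on a modest sample
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
x = y + rng.normal(0, 0.3, 200)
obs, p = mi_permutation_pvalue(x, y, n_perm=50, seed=1)
print(f"MI = {obs:.3f}, permutation p = {p:.3f}")
```

A small p-value suggests the MI score is not a small-sample artifact; on tiny datasets a noise feature can show nonzero raw MI while its p-value stays large.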
Relation to what you learned before:
- RFE/Wrapper: these capture conditional/interaction effects because they fit models. Use MI for initial screening to reduce feature count before RFE.
- Embedded (L1): picks features that help a specific model. MI is model-agnostic and can find different signals (especially nonlinear ones) that L1 might miss.
Advanced-ish strategies (how to use MI in a pipeline)
- Screen: Use MI to drop obviously dead features (low MI) — cheap and effective for thousands of features.
- De-redundify: Apply mRMR or greedy selection using MI to penalize redundancy.
- Refine: Run RFE or L1-regularized models on the reduced set — now the expensive wrapper/embedded methods are feasible.
- Monitor: In production, track MI over time for drift detection and periodically re-run selection.
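The screen-then-refine steps map directly onto scikit-learn. A hedged sketch on synthetic data (the feature counts are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# Screen: cheap MI filter keeps the 15 most informative-looking features
screen = SelectKBest(mutual_info_classif, k=15).fit(X, y)
X_small = screen.transform(X)

# Refine: the expensive wrapper now searches 15 candidates instead of 50
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X_small, y)
print("features kept after screen + RFE:", rfe.support_.sum())
```

The MI screen is a single pass over features, while RFE refits a model repeatedly, so shrinking the candidate set first is where most of the savings come from.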
Pseudo-code for a simple mRMR greedy loop:
selected = []
while len(selected) < k:
    best_feature = argmax over f not in selected of
        MI(f, Y) - mean_{s in selected} MI(f, s)   # redundancy term = 0 while selected is empty
    selected.append(best_feature)
This favors features that are relevant to Y and non-redundant with already chosen features.
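The same greedy loop in runnable form. As a sketch it reuses scikit-learn's estimators (MI against the class target for relevance, feature-to-feature MI for redundancy), which is one reasonable choice among several; the function name `mrmr_select` is illustrative:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    # Greedy mRMR: pick argmax of MI(f, y) - mean MI(f, already-selected)
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        scores = []
        for f in remaining:
            if selected:
                redundancy = mutual_info_regression(
                    X[:, selected], X[:, f], random_state=random_state).mean()
            else:
                redundancy = 0.0  # no redundancy penalty for the first pick
            scores.append(relevance[f] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

On data with a duplicated feature, the clone's high redundancy with the first pick pushes it down the ranking even though its relevance is identical.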
Quick comparison (table)
| Method | Nonlinear? | Considers redundancy | Model-agnostic | Cost |
|---|---|---|---|---|
| Mutual Information (filter) | Yes | No (unless mRMR) | Yes | Low–Medium |
| Pearson correlation | No | No | Yes | Very Low |
| RFE (wrapper) | Yes (if model is) | Yes (via model) | No | High |
| L1 (embedded) | Only linear sparsity | No | No | Medium |
Rules of thumb / Checklist
- Use MI for quick screening in high-dimensional settings.
- Scale continuous features before KNN-based MI (distance-sensitive).
- For imbalanced classification, use stratified subsampling or weighting when estimating MI.
- Combine MI with redundancy control (mRMR) to avoid selecting 10 clones of the same signal.
- Validate MI-chosen features by training a model and using cross-validated performance or permutation importance.
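The last checklist item can be automated. A sketch, with the caveat that strictly the selection step should sit inside each CV fold (e.g. via a `Pipeline`) to avoid selection bias:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=30, n_informative=4,
                           random_state=0)

# Rank features by MI and keep the top 5
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:5]

# Sanity-check the selection with cross-validated accuracy
score = cross_val_score(LogisticRegression(max_iter=1000), X[:, top], y, cv=5).mean()
print(f"CV accuracy on MI-selected features: {score:.3f}")
```

If the cross-validated score on the selected subset is no better than chance, the MI ranking was fitting noise and the selection should be recalibrated (e.g. with the permutation test above).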
Final takeaways (the heroic one-liners)
- Mutual Information = "How much does this feature reduce my uncertainty about the target?" Great for catching nonlinear signals that correlation misses.
- Not a panacea: it is a superb first pass, but pair it with redundancy control and follow up with model-based selection.
- Production tip: Monitor MI over time as a lightweight drift detector: if MI drops for a formerly informative feature, something changed upstream.
Use MI to prune the jungle, but bring RFE and L1 into the arena for fine fighting.
Further reading: Kraskov et al. (KNN-based MI estimators) and Peng et al. (mRMR).