Dimensionality Reduction and Feature Selection
Reduce redundancy and highlight signal with supervised and unsupervised techniques.
Filter Methods for Feature Selection
Filter Methods for Feature Selection — Quick, Dirty, and Actually Useful
"Feature selection is like removing the junk mail from your inbox before the machine learning model starts dating your data." — Your slightly dramatic TA
You're coming off lessons about shortcut learning, spurious correlations, and the horrors/quirks of small-n, high-d datasets (and yes, the federated learning teaser where clients can't just hand over data like it's a pizza). Filter methods are a perfect next stop: fast, scalable, and delightfully model-agnostic — but they come with caveats that quietly sabotage careless engineers.
What are Filter Methods, in a Sentence
Filter methods score each feature by a heuristic (statistical) measure and keep the top-scoring ones. They act before model training, typically independent of any classifier/regressor.
Think of them as the bouncer at your club: they check IDs (statistics), don’t care what the guests will do later on the dance floor (the model), and toss out anyone who looks suspiciously unhelpful.
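To make that concrete, here is a minimal sketch of the idea in plain NumPy, using absolute Pearson correlation as an arbitrarily chosen score; it assumes `X` is a numeric feature matrix and `y` a numeric target vector.

```python
import numpy as np

def filter_top_k(X, y, k):
    """Rank features by |Pearson correlation| with y and keep the k best."""
    # One score per column: how strongly that feature tracks the target, ignoring all other features
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]  # column indices of the top-k features

# Usage sketch: keep = filter_top_k(X_train, y_train, k=20); X_train_reduced = X_train[:, keep]
```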
Why use Filter Methods? (When you’re busy, poor, or paranoid)
- Speed — scores are univariate, so total cost is roughly O(n · p) for n samples and p features (linear in the number of features). Great for high-D settings.
- Model-agnostic — Use the same selection for many models (handy in production experiments and federated settings where you can't iterate with a central model).
- Simplicity — Easy to interpret and explain to stakeholders.
But remember previous lessons: they can’t see interactions, and they might amplify spurious correlations or shortcut learning if your data are noisy or confounded.
Common Filter Scores (Cheat Sheet)
| Method | Input Types | What it measures | When to use | Caveats |
|---|---|---|---|---|
| Variance Threshold | numerical | Feature variance (drops near-constant features) | Fast baseline | Misses useless features that still vary |
| Pearson correlation | numerical vs numerical target | Linear association | Regression with linear-ish relationships | Misses nonlinear relationships; sensitive to outliers |
| ANOVA F-test | numerical features, categorical target | Mean differences across classes | Classification with continuous features | Assumes roughly normal, equal-variance distributions |
| Chi-square | categorical features & target | Dependence between categories | Categorical features | Needs non-negative count data and non-zero expected counts |
| Mutual Information | any | Any (even nonlinear) dependency | When nonlinearity matters | Needs more data; estimates are high-variance |
| Information Gain (entropy) | categorical | Reduction in class uncertainty | Prep for tree-style classifiers | Biased toward features with many categories |
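As a rough illustration of a few rows in this table, scikit-learn exposes several of these scores directly. The snippet below is a sketch on synthetic data, not a recipe; note the chi-square workaround, since that test expects non-negative, count-like features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, f_classif, chi2, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

keep_mask = VarianceThreshold(threshold=0.0).fit(X).get_support()  # drop constant columns
f_scores, _ = f_classif(X, y)                           # ANOVA F: continuous features, class target
mi_scores = mutual_info_classif(X, y, random_state=0)   # mutual information: catches nonlinear dependence
chi2_scores, _ = chi2(np.abs(X), y)                     # chi-square needs non-negative counts; abs() only makes the demo run
```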
How to actually do it — a practical pipeline
- Preprocess first: impute, encode, and scale based only on the training fold to avoid leakage.
- Choose score(s) based on data types and suspected relationships (linear vs nonlinear).
- Compute scores for each feature on training data.
- Select features by keeping the top-k, applying a score threshold, or keeping a percentile.
- Validate using cross-validation: evaluate downstream model performance as you vary k.
- Check stability across folds/clients (important in federated settings).
Code sketch (scikit-learn, with mutual information as the score):
# Put selection inside a Pipeline so every CV fold re-fits the selector on its own training split (no leakage)
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
selector = SelectKBest(score_func=mutual_info_classif, k=50)
pipe = make_pipeline(selector, model)          # model: any estimator with fit/predict
pipe.fit(X_train, y_train)                     # feature scores are computed on the training data only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
Practical Example: Genomics + Batch Effects (AKA the trapdoor)
Imagine you have gene expression data (tens of thousands of features), a disease label, and samples from two labs. One lab happened to process most sick patients — hello spurious correlation. A naive filter method (say, mutual information) will happily keep genes that separate by lab, not by disease.
So: filter methods will remove useless noise but they won't immunize you against batch effects, confounders, or shortcuts. Use domain knowledge, stratified scoring (score within-batch), or add batch-correction before scoring.
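One hedged way to make "score within-batch" concrete: compute the filter score separately inside each batch and keep only features that score well everywhere. The helper below is hypothetical; it assumes you have a `batch` label array alongside X and y, and that every batch contains both classes.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def batch_robust_scores(X, y, batch, random_state=0):
    """Score each feature within every batch, then take the per-feature minimum.

    A gene that only separates classes inside one lab (a likely batch artifact) gets
    a low minimum score; a real disease signal should score reasonably in all batches.
    """
    per_batch = []
    for b in np.unique(batch):
        mask = (batch == b)
        per_batch.append(mutual_info_classif(X[mask], y[mask], random_state=random_state))
    return np.min(per_batch, axis=0)
```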
How Filter Methods Interact with Previous Topics
Shortcut learning & spurious correlation: filter scores can be hijacked by shortcuts. Always pair filtering with checks for confounding (feature vs batch, feature vs client). If a top feature correlates with a known nuisance variable, flag it.
Small data & high-D: filters are lifesavers when n << p, since wrappers/embedded methods can overfit horribly here. But beware: mutual information estimates have high variance when n is tiny.
Federated learning basics: filters are attractive in federated setups because clients can compute local scores and share only aggregated rankings or counts rather than raw data. But you must harmonize scoring (same preprocessing) and check across-client stability.
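A rough sketch of "share rankings, not data": each client scores features locally, the server averages the ranks (a simple Borda-style consensus), and only the aggregate ever leaves the client. The function name is illustrative, not part of any federated framework.

```python
import numpy as np
from scipy.stats import rankdata

def aggregate_client_rankings(client_scores, k):
    """client_scores: list of 1-D score arrays, one per client (higher = more informative)."""
    # Rank within each client so differently scaled local scores become comparable
    ranks = np.array([rankdata(-s) for s in client_scores])  # rank 1 = that client's best feature
    mean_rank = ranks.mean(axis=0)
    return np.argsort(mean_rank)[:k]  # indices of the k features with the best average rank
```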
Strengths and Weaknesses (TL;DR)
Strengths:
- Fast, scalable, easy to debug.
- Model-agnostic — one set of features works across algorithms.
- Works well as an initial dimensionality reduction step (pre-filter before PCA/wrappers).
Weaknesses:
- Ignores feature interactions — two useless features together might be gold, but filters won’t see it.
- Sensitive to confounders, batch effects, and spurious correlations.
- Selection instability: different train folds may pick different features.
Tips, Tricks, and Survival Strategies
- Combine filters with a second-stage selection: use filter to reduce to a few hundred features, then use a wrapper (e.g., recursive feature elimination) or embedded method (regularized model).
- Use stability selection: bootstrapped filter + consensus features (a sketch follows this list).
- In federated setups, exchange feature ranks/thresholds, not raw feature values.
- Normalize if using distance-based filters; log-transform skewed features before F-tests or correlation.
- Visual sanity check: plot top features against known nuisances (batch, client id, collection date).
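A minimal sketch of the stability-selection tip above, assuming ANOVA F-scores as the base filter; the function name and thresholds are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def stable_features(X, y, k=50, n_boot=100, min_frequency=0.6, seed=0):
    """Keep features that land in the top-k on at least `min_frequency` of bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)        # bootstrap resample of the rows
        scores, _ = f_classif(X[idx], y[idx])
        counts[np.argsort(scores)[::-1][:k]] += 1        # tally top-k membership
    return np.where(counts / n_boot >= min_frequency)[0] # consensus feature indices
```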
Quick Decision Flow (mini flowchart in words)
- Are features mostly continuous? Use variance threshold + Pearson/ANOVA.
- Expect nonlinear signals? Add mutual information.
- Categorical features? Chi-square or information gain.
- Very high-D and tiny n? Start with variance + domain-based pruning, then filter.
Closing — The Takeaway (and a tiny pep talk)
Filter methods are your speed-demon first responder: they rescue models from death by dimensionality and give you interpretable, fast reductions. But they're not psychic — they won't save you from dataset pathologies (shortcut learning, spurious correlations) unless you do the detective work: stratified scoring, stability checks, and domain-driven sanity checks.
If you treat filter methods like blunt instruments, you'll get blunt results. Use them as smart blunt instruments: fast, explainable, and a great first pass — then iterate.
Final thought: in real-world ML, features are stories. Filter methods flag characters who seem important, but you still need to read the chapter to know why they matter.
Version notes: This lesson builds on earlier topics on dataset pitfalls (shortcut learning, small-data high-D issues) and federated concerns: use filter methods carefully in those contexts.