Dimensionality Reduction and Feature Selection
Reduce redundancy and highlight signal with supervised and unsupervised techniques.
Wrapper Methods & RFE — The Slow-Cook Approach to Feature Selection
"If filter methods are the quick salad of feature selection, wrapper methods are the slow-smoked brisket: they take longer, taste better for the specific recipe, and might bankrupt you if you're careless."
You're coming fresh from Filter Methods for Feature Selection (nice work — you learned how to toss out obviously trashy features quickly). You've also seen the horrors of shortcut learning, spurious correlations, and the cursed regime of small data + high dimensionality. Good. We're now digging into wrapper methods — especially Recursive Feature Elimination (RFE) — which sit between the brute-force speed of filters and the model-integrated elegance of embedded methods.
What are wrapper methods, in human terms?
- Wrapper methods treat the model as a black box and ask: "Which subset of features makes this particular model perform best?"
- They wrap the learning algorithm around a search through feature subsets and evaluate model performance (usually via cross-validation) to pick winners.
Why this matters after filters and real-world headaches: filters are fast but oblivious to the model's inductive biases. Wrappers consider the model directly — useful when feature interactions matter (e.g., two weak features combined are gold), or when shortcut learning might mislead a simple filter.
RFE: Recursive Feature Elimination — the recursive prune-and-judge ritual
What it is: RFE starts with all features (or a large set) and repeatedly:
- Train the model on current features
- Rank features by some importance score from the model
- Remove the least important feature(s)
- Repeat until you hit the target number of features
It’s a greedy backward-elimination strategy — prune the twig that looks weakest, retrain, repeat. Pretty dramatic, but effective.
The core loop, as a minimal runnable Python sketch (NumPy only; `estimator` is any object exposing `fit` plus `feature_importances_` or `coef_`, and the `coef_` fallback assumes binary classification or regression):

```python
import numpy as np

def rfe(estimator, X, y, n_features_to_select, step=1):
    features = np.arange(X.shape[1])      # start with all feature indices
    while len(features) > n_features_to_select:
        estimator.fit(X[:, features], y)
        importances = getattr(estimator, "feature_importances_", None)
        if importances is None:           # fall back to linear coefficients
            importances = np.abs(estimator.coef_).ravel()
        # drop the `step` weakest features, but never below the target count
        n_drop = min(step, len(features) - n_features_to_select)
        worst = np.argsort(importances)[:n_drop]
        features = np.delete(features, worst)
    return features
```
- step: number of features to drop per iteration (higher = faster, coarser)
- estimator: must provide some way to rank features (coefficients, feature_importances_, or permutation importances)
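In practice you rarely hand-roll this loop: scikit-learn ships it as `RFE`. A minimal usage sketch (the synthetic dataset and hyperparameters here are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data: 20 features, only 5 of which carry signal.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask: True for the 5 kept features
print(selector.ranking_)   # 1 = selected; larger ranks were eliminated earlier
```

`selector.transform(X)` then yields the reduced feature matrix for downstream training.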
Variants & niceties
- RFECV: RFE + cross-validation to choose the optimal number of features automatically. Expect heavier compute time but better guardrails.
- Step size: removing many features per step speeds things up but can skip past a near-optimal subset. A common rule of thumb is to remove 1–5% of the remaining features per iteration.
- Estimator choice: Use a stable, deterministic estimator if you want reproducible rankings. Tree ensembles and linear models are common; beware of randomness unless fixed with seeds.
- Scoring: Use an appropriate CV scoring metric (AUC, F1, R2) — especially important when class imbalance or regression quirks are present (remember our earlier discussion on imbalance and shortcut learning).
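The RFECV variant and a task-appropriate scoring metric combine naturally; a hedged sketch (dataset, fold count, and AUC scoring are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=25,
                           n_informative=5, random_state=0)

# RFECV picks the number of features via cross-validated scoring
# instead of requiring a fixed n_features_to_select up front.
selector = RFECV(LogisticRegression(max_iter=1000),
                 step=1,
                 cv=StratifiedKFold(5),
                 scoring="roc_auc",
                 min_features_to_select=1)
selector.fit(X, y)

print(selector.n_features_)  # CV-chosen subset size
```

Swap `scoring` for `"f1"` or `"r2"` as the task demands; stratified folds matter most under class imbalance.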
When wrapper methods (and RFE) shine
- You suspect feature interactions that filters miss.
- You have a specific model you plan to deploy and want features tuned to it.
- You can afford compute or can pre-filter (use filter methods first) to cut candidate number.
When not to use them: extremely high-dim data without pre-filtering (they’ll be slow), or when model interpretability requires global feature importance across many models.
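The filter-then-wrapper recipe is easiest to keep leak-free inside a pipeline, so the filter refits on each training fold. A sketch under assumed sizes (500 raw features filtered to 50, then RFE down to 10):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

# Cheap univariate filter first; the expensive wrapper only sees survivors.
pipe = Pipeline([
    ("filter", SelectKBest(f_classif, k=50)),
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
```

Cross-validating `pipe` as a whole keeps both selection stages honest.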
Real-world examples (so it feels real)
- Genomics: thousands of SNPs; filters (e.g., variance/chi-square) narrow to a few thousand, then RFE with an SVM or logistic regressor finds the biologically relevant subset. Caveat: high chance of overfitting on small cohorts — use nested CV.
- Text features: after TF-IDF pruning (filters), use RFE with a linear classifier to select n-grams that cooperate to predict sentiment.
- Sensors/IoT: dozens of signals — RFE with tree ensembles can reveal which sensors are redundantly providing the same information (and which combinations predict failures).
Pitfalls & gotchas (the parts that bite you at 2 a.m.)
- Computational cost: RFE trains many models. If your estimator is expensive, prepare your wallet and GPU.
- Overfitting: Wrapper methods can overfit to noise if you tune feature subsets on a single train/test split. Always use cross-validation, and prefer nested CV when comparing different selectors or hyperparameters.
- Feature correlation: Highly correlated features can flip importance rankings across folds. The result: an unstable selected set. Check stability.
- Estimator bias: Feature-importance measures differ. Impurity-based random-forest importances are biased toward high-cardinality features; linear coefficients only rank features meaningfully when the features are on comparable scales.
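A quick way to probe the stability gotcha is to rerun RFE under different estimator seeds and count how often each feature survives (a sketch with illustrative sizes; near-1.0 selection frequency means a stable choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=15,
                           n_informative=4, random_state=0)

# Repeat RFE with different forest seeds; flip-flopping selections
# across runs are a red flag, often caused by correlated features.
counts = np.zeros(X.shape[1])
for seed in range(5):
    rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=seed),
              n_features_to_select=4)
    counts += rfe.fit(X, y).support_

print(counts / 5)  # per-feature selection frequency across the 5 runs
```

Resampling the rows (bootstrap) instead of, or in addition to, reseeding gives an even harsher stability test.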
Practical checklist / Best practices
- Pre-filter: Use variance threshold or univariate filters to remove obviously useless features before RFE. This saves hours and sanity.
- Scale features if your estimator needs it (e.g., SVM, logistic regression).
- Use RFECV or nested CV to avoid optimistic bias when selecting number of features.
- Fix randomness in your estimator or repeat RFE multiple times and average results to test stability.
- Inspect correlated groups: if several correlated features are interchangeable, consider group selection or domain-informed collapsing.
- Consider embedded methods (L1, tree-based) as alternatives or sanity checks.
- Use permutation importance or SHAP after selection to validate why features were kept.
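The last checklist item can be sketched with scikit-learn's `permutation_importance`, run on a held-out split so the validation is independent of selection (dataset sizes here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Select on the training split only, then refit on the kept features.
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=4).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = rfe.transform(X_tr), rfe.transform(X_te)
model = LogisticRegression(max_iter=1000).fit(X_tr_sel, y_tr)

# Permute each kept feature on held-out data: genuinely useful
# features should show a clearly positive score drop.
result = permutation_importance(model, X_te_sel, y_te,
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```

Kept features with near-zero permutation importance deserve suspicion: they may have been selected on noise.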
Quick comparison table
| Type | Speed | Model-aware? | Handles interactions? | Good for high-dim? |
|---|---|---|---|---|
| Filter | Fast | No | No | Yes (cheap) |
| Wrapper (RFE) | Slow | Yes | Yes | Only with pre-filtering |
| Embedded | Medium | Yes (built-in) | Sometimes | Often yes |
Final one-liner (to remember while you write code at 3AM)
Use filters to trim the forest, wrappers to prune the tree you plan to live under, and nested CV to make sure your pruning wasn't just dramatic overfitting.
Key takeaways
- RFE is powerful because it optimizes feature subsets for a specific model and can capture interactions filters miss.
- It's computationally heavy and can overfit; combat this with CV, nested CV, and pre-filtering.
- Check stability and validate selections with independent explanations (permutation importance, SHAP).
Go forth and eliminate features responsibly. Your model — and your energy bill — will thank you.