Dimensionality Reduction and Feature Selection
Reduce redundancy and highlight signal with supervised and unsupervised techniques.
Embedded Methods with Regularization — The Swiss Army Knife of Feature Selection
"Feature selection that sneaks into training like it pays rent — efficient, practical, and slightly smug."
You already met filter methods (quick, cheap heuristics) and wrapper methods/RFE (exhaustive, accurate-ish, and computationally hungry). Now it’s time to introduce the in-between hero: embedded methods, especially those that use regularization (L1, L2, Elastic Net). These methods fold feature selection into model training itself — elegant, practical, and usually faster than wrappers for real-world problems.
Why embedded methods? Quick refresher context
- Filter methods rank features with independent criteria (e.g., mutual information) — fast but oblivious to the model.
- Wrapper methods (like RFE) search subsets by repeatedly training models — accurate but slow and fragile with noisy data.
Embedded methods: the model learns parameters and discards or penalizes features at the same time. They're a middle ground: model-aware like wrappers, but far more computationally efficient because selection happens during training.
They’re particularly attractive when you’ve already wrestled with real-world data issues — noise, drift, imbalance — because regularization provides both shrinkage (robustness) and simplicity (sparser models that generalize better).
The core idea (math light, intuition heavy)
Regularization adds a penalty to the loss function to discourage complex models.
- Ordinary least squares minimizes: L = sum((y - Xw)^2)
- With regularization: L = sum((y - Xw)^2) + alpha * penalty(w)
Common penalties:
- L2 (Ridge): penalty(w) = ||w||_2^2 — shrinks coefficients but rarely makes them exactly zero.
- L1 (Lasso): penalty(w) = ||w||_1 — encourages sparsity; many coefficients become exactly zero (feature selection!).
- Elastic Net: mixture of L1 and L2 — balances sparsity and stability when features are correlated.
Think of Ridge as a weight-loss coach who tells all weights to shrink proportionally; Lasso is a harsh editor who cuts whole words (features) out of the manuscript.
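A quick sketch of that contrast, using synthetic data and illustrative alpha values (the dataset, alphas, and y-scaling below are assumptions for the demo, not recommended defaults):

```python
# Compare how Ridge, Lasso, and Elastic Net treat coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
y = (y - y.mean()) / y.std()  # unit-scale y so one alpha is meaningful everywhere

results = {}
for model in (Ridge(alpha=1.0), Lasso(alpha=0.3),
              ElasticNet(alpha=0.3, l1_ratio=0.5)):
    model.fit(X, y)
    # Count coefficients driven to exactly zero by each penalty
    results[type(model).__name__] = int(np.sum(model.coef_ == 0))
    print(type(model).__name__, "zero coefficients:", results[type(model).__name__])
```

Ridge leaves every coefficient nonzero (just smaller); Lasso and Elastic Net zero out most of the 15 uninformative features.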
Comparison table (so your future self can stop guessing)
| Method | Effect on coefficients | Good when... | Drawbacks |
|---|---|---|---|
| Ridge (L2) | Shrinks, rarely zero | Multicollinearity, many small contributing features | Doesn't select features — not sparse |
| Lasso (L1) | Sparse (exact zeros) | When you want feature selection and interpretability | Unstable with correlated features; can pick one arbitrarily |
| Elastic Net | Sparse + stable | Many correlated features; need compromise between L1 and L2 | Two hyperparameters to tune (alpha & l1_ratio) |
Practical tips — how to use embedded regularization correctly
- Always scale your features (e.g., with StandardScaler) before applying penalties based on coefficient magnitude; L1/L2 assume commensurate feature scales.
- Wrap selection inside cross-validation: feature selection must happen inside each CV fold (use sklearn Pipelines) — otherwise you leak information and inflate performance.
- Tune alpha (regularization strength) with CV (LassoCV, ElasticNetCV) — not by eyeballing. Too high → underfit, too low → no selection.
- Watch correlated features: Lasso may arbitrarily pick one. Use Elastic Net or Group Lasso if groups of features should be selected together.
- Check stability: run selection across bootstrap samples; unstable features → be skeptical.
- Combine with filters if you have tens of thousands of features: do a cheap filter to reduce dimensionality, then apply embedded methods.
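The stability check from the tips above can be sketched like this: refit Lasso on bootstrap resamples and count how often each feature survives (the dataset, alpha, and the 80% threshold are illustrative assumptions):

```python
# Bootstrap selection frequency for Lasso-selected features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, n_informative=4,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)
y = (y - y.mean()) / y.std()

rng = np.random.default_rng(0)
n_boot = 50
counts = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, len(X), size=len(X))  # resample rows with replacement
    lasso = Lasso(alpha=0.2).fit(X[idx], y[idx])
    counts += (lasso.coef_ != 0)

freq = counts / n_boot
stable = np.where(freq >= 0.8)[0]  # selected in at least 80% of resamples
print("selection frequency per feature:", np.round(freq, 2))
print("stable feature indices:", stable)
```

Features with high selection frequency are trustworthy; features that flicker in and out across resamples deserve skepticism.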
Code snippet (scikit-learn style):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),
    # n_alphas sets how many alpha values to try along the regularization path
    ('en', ElasticNetCV(cv=5, l1_ratio=[.1, .5, .9], n_alphas=100))
])
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print(scores.mean())
```

To extract selected features (assuming X_train is a pandas DataFrame):

```python
pipe.fit(X_train, y_train)
coef = pipe.named_steps['en'].coef_
selected = X_train.columns[coef != 0]
```
Real-world redux: handling noise, drift, imbalance — how regularization helps (and where it fails)
- Noise: Regularization shrinks noisy coefficients, improving generalization. Lasso can eliminate noisy features outright.
- Drift: If distribution changes, a smaller, robust model is easier to monitor and retrain. But regularization won’t fix concept drift — you still need drift detection and periodic retraining.
- Imbalance: Regularization doesn't directly solve class imbalance. Combine with class weighting, resampling, or metrics that reflect imbalance. For classification with L1-penalized logistic regression, use class_weight or sample weights.
Caveat: If features are noisy and correlated, Lasso might keep the wrong one. Elastic Net or domain-driven grouping helps.
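For the imbalanced-classification case above, a minimal sketch of L1-penalized logistic regression with class weighting (the dataset, solver choice, and C value are illustrative assumptions):

```python
# L1 logistic regression + class_weight on an imbalanced binary task.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

clf = Pipeline([
    ('scaler', StandardScaler()),
    # 'liblinear' supports the L1 penalty; class_weight='balanced'
    # upweights the minority class in the loss.
    ('lr', LogisticRegression(penalty='l1', solver='liblinear',
                              class_weight='balanced', C=0.5)),
])
clf.fit(X, y)
n_selected = int((clf.named_steps['lr'].coef_ != 0).sum())
print("features kept by L1 logistic regression:", n_selected)
```

The L1 penalty handles selection; the class weighting handles imbalance. Neither substitutes for the other.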
More embedded flavors (don't put everything in a single box)
- Tree-based models (RandomForest, GradientBoosting) provide feature importances during training. Not sparsity in coefficients but usable for selection. They handle nonlinearity and interactions out of the box.
- Regularized neural nets: L1/L2 penalties on weights or dropout achieve implicit selection/shrinkage — but extracting interpretable selected features is harder.
- Group Lasso: if features come in logical groups (e.g., one-hot encodings), group penalties select entire groups.
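The tree-based flavor can be sketched with scikit-learn's SelectFromModel, which turns importances into a selection mask (the synthetic dataset and the mean-importance threshold are illustrative assumptions):

```python
# Embedded selection via RandomForest importances + SelectFromModel.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=300, n_features=15, n_informative=4,
                       noise=5.0, random_state=0)

# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=0),
    threshold='mean',
).fit(X, y)
mask = selector.get_support()
print("selected feature indices:", [i for i, keep in enumerate(mask) if keep])
```

Unlike Lasso, this captures nonlinear effects, but importances are relative scores, not exact zeros, so the threshold is a judgment call.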
Quick recipe — Production-ready pipeline
- Exploratory data check: correlations, missing values, distributions.
- Simple filter to drop obviously useless features (variance threshold, domain rules).
- Pipeline with StandardScaler + ElasticNetCV (or LassoCV) wrapped in cross-validation.
- Stability check: bootstrap selection frequency; if a feature is selected < X% of times, consider removing.
- Monitor performance and feature distribution in production — automated alerts for drift.
- Retrain schedule: more frequent when features drift often. Keep model simple: less fragile.
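Steps 2–3 of the recipe above, sketched as one leak-free pipeline: a cheap variance filter followed by scaled ElasticNetCV, all evaluated inside cross-validation (the dataset, threshold, and CV settings are illustrative assumptions):

```python
# Cheap filter + embedded selection, wrapped so each CV fold
# refits the filter, scaler, and model on its own training split.
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    ('filter', VarianceThreshold(threshold=0.0)),  # drop constant columns
    ('scaler', StandardScaler()),
    ('en', ElasticNetCV(cv=5, l1_ratio=[.1, .5, .9], n_alphas=50)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_squared_error')
print("mean CV score:", scores.mean())
```

Because every step lives inside the Pipeline, no fold ever sees statistics computed from its own test data.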
Final thoughts — the life lesson in regularization
Embedded methods with regularization are the pragmatic middle child: model-aware like wrappers, efficient like filters. They reduce overfitting, enhance interpretability, and make models easier to maintain in production — but they’re not magic. Mind your preprocessing, guard against leakage, and remember: stability > novelty.
"A sparse model is not just tidy — it’s survivable in the wild."
Key takeaways:
- Lasso = sparsity; Ridge = shrinkage; Elastic Net = best of both when features are buddies (correlated).
- Always scale, tune, and embed selection inside CV.
- Regularization helps with noise and simplifies monitoring for drift, but does not replace explicit drift handling or imbalance strategies.
Now go forth and regularize like a responsible ML citizen. Your production pipeline — and on-call future you — will thank you.