Dimensionality Reduction and Feature Selection
Reduce redundancy and highlight signal with supervised and unsupervised techniques.
Principal Component Analysis — PCA but Make It Dance
"PCA: turning messy, crowded data into a tasteful, minimalist party where every guest actually contributes." — Your slightly dramatic ML TA
Hook: Remember when you pruned features based on correlation and mutual information?
You already learned how to yank out boring twins (correlation-based pruning) and how to check whether a feature whispers anything useful about the label (mutual information). PCA is the next trick in the kit — but it’s a different animal. Instead of choosing features, PCA creates new ones: linear combos of the originals that capture the most variance.
Why this matters now: after you handled noise, imbalance, and pruning correlated junk, you still might have dozens — or hundreds — of features that are noisy, redundant, or just plain inconvenient for modeling and visualization. PCA helps compress that space while retaining structure. But it’s also a blunt instrument: it sacrifices original interpretability for compactness.
What PCA actually does (the short, dramatic version)
- Take your centered data matrix X (rows = samples, cols = features).
- Find orthogonal directions (principal components) that capture maximal variance.
- Project data onto the top k directions to reduce dimensionality.
In math-speak (read like a recipe):
1. X_centered = X - mean(X, axis=0)
2. C = cov(X_centered) = (X_centered.T @ X_centered) / (n-1)
3. Compute eigenvectors (V) and eigenvalues (Λ) of C
4. Sort eigenvectors by eigenvalues (descending)
5. Project: X_pca = X_centered @ V_k # keep top k eigenvectors
Alternative: do SVD on X_centered directly (numerically more stable): X_centered = U Σ V^T, then principal directions = columns of V.
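The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the correlated matrix and k=2 are arbitrary choices), using the SVD route since it avoids forming the covariance matrix explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with correlated features (200 samples, 5 features)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Step 1: center the data
X_centered = X - X.mean(axis=0)

# SVD route: X_centered = U @ diag(S) @ Vt; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigenvalues of the covariance matrix follow from the singular values
eigvals = S**2 / (X.shape[0] - 1)

# Step 5: project onto the top-k directions
k = 2
X_pca = X_centered @ Vt[:k].T

print(X_pca.shape)               # (200, 2)
print(eigvals / eigvals.sum())   # explained variance ratio per component
```

Note that `np.linalg.svd` returns singular values in descending order, so the sorting step comes for free.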
Intuition — think of PCA like choosing camera angles
Imagine your high-dimensional dataset is a sculpture in a foggy gallery. You can't see all of it at once, and most photos you take are redundant. PCA finds the camera angles (orthogonal directions) that capture the sculpture's most dramatic shapes (variance). The first photo (PC1) captures the biggest silhouette; the second (PC2) captures the biggest remaining silhouette orthogonal to the first, and so on.
Ask yourself: "Do I want a faithful photograph (retain variance) or a labeled explanation (retain label info)?" If the latter, consider supervised feature selection (like mutual information) or supervised dimensionality reduction.
Practical checklist — how to apply PCA without accidentally sabotaging your model
- Standardize features: If features are on different scales, PCA will be biased toward large-scale features. Use StandardScaler (zero mean, unit variance) unless you have reason not to.
- Handle missing values first: PCA assumes complete data. Impute or use algorithms that support missingness (or iterative PCA).
- De-noise if necessary: Outliers and high noise can warp principal directions. Consider robust scaling or trimming.
- Choose k with care: Use explained variance ratio, scree plots, or cross-validate downstream model performance.
- Avoid label leakage: If PCA is fit on the whole dataset before train-test split, you leak information. Fit PCA on training only and apply transform to validation/test.
- Validate downstream: PCA is not guaranteed to improve model performance — test it!
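The leakage and validation points in this checklist are easiest to get right with a pipeline, which guarantees the scaler and PCA are fit on training data only. A sketch using scikit-learn's built-in breast cancer dataset (the 95% variance threshold is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaler and PCA live inside the pipeline, so they are fit on the
# training fold only -- no information leaks into the test set.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components for 95% of variance
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

pca = clf.named_steps["pca"]
print(pca.n_components_)         # how many components 95% variance required
print(clf.score(X_test, y_test)) # validate downstream, not just variance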
When PCA is your friend — and when it’s not
Pros:
- Great for compression and visualization (2–3 components for plotting clusters).
- Removes linear redundancy, which can help algorithms sensitive to collinearity (e.g., linear regression).
- Fast and deterministic; scalable with incremental or randomized SVD for large data.
Cons / Pitfalls:
- Unsupervised: PCA ignores the label. The directions of highest variance might be irrelevant for predicting the target.
- Loses interpretability: Components are linear mixes of features — harder to explain to stakeholders.
- Sensitive to scaling and outliers.
- Not ideal for nonlinear structure — use kernel PCA, t-SNE, or UMAP if manifold structure matters.
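The scaling pitfall is easy to demonstrate. Below, two independent features on wildly different scales (the scales are made up for illustration) show how unstandardized PCA lets the large-scale column swallow PC1:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two independent features on very different scales,
# e.g. a ratio vs. an amount in dollars.
X = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1000, 500),
])

# Without scaling, PC1 is almost entirely the large-scale column.
raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio)     # close to [1.0, 0.0]: the dollar column dominates
print(scaled_ratio)  # close to [0.5, 0.5]: both features contribute
```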
PCA vs Feature Selection (Correlation pruning, Mutual Information)
| Method | Uses labels? | Keeps original features? | Good for | Downsides |
|---|---|---|---|---|
| Correlation pruning | No (or weakly supervised with label corr) | Yes | Removing duplicates/multicollinearity | Ignores joint information across many features |
| Mutual information | Yes | Yes | Finding features with predictive info | Needs sufficient data; univariate measures miss multivariate synergies |
| PCA | No (unsupervised) | No (creates combos) | Compression, visualization, reducing multicollinearity | Can reduce predictive power if label is orthogonal to top variance |
Use both: prune blatantly redundant features first (correlation), test mutual info for important predictors, and apply PCA when you need compact representations or visualization.
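One way to combine these tools is to chain supervised selection and PCA in a single pipeline. A sketch on synthetic data (the k values here are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=8, random_state=0)

# First keep the 15 features with the highest mutual information with
# the label (supervised), then compress those into 5 principal
# components (unsupervised) before the classifier.
model = make_pipeline(
    SelectKBest(mutual_info_classif, k=15),
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(),
)
model.fit(X, y)
print(model.score(X, y))
```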
Advanced: Variants & real-production considerations
- Incremental PCA: For streaming or very large datasets — updates components without loading all data.
- Kernel PCA: Captures non-linear structure via kernels — useful when clusters lie on curves/manifolds.
- Sparse PCA: Tries to produce components that involve fewer original features — partially restores interpretability.
- Robust PCA: Separates low-rank structure from sparse noise (great if you have gross corruptions).
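For the incremental variant, scikit-learn's `IncrementalPCA` exposes a `partial_fit` method so you can feed data in chunks. A toy sketch (the chunk sizes and dimensions are arbitrary):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=3)

# Feed the data in chunks, as you would when it doesn't fit in memory
# (or arrives as a stream). partial_fit updates the components in place.
for _ in range(10):
    chunk = rng.normal(size=(100, 8))
    ipca.partial_fit(chunk)

X_new = rng.normal(size=(5, 8))
print(ipca.transform(X_new).shape)  # (5, 3)
```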
Operational tips (linking back to "handling real-world data issues"):
- If data drift occurs, principal directions can rotate; monitor explained variance and retrain PCA periodically.
- For imbalanced labels, remember PCA doesn’t fix imbalance — combine with SMOTE or class-weighting for classification.
- If noise dominates, PCA might capture noise variance; denoise first or choose components based on signal-to-noise considerations.
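One simple drift signal is reconstruction error: if the principal directions have rotated, projecting new data through the old components and back loses more information. A sketch with a simulated drift (the rotation and shift are made up to force the effect):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_ref = rng.normal(size=(500, 10))
pca = PCA(n_components=4).fit(X_ref)

def reconstruction_error(pca, X):
    """Mean squared error of projecting onto the components and back."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return float(np.mean((X - X_hat) ** 2))

baseline = reconstruction_error(pca, X_ref)

# Simulate drift: rotate and shift the distribution the PCA was fit on.
X_drifted = X_ref @ rng.normal(size=(10, 10)) + 5.0
drifted = reconstruction_error(pca, X_drifted)

print(drifted > baseline)  # error grows when the directions rotate
```

In production you would track this metric over time and retrain the PCA when it drifts past a threshold.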
Quick example (sketch in Python)
# X_train, X_val: feature matrices from a split made *before* any fitting
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
pca = PCA(n_components=0.95)  # keep enough components for 95% of variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_val_pca = pca.transform(scaler.transform(X_val))  # transform only -- no refit
# then feed X_train_pca into your model
Remember: fit the scaler and PCA on training data only, then only transform validation/test data.
Closing — a helpful mantra and key takeaways
"PCA compresses variance, not truth — check if that variance is the part of truth you actually need."
Key takeaways:
- PCA is unsupervised compression: it optimizes for variance, not predictive power.
- Standardize & avoid leakage: always fit transforms on training data only.
- Combine tools: use PCA after correlation pruning or as part of a pipeline that also includes supervised feature selection (mutual information) and robust preprocessing.
- Monitor in production: components change with drift; use incremental PCA or retrain periodically.
Final thought: use PCA like a stylist — it can make your dataset sleeker and easier to work with, but don’t let it dress your model in a costume that hides the thing you’re trying to predict. Experiment, validate, and when in doubt, plot it — humans are still pretty good at spotting useful structure in 2D.