Dimensionality Reduction and Feature Selection
Reduce redundancy and highlight signal with supervised and unsupervised techniques.
Principal Component Analysis — PCA but Make It Dance
"PCA: turning messy, crowded data into a tasteful, minimalist party where every guest actually contributes." — Your slightly dramatic ML TA
Hook: Remember when you pruned features based on correlation and mutual information?
You already learned how to yank out boring twins (correlation-based pruning) and how to check whether a feature whispers anything useful about the label (mutual information). PCA is the next trick in the kit — but it’s a different animal. Instead of choosing features, PCA creates new ones: linear combos of the originals that capture the most variance.
Why this matters now: after you handled noise, imbalance, and pruning correlated junk, you still might have dozens — or hundreds — of features that are noisy, redundant, or just plain inconvenient for modeling and visualization. PCA helps compress that space while retaining structure. But it’s also a blunt instrument: it sacrifices original interpretability for compactness.
What PCA actually does (the short, dramatic version)
- Take your centered data matrix X (rows = samples, cols = features).
- Find orthogonal directions (principal components) that capture maximal variance.
- Project data onto the top k directions to reduce dimensionality.
In math-speak (read like a recipe):
1. X_centered = X - mean(X, axis=0)
2. C = cov(X_centered) = (X_centered.T @ X_centered) / (n-1)
3. Compute eigenvectors (V) and eigenvalues (Λ) of C
4. Sort eigenvectors by eigenvalues (descending)
5. Project: X_pca = X_centered @ V_k # keep top k eigenvectors
Alternative: do SVD on X_centered directly (numerically more stable): X_centered = U Σ V^T, then principal directions = columns of V.
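The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the correlated matrix and k=2 are arbitrary choices), using the SVD route since it avoids forming the covariance matrix explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with correlated features (200 samples, 5 features)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Step 1: center the data
X_centered = X - X.mean(axis=0)

# SVD route: X_centered = U @ diag(S) @ Vt; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigenvalues of the covariance matrix follow from the singular values
eigvals = S**2 / (X.shape[0] - 1)

# Step 5: project onto the top-k directions
k = 2
X_pca = X_centered @ Vt[:k].T

print(X_pca.shape)               # (200, 2)
print(eigvals / eigvals.sum())   # explained variance ratio per component
```

Note that `np.linalg.svd` returns singular values in descending order, so the sorting step comes for free.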
Intuition — think of PCA like choosing camera angles
Imagine your high-dimensional dataset is a sculpture in a foggy gallery. You can't see all of it at once, and most photos you take are redundant. PCA finds the camera angles (orthogonal directions) that capture the sculpture's most dramatic shapes (variance). The first photo (PC1) captures the biggest silhouette; the second (PC2) captures the biggest remaining silhouette orthogonal to the first, and so on.
Ask yourself: "Do I want a faithful photograph (retain variance) or a labeled explanation (retain label info)?" If the latter, consider supervised feature selection (like mutual information) or supervised dimensionality reduction.
Practical checklist — how to apply PCA without accidentally sabotaging your model
- Standardize features: If features are on different scales, PCA will be biased toward large-scale features. Use StandardScaler (zero mean, unit variance) unless you have reason not to.
- Handle missing values first: PCA assumes complete data. Impute or use algorithms that support missingness (or iterative PCA).
- De-noise if necessary: Outliers and high noise can warp principal directions. Consider robust scaling or trimming.
- Choose k with care: Use explained variance ratio, scree plots, or cross-validate downstream model performance.
- Avoid label leakage: If PCA is fit on the whole dataset before train-test split, you leak information. Fit PCA on training only and apply transform to validation/test.
- Validate downstream: PCA is not guaranteed to improve model performance — test it!
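The leakage and validation points in this checklist are easiest to get right with a pipeline, which guarantees the scaler and PCA are fit on training data only. A sketch using scikit-learn's built-in breast cancer dataset (the 95% variance threshold is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaler and PCA live inside the pipeline, so they are fit on the
# training fold only -- no information leaks into the test set.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components for 95% of variance
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

pca = clf.named_steps["pca"]
print(pca.n_components_)         # how many components 95% variance required
print(clf.score(X_test, y_test)) # validate downstream, not just variance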
When PCA is your friend — and when it’s not
Pros:
- Great for compression and visualization (2–3 components for plotting clusters).
- Removes linear redundancy, which can help algorithms sensitive to collinearity (e.g., linear regression).
- Fast and deterministic; scalable with incremental or randomized SVD for large data.
Cons / Pitfalls:
- Unsupervised: PCA ignores the label. The directions of highest variance might be irrelevant for predicting the target.
- Loses interpretability: Components are linear mixes of features — harder to explain to stakeholders.
- Sensitive to scaling and outliers.
- Not ideal for nonlinear structure — use kernel PCA, t-SNE, or UMAP if manifold structure matters.
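The scaling pitfall is easy to demonstrate. Below, two independent features on wildly different scales (the scales are made up for illustration) show how unstandardized PCA lets the large-scale column swallow PC1:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two independent features on very different scales,
# e.g. a ratio vs. an amount in dollars.
X = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1000, 500),
])

# Without scaling, PC1 is almost entirely the large-scale column.
raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio)     # close to [1.0, 0.0]: the dollar column dominates
print(scaled_ratio)  # close to [0.5, 0.5]: both features contribute
```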
PCA vs Feature Selection (Correlation pruning, Mutual Information)
| Method | Uses labels? | Keeps original features? | Good for | Downsides |
|---|---|---|---|---|
| Correlation pruning | No (or weakly supervised with label corr) | Yes | Removing duplicates/multicollinearity | Ignores joint information across many features |
| Mutual information | Yes | Yes | Finding features with predictive info | Needs sufficient data; univariate measures miss multivariate synergies |
| PCA | No (unsupervised) | No (creates combos) | Compression, visualization, reducing multicollinearity | Can reduce predictive power if label is orthogonal to top variance |
Use both: prune blatantly redundant features first (correlation), test mutual info for important predictors, and apply PCA when you need compact representations or visualization.
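One way to combine these tools is to chain supervised selection and PCA in a single pipeline. A sketch on synthetic data (the k values here are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=8, random_state=0)

# First keep the 15 features with the highest mutual information with
# the label (supervised), then compress those into 5 principal
# components (unsupervised) before the classifier.
model = make_pipeline(
    SelectKBest(mutual_info_classif, k=15),
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(),
)
model.fit(X, y)
print(model.score(X, y))
```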
Advanced: Variants & real-production considerations
- Incremental PCA: For streaming or very large datasets — updates components without loading all data.
- Kernel PCA: Captures non-linear structure via kernels — useful when clusters lie on curves/manifolds.
- Sparse PCA: Tries to produce components that involve fewer original features — partially restores interpretability.
- Robust PCA: Separates low-rank structure from sparse noise (great if you have gross corruptions).
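For the incremental variant, scikit-learn's `IncrementalPCA` exposes a `partial_fit` method so you can feed data in chunks. A toy sketch (the chunk sizes and dimensions are arbitrary):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=3)

# Feed the data in chunks, as you would when it doesn't fit in memory
# (or arrives as a stream). partial_fit updates the components in place.
for _ in range(10):
    chunk = rng.normal(size=(100, 8))
    ipca.partial_fit(chunk)

X_new = rng.normal(size=(5, 8))
print(ipca.transform(X_new).shape)  # (5, 3)
```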
Operational tips (linking back to "handling real-world data issues"):
- If data drift occurs, principal directions can rotate; monitor explained variance and retrain PCA periodically.
- For imbalanced labels, remember PCA doesn’t fix imbalance — combine with SMOTE or class-weighting for classification.
- If noise dominates, PCA might capture noise variance; denoise first or choose components based on signal-to-noise considerations.
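One simple drift signal is reconstruction error: if the principal directions have rotated, projecting new data through the old components and back loses more information. A sketch with a simulated drift (the rotation and shift are made up to force the effect):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_ref = rng.normal(size=(500, 10))
pca = PCA(n_components=4).fit(X_ref)

def reconstruction_error(pca, X):
    """Mean squared error of projecting onto the components and back."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return float(np.mean((X - X_hat) ** 2))

baseline = reconstruction_error(pca, X_ref)

# Simulate drift: rotate and shift the distribution the PCA was fit on.
X_drifted = X_ref @ rng.normal(size=(10, 10)) + 5.0
drifted = reconstruction_error(pca, X_drifted)

print(drifted > baseline)  # error grows when the directions rotate
```

In production you would track this metric over time and retrain the PCA when it drifts past a threshold.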
Quick example (sketch in Python)
# X_train, X_val: feature matrices from a split made *before* any fitting
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
pca = PCA(n_components=0.95)  # keep enough components for 95% of variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_val_pca = pca.transform(scaler.transform(X_val))  # transform only -- no refit
# then feed X_train_pca into your model
Remember: fit the scaler and PCA on training data only, then only transform validation/test data.
Closing — a helpful mantra and key takeaways
"PCA compresses variance, not truth — check if that variance is the part of truth you actually need."
Key takeaways:
- PCA is unsupervised compression: it optimizes for variance, not predictive power.
- Standardize & avoid leakage: always fit transforms on training data only.
- Combine tools: use PCA after correlation pruning or as part of a pipeline that also includes supervised feature selection (mutual information) and robust preprocessing.
- Monitor in production: components change with drift; use incremental PCA or retrain periodically.
Final thought: use PCA like a stylist — it can make your dataset sleeker and easier to work with, but don’t let it dress your model in a costume that hides the thing you’re trying to predict. Experiment, validate, and when in doubt, plot it — humans are still pretty good at spotting useful structure in 2D.