Distance- and Kernel-Based Methods
Leverage neighborhood and kernel ideas with kNN and SVM for nonlinear decision boundaries.
Curse of Dimensionality — Why Distances Start Lying (and What to Do About It)
"In high dimensions, everything is almost the same distance away from everything else." — Your nearest neighbor, lying to you softly.
You already know about distance metrics and how scaling features matters, and we've just used those ideas in k-NN classification. Now let's crank up the dimensionality knob and witness the ensuing existential crisis for distance- and kernel-based methods. This is the curse of dimensionality — the reason your perfectly reasonable algorithm at d=2 becomes a confused oracle at d=2000.
What is the curse of dimensionality (in plain, melodramatic English)?
The curse of dimensionality is the set of nasty phenomena that appear when the number of features (dimensions) grows:
- Data becomes sparse — points occupy exponentially larger space.
- Distances 'concentrate' — the difference between nearest and farthest neighbor shrinks.
- Kernel values (e.g., Gaussian) become nearly constant unless you tune bandwidth absurdly.
- Sample complexity explodes — you need exponentially more data to cover the space.
Intuitively: imagine trying to cover a line segment with samples — easy. A square — harder. A cube — harder. A 1000-dimensional hypercube? You need a universe-sized dataset to meaningfully populate it.
Math snack (no PhD required)
Volume scaling: For many shapes, volume scales with radius^d. If you want to cover a unit hypercube to granularity r, you need about (1/r)^d samples. Blow-up = exponential in d.
Nearest neighbor distance in unit hypercube: With n random points in [0,1]^d, the expected distance to the nearest neighbor scales like n^{-1/d}. So to halve that distance you must multiply n by 2^d: an exponential cost.
Distance concentration: For many distributions, std(distance) / E(distance) → 0 as d increases. Distances bunch up, so 'closest' and 'farthest' become almost equal.
These formulas explain why your weirdly clever distance weighting or Gaussian kernel suddenly stops discriminating.
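Both effects are easy to verify numerically. Here is a minimal sketch (plain NumPy, synthetic uniform data; the helper name `distance_spread` is illustrative, not a library function) comparing the relative spread of distances to a random query point at d=2 versus d=1000:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n, d):
    """Relative spread (farthest - nearest) / nearest of distances
    from one random query point to n points in [0, 1]^d."""
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

low = distance_spread(1000, 2)      # nearest neighbor is many times
high = distance_spread(1000, 1000)  # closer than the farthest point...
print(f"d=2:    relative spread = {low:.2f}")
print(f"d=1000: relative spread = {high:.2f}")  # ...until distances concentrate
```

Exact numbers vary with the random sample, but the pattern is robust: a large spread in 2-d collapsing to well below 1 in 1000-d.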
What goes wrong for k-NN and kernel classifiers
k-NN: If all neighbors are almost equally far, the label vote becomes noisy and unstable. k becomes a fragile hyperparameter: too small and predictions are high-variance; too large and the vote includes irrelevant, distant points.
Kernels (e.g., Gaussian RBF): Kernel value K(x, x') = exp(-||x-x'||^2 / (2σ^2)). If ||x-x'||^2 is nearly constant across pairs, then K is nearly constant unless σ is tiny. But tiny σ makes the kernel spike only at identical points — overfitting.
Calibration / thresholds: Remember our previous topic about choosing thresholds and calibrating probabilities? In high-d, probability estimates from distance- or kernel-based density estimates are poorly behaved: they overfit local noise or are nearly uniform, so thresholds become meaningless.
Question: How do you pick σ when pairwise distances concentrate? The median heuristic (σ = median pairwise distance) can collapse — median may carry little signal.
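Assuming uniform synthetic data, this collapse can be shown in a few lines (the function name is my own; sigma here is set to the square root of the median squared pairwise distance, a common variant of the median heuristic):

```python
import numpy as np

def rbf_kernel_spread(n, d, seed):
    """Range of Gaussian-kernel values over all pairs of n random
    points in [0,1]^d, with bandwidth from the median heuristic."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    norms = (X ** 2).sum(axis=1)
    sq = norms[:, None] + norms[None, :] - 2 * X @ X.T   # pairwise ||x - x'||^2
    pair_sq = np.maximum(sq[np.triu_indices(n, k=1)], 0) # clip rounding noise
    sigma = np.sqrt(np.median(pair_sq))                  # median heuristic
    K = np.exp(-pair_sq / (2 * sigma ** 2))
    return K.max() - K.min()

low_d = rbf_kernel_spread(200, 2, seed=1)    # kernel still separates pairs
high_d = rbf_kernel_spread(200, 500, seed=1) # values pile up near exp(-1/2)
print(f"d=2:   kernel value range = {low_d:.3f}")
print(f"d=500: kernel value range = {high_d:.3f}")
```

In low d the kernel values span most of (0, 1]; in high d they bunch into a narrow band, so the kernel barely discriminates between pairs.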
Real-world examples
Text data with TF-IDF: Dimensionality = vocabulary size (thousands). Cosine similarity often beats Euclidean distance because angle matters more than magnitude in sparse high-d space.
Genomics: Each sample has thousands of SNPs. Euclidean distance doesn't capture meaningful biological similarity unless you reduce dimensions or pick features.
Image embeddings: Raw pixels are huge-d, but natural images live on a lower-dimensional manifold; PCA, autoencoders, or learned embeddings sidestep the curse.
Mitigation strategies (practical playbook)
Feature selection
- Remove irrelevant features. Simpler, often effective.
Dimensionality reduction
- Linear: PCA (keeps variance), random projections (Johnson–Lindenstrauss), truncated SVD for sparse matrices.
- Nonlinear: autoencoders, kernel PCA, UMAP/t-SNE (mainly visualization).
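When the data really does have low intrinsic dimension, PCA makes that visible immediately. A sketch under stated assumptions: synthetic data with 5 true degrees of freedom linearly embedded in 100 ambient dimensions, with PCA done via SVD in plain NumPy (all variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

# 500 samples that are truly 5-dimensional, linearly embedded in 100-d
# with a little additive noise.
n, d_true, d = 500, 5, 100
latent = rng.standard_normal((n, d_true))
mixing = rng.standard_normal((d_true, d))
X = latent @ mixing + 0.01 * rng.standard_normal((n, d))

# PCA via SVD of the centered data: explained-variance ratios.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s ** 2 / (s ** 2).sum()
print(f"variance captured by top 5 components: {explained[:5].sum():.4f}")
```

A value near 1.0 says distances can be computed in the 5-d projection with almost nothing lost, turning a 100-d problem back into one where "nearest" means something.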
Metric learning
- Learn a Mahalanobis distance (or use LMNN) so distances reflect label structure.
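Concretely, a Mahalanobis distance replaces ||x - y|| with sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M learned from labels. A toy sketch (this M is hand-picked for illustration, not learned):

```python
import numpy as np

def mahalanobis(x, y, M):
    """Distance under metric M: sqrt((x - y)^T M (x - y)).
    M = identity recovers plain Euclidean distance."""
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))

# Hypothetical learned metric: emphasize feature 0, ignore noisy feature 2.
M = np.diag([4.0, 1.0, 0.0])
x = np.array([1.0, 0.0, 5.0])
y = np.array([0.0, 0.0, -5.0])
print(mahalanobis(x, y, M))          # 2.0: the noisy feature is ignored
print(mahalanobis(x, y, np.eye(3)))  # ~10.05: Euclidean is dominated by it
```

Methods like LMNN learn M so that same-label points are pulled together and differently labeled points pushed apart, which is exactly the structure raw Euclidean distance lacks in high d.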
Local/adaptive kernels
- Use local bandwidth σ_i = distance to k-th neighbor (local scaling) rather than a single global σ.
Example (pseudocode):

```
# Local scaling: each point x_i gets its own bandwidth sigma_i,
# the distance to its k-th nearest neighbor.
for each point x_i:
    sigma_i = dist(x_i, kth_nearest_neighbor(x_i, k))

# The kernel between two points then uses both local bandwidths:
# K(x_i, x_j) = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j))
```

Use similarity measures that suit high-d sparse data
- Cosine similarity for text, Jaccard for sets, Hamming for binary vectors.
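A tiny illustration of why cosine wins on sparse count-like data (the toy term counts below are invented): two documents about the same topic but of very different lengths, plus one unrelated document.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: compares direction, ignores magnitude."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-count vectors over a 4-word vocabulary.
short_doc = np.array([2.0, 1.0, 0.0, 0.0])
long_doc = np.array([20.0, 10.0, 0.0, 0.0])  # same topic, 10x longer
other_doc = np.array([0.0, 0.0, 3.0, 2.0])   # different topic

# Euclidean distance is dominated by document length...
print(np.linalg.norm(short_doc - long_doc))   # ~20.1: "same topic" looks far
print(np.linalg.norm(short_doc - other_doc))  # ~4.2: "other topic" looks near
# ...while cosine similarity looks only at direction.
print(cosine(short_doc, long_doc))   # 1.0
print(cosine(short_doc, other_doc))  # 0.0
```

Normalizing to unit length (which is what cosine implicitly does) removes the magnitude axis that carried no topical signal.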
Switch method
- Tree-based models (random forests, gradient boosting) and linear models with regularization often handle high-d better than plain k-NN.
Approximate neighbors and hashing
- Use locality-sensitive hashing (LSH) or ANN libraries (FAISS, Annoy) to scale to large n in high d.
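FAISS and Annoy are the production route; as a sketch of the underlying idea, here is random-hyperplane LSH in a few lines of NumPy (function and parameter names are my own, not the FAISS/Annoy API): each point is hashed by the signs of a few random projections, and points with similar direction tend to share signature bits.

```python
import numpy as np

rng = np.random.default_rng(3)

def lsh_signatures(X, n_planes=16):
    """Random-hyperplane LSH: each point becomes a bit signature; points
    pointing in similar directions agree on most bits with high probability."""
    planes = rng.standard_normal((X.shape[1], n_planes))
    return X @ planes > 0  # boolean (n_points, n_planes) matrix

X = rng.standard_normal((1000, 64))
sigs = lsh_signatures(X)

# Candidate neighbors of point 0: the points sharing the most signature
# bits. Only these few candidates need an exact distance computation.
matches = (sigs == sigs[0]).sum(axis=1)
candidates = np.argsort(-matches)[:10]
print("candidate neighbor ids:", candidates)
```

The payoff is that exact distances are computed only for a small candidate set instead of all n points, which is what makes neighbor search tractable at scale.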
Regularization & cross-validation
- Use strong validation and regularization to prevent overfitting when kernels become spiky.
Quick comparison table
| Phenomenon | Low-d behavior | High-d failure mode | Mitigation |
|---|---|---|---|
| Distance spread | Nearest << Farthest | Distances concentrate (Nearest ≈ Farthest) | Dim. reduction, metric learning |
| Kernel values | Informative gradient | Nearly uniform or spiky | Local bandwidth, adaptive kernels |
| Sample need | Moderate n | n grows ~c^d | Feature selection, projections |
| Similarity | Euclidean OK | Euclidean nonsense for sparse data | Cosine/Jaccard, Mahalanobis |
Heuristic checklist (what to try, in order)
- Plot the pairwise distance distribution. If it's narrow, suspect the curse.
- Try PCA and look at explained variance. If top components capture most variance, reduce d.
- Replace Euclidean with cosine/Hamming if data is sparse/binary.
- Use local scaling or metric learning if labels indicate structure.
- If still stuck, move to models less dependent on raw distances (regularized linear models, tree ensembles, neural nets).
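The first item on that checklist can be reduced to a single number: the coefficient of variation of sampled pairwise distances. A diagnostic sketch (the name `concentration_score` and the rough thresholds are my own):

```python
import numpy as np

def concentration_score(X, n_pairs=2000, seed=0):
    """std/mean of sampled pairwise distances; values near 0 mean
    distances have concentrated and 'nearest' carries little signal."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    keep = i != j                     # drop self-pairs
    dists = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
    return float(dists.std() / dists.mean())

rng = np.random.default_rng(4)
score_low = concentration_score(rng.random((500, 3)))    # healthy spread
score_high = concentration_score(rng.random((500, 800))) # concentrated
print(f"d=3:   {score_low:.3f}")
print(f"d=800: {score_high:.3f}")
```

Running this on your own feature matrix before and after a mitigation (PCA, cosine normalization, metric learning) gives a quick check that the fix actually restored distance contrast.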
Closing — the punchline
The curse of dimensionality isn't mystical — it's a geometric inevitability. Distance- and kernel-based methods rely on meaningful differences in distances; high dimension often erodes those differences. But the story isn't doom: often real data lie on low-dimensional manifolds or have redundant features. Use diagnostics (distance histograms, PCA), then apply the right fix: reduce dimensionality, learn a metric, or switch similarity functions.
Final tiny obsession-worthy thought: if your model is screaming for more data to solve a high-d problem, ask whether you can instead give it better representation. Better features beat bigger datasets most of the time.
Key takeaways:
- Distances concentrate as d grows; kernels collapse without careful bandwidth choice.
- Mitigate with feature engineering, dimensionality reduction, metric learning, or alternative similarities.
- Always relate these fixes back to calibration and decision thresholds: if your estimated probabilities are untrustworthy, cost-aware decisions will be too.
Version note: Builds on earlier discussions of distance metrics and k-NN, and links to calibration/thresholding by explaining why probability estimates from distance/kernel methods become unreliable in high dimensions.