Distance- and Kernel-Based Methods
Leverage neighborhood and kernel ideas with kNN and SVM for nonlinear decision boundaries.
Distance Metrics and Scaling Effects — The Secret Sauce Behind kNN and Kernels
Imagine your features are a band performing together. If one of them brings a megaphone and the others whisper, the megaphone will dictate the concert. That, my friend, is what happens when you fail to scale your features.
This lesson builds on your earlier adventures with k-NN for regression and classification and the careful art of thresholding and calibration. There we learned how to turn neighbor votes or distance-weighted sums into class predictions and probabilities. Now we answer the more foundational question: how do we measure neighbor-ness in the first place, and how does scaling warp those measurements like a funhouse mirror?
Why this matters (quick recap and motivation)
- k-NN and many kernel methods depend entirely on distances. If distances are broken, predictions are broken.
- Scaling affects not just classification labels but also distance-weighted probabilities, and therefore calibration and threshold decisions you learned earlier.
- Choosing the wrong metric or failing to scale is like using a ruler measured in pianos — confusing and unhelpful.
Ask yourself: do you want your model to listen to the signal in each feature, or just the loudest one? Scaling chooses which.
Common distance metrics (and when to use them)
Euclidean distance (L2)
- Formula: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
- Interpretation: straight-line distance in feature space; sensitive to large differences and scale.
- Use when features are continuous and comparably scaled.
Manhattan distance (L1)
- Formula: d(x, y) = sum_i |x_i - y_i|
- Interpretation: sum of absolute differences; more robust to one large coordinate than L2.
- Use when outliers in single features exist or when you want sparse influence.
Minkowski distance
- Generalization of L1 and L2 with parameter p. p=1 -> L1, p=2 -> L2.
Cosine distance / similarity
- Formula (similarity): cos(theta) = (x dot y) / (||x|| ||y||). Distance = 1 - similarity.
- Interpretation: measures angle, not magnitude. Great when direction matters more than scale.
- Use when features represent profiles, term frequencies, or any vector where magnitude is uninformative.
Hamming distance
- Counts mismatches for categorical or binary features.
- Use when features are nominal or binary and equal-weight mismatches matter.
Jaccard distance
- Formula: 1 - (intersection / union) for sparse binary sets.
- Use when presence/absence of attributes matters and overlap is more meaningful than raw mismatches.
Mahalanobis distance
- Formula: d(x, y) = sqrt((x - y)^T S^{-1} (x - y)), where S is the covariance matrix.
- Interpretation: scales and decorrelates features using their covariance. Accounts for correlated features and anisotropic variance.
- Use when features are correlated and you want distances in units of standard deviation along principal axes.
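Most of these metrics are available off the shelf in scipy.spatial.distance. A quick sketch with made-up points (the values are arbitrary; note the cosine result for two parallel vectors):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # y = 2x: same direction, different magnitude

print(distance.euclidean(x, y))       # sqrt(1 + 4 + 9) ≈ 3.742
print(distance.cityblock(x, y))       # |1| + |2| + |3| = 6.0 (Manhattan / L1)
print(distance.minkowski(x, y, p=3))  # Minkowski with p = 3
print(distance.cosine(x, y))          # ≈ 0: angle is zero, magnitude ignored

# Mahalanobis needs an inverse covariance estimated from data
X = np.random.default_rng(0).normal(size=(200, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```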
Scaling options and their effects (the megaphone problem solved)
Why scale? Because many metrics (especially Euclidean) are sensitive to units and variance. If one feature ranges 0-10 and another 0-10,000, the second dominates distances.
Common scalers:
- Standard scaler (z-score): (x - mean) / std. Centers to mean 0 and variance 1. Good default.
- Min-max scaler: (x - min) / (max - min). Maps into [0, 1]. Keeps relative spacing but changes variance.
- Robust scaler: center with median and scale by IQR. Resistant to outliers.
- Unit vector normalization: scale each sample to unit norm. Often used before cosine similarity or when direction matters.
Practical guidance:
- If using Euclidean or RBF kernels, scale features to comparable variance (z-score or robust for outliers).
- If features have meaningful zero and bounded ranges, min-max can help preserve interpretability.
- For sparse high-dimensional data with counts (text), use tf-idf + unit-normalize before cosine similarity.
- For categorical features use appropriate encoding (one-hot or embeddings) and treat their scale carefully; sometimes weighting them by domain knowledge is needed.
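To see the megaphone problem concretely, here is a toy sketch (the points and query are invented) in which the nearest neighbor flips once both features are z-scored:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# feature A lives in [0, 1]; feature B lives around 100
X = np.array([[0.00, 100.0],
              [1.00, 105.0]])
query = np.array([0.05, 104.0])

def nearest(points, q):
    """Index of the Euclidean nearest neighbor of q among points."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

print(nearest(X, query))  # 1: feature B's big numbers dominate the distance

scaler = StandardScaler().fit(X)
Xs = scaler.transform(X)
qs = scaler.transform(query.reshape(1, -1))[0]
print(nearest(Xs, qs))    # 0: after z-scoring, feature A finally gets a vote
```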
How scaling affects k-NN predictions and calibration
- Distance-weighted k-NN: predictions often use weights such as w = 1 / (d + eps) or w = exp(-d^2 / (2 sigma^2)). Scaling changes the magnitude of d, which changes weights dramatically.
- Kernels (RBF, Laplacian) are functions of distance. Their bandwidth parameter interacts with feature scaling. A tiny scale makes all points seem far (or near), breaking the kernel.
- Calibration: if scaling inflates distances, your model may produce overconfident or underconfident probability estimates. That breaks thresholding choices you made earlier when being cost-aware.
Quick example: suppose feature A ranges [0,1], feature B ranges [0,1000]. Without scaling, B controls neighbors, so any threshold to minimize false positives will mostly reflect B. After scaling, both features influence the calibrated probabilities.
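The weight formula above makes this easy to demonstrate. In this sketch the distances d_raw and d_scaled are hypothetical, standing in for the same neighbor pair measured before and after z-scoring:

```python
import math

def rbf_weight(d, sigma=1.0):
    """RBF-style neighbor weight w = exp(-d^2 / (2 * sigma^2))."""
    return math.exp(-d**2 / (2 * sigma**2))

d_raw = 885.0     # hypothetical: dominated by an unscaled 0-1000 feature
d_scaled = 1.2    # hypothetical: same pair after both features are z-scored

print(rbf_weight(d_raw))     # underflows to 0.0: the neighbor is silenced
print(rbf_weight(d_scaled))  # ≈ 0.487: the neighbor has real influence
```

Any probabilities built from such weights inherit this distortion, which is why calibration and thresholds must be revisited after any change of scaling.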
Correlation, PCA, and Mahalanobis — when features play twin roles
If features are highly correlated, Euclidean distance double-counts information. Two correlated features that move together amplify the same signal. Mahalanobis distance fixes this by using the inverse covariance to decorrelate feature axes.
- Use Mahalanobis if you know and trust your covariance estimate (beware small sample sizes producing noisy covariance matrices).
- Alternatively, use PCA to rotate and reduce dimensions before distance computation. PCA on standardized features often yields better neighbor structure in high-dim spaces.
Note: curse of dimensionality still lurks. As dimensions grow, distances concentrate and nearest neighbors get less meaningful. Dimensionality reduction or feature selection becomes essential.
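A small sketch of the decorrelation effect (synthetic data; the covariance is estimated from the sample): two displacements with identical Euclidean length get very different Mahalanobis lengths, because one runs along the correlation direction and the other against it.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500)])  # strongly correlated pair

S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # noisy for small samples!

def mahalanobis(x, y, S_inv):
    diff = x - y
    return float(np.sqrt(diff @ S_inv @ diff))

origin = np.zeros(2)
along = np.array([1.0, 1.0])    # moves with the correlation: "expected"
across = np.array([1.0, -1.0])  # moves against it: "surprising"

print(mahalanobis(along, origin, S_inv))   # small
print(mahalanobis(across, origin, S_inv))  # much larger, same Euclidean length
```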
Practical checklist and recipes
- Always examine feature scales and distributions. Plot histograms and pairwise scatterplots.
- Choose metric by feature type: Euclidean/Lp for continuous, Hamming/Jaccard for binary/categorical, cosine for directional data.
- Standardize continuous features before L2-based methods. Use robust scaling if outliers dominate.
- For kernels, tune bandwidth after scaling. Try the median heuristic (bandwidth ~ median pairwise distance) as a starting point.
- If features are correlated, test Mahalanobis or PCA-based distances.
- When combining categorical and continuous features, consider mixed metrics (Gower distance) or transform categorical features thoughtfully and weight features.
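The median heuristic from the checklist takes only a few lines. This sketch uses synthetic data with wildly unequal feature scales, and converts sigma to scikit-learn's gamma parameterization of the RBF kernel, exp(-gamma * d^2):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 100.0], size=(300, 2))  # unequal scales

Xs = StandardScaler().fit_transform(X)  # scale FIRST, then pick the bandwidth
sigma = np.median(pdist(Xs))            # bandwidth ~ median pairwise distance
gamma = 1.0 / (2 * sigma**2)            # equivalent RBF gamma
print(sigma, gamma)
```

It is a starting point for cross-validation, not a final answer.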
A runnable version of a safe k-NN pipeline (scikit-learn; continuous_cols and categorical_cols are your own column lists):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

prep = ColumnTransformer([
    ("cont", RobustScaler(), continuous_cols),  # robust to outliers
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = make_pipeline(prep, KNeighborsClassifier(
    metric="minkowski", p=2, weights="distance"))
# tune n_neighbors (and any bandwidth-like parameters) via cross-validation
Quick thought experiments (to make this stick)
- If you convert a room's temperature readings from Celsius to Fahrenheit, does neighbor-ness change? Only if your model sees temperature as raw numbers and you forgot scaling.
- Imagine text vectors: cosine similarity finds similar topics even if one document is longer. Euclidean would be fooled by length differences.
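That text-vector intuition is easy to verify with toy count vectors (made up here): repeating a document doubles its length but leaves its direction, and hence cosine distance, unchanged, while Euclidean distance grows with length.

```python
import numpy as np
from scipy.spatial import distance

doc = np.array([3.0, 0.0, 2.0, 1.0])  # term counts for a short document
doc_twice = 2 * doc                    # the same text, repeated twice

print(distance.cosine(doc, doc_twice))     # ≈ 0: same topic, same direction
print(distance.euclidean(doc, doc_twice))  # = ||doc|| ≈ 3.742: fooled by length
```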
Questions to ask yourself when debugging a weird k-NN model:
- Are a few features dominating predictions?
- Do nearest neighbors look sensible visually in a 2D PCA plot?
- How do probabilities change if I standardize? If they change a lot, scaling mattered.
Final takeaways — the mic drop
- Distance is not holy; it is mutable, and you shape it by choosing a metric and scaling. Treat that choice like a hyperparameter.
- Scaling changes both label predictions and the probabilities you may later calibrate and threshold for cost-sensitive decisions.
- Use the right metric for your data type, standardize sensibly, consider correlations via Mahalanobis or PCA, and always tune kernel bandwidths after scaling.
If your model is ignoring useful features, it is probably because those features are whispering while one loud feature screams. Give them equal volume — or intentionally weight them based on domain knowledge — but at least make the choice consciously.
Now go scale something responsibly. Your k will thank you.