
Supervised Machine Learning: Regression and Classification

Distance- and Kernel-Based Methods


Leverage neighborhood and kernel ideas with kNN and SVM for nonlinear decision boundaries.



Distance Metrics and Scaling Effects — The Secret Sauce Behind kNN and Kernels

Imagine your features are a band performing together. If one of them brings a megaphone and the others whisper, the megaphone will dictate the concert. That, my friend, is what happens when you fail to scale your features.

This lesson builds on your earlier adventures with k-NN for regression and classification and the careful art of thresholding and calibration. There we learned how to turn neighbor votes or distance-weighted sums into class predictions and probabilities. Now we answer the more foundational question: how do we measure neighbor-ness in the first place, and how does scaling warp those measurements like a funhouse mirror?


Why this matters (quick recap and motivation)

  • k-NN and many kernel methods depend entirely on distances. If distances are broken, predictions are broken.
  • Scaling affects not just classification labels but also distance-weighted probabilities, and therefore calibration and threshold decisions you learned earlier.
  • Choosing the wrong metric or failing to scale is like using a ruler measured in pianos — confusing and unhelpful.

Ask yourself: do you want your model to listen to the signal in each feature, or just the loudest one? Scaling chooses which.


Common distance metrics (and when to use them)

Euclidean distance (L2)

  • Formula: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
  • Interpretation: straight-line distance in feature space; sensitive to large differences and scale.
  • Use when features are continuous and comparably scaled.

Manhattan distance (L1)

  • Formula: d(x, y) = sum_i |x_i - y_i|
  • Interpretation: sum of absolute differences; more robust to one large coordinate than L2.
  • Use when outliers in single features exist or when you want sparse influence.
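A quick numeric check of that robustness claim (toy numbers, purely for illustration):

```python
import numpy as np

a = np.zeros(3)
b = np.array([1.0, 1.0, 10.0])  # one coordinate is much larger than the rest

l2 = np.linalg.norm(a - b)      # sqrt(1 + 1 + 100), about 10.10
l1 = np.abs(a - b).sum()        # 1 + 1 + 10 = 12

# Share of the total distance driven by the single large coordinate:
l2_share = 10.0 ** 2 / (a - b).dot(a - b)   # about 0.98 under squared L2
l1_share = 10.0 / l1                        # about 0.83 under L1
```

Under L2 the big coordinate supplies roughly 98% of the squared distance; under L1 only about 83%. The same outlier shouts less.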

Minkowski distance

  • Generalization of L1 and L2 with parameter p. p=1 -> L1, p=2 -> L2.

Cosine distance / similarity

  • Formula (similarity): cos(theta) = (x dot y) / (||x|| ||y||). Distance = 1 - similarity.
  • Interpretation: measures angle, not magnitude. Great when direction matters more than scale.
  • Use when features represent profiles, term frequencies, or any vector where magnitude is uninformative.
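To see angle-versus-magnitude in action, compare a vector with a stretched copy of itself (toy vectors):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 10.0 * x  # same direction ("same topic"), ten times the magnitude

cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cos_dist = 1.0 - cos_sim        # essentially 0: cosine ignores magnitude
eucl = np.linalg.norm(x - y)    # about 33.7: Euclidean is fooled by length
```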

Hamming distance

  • Counts mismatches for categorical or binary features.
  • Use when features are nominal or binary and equal-weight mismatches matter.

Jaccard distance

  • For sparse binary sets: 1 - (intersection / union).
  • Use when presence/absence of attributes matters and overlap is more meaningful than raw mismatches.
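Both counts can be computed directly on boolean vectors (toy example; SciPy's `scipy.spatial.distance` also provides `hamming` and `jaccard`):

```python
import numpy as np

u = np.array([1, 0, 1, 1, 0], dtype=bool)
v = np.array([1, 1, 0, 1, 0], dtype=bool)

hamming = np.mean(u != v)                    # 2 mismatches out of 5 -> 0.4
jaccard = 1 - (u & v).sum() / (u | v).sum()  # 1 - 2/4 -> 0.5
```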

Mahalanobis distance

  • Formula: d(x, y) = sqrt((x-y)^T S^{-1} (x-y)), where S is the covariance matrix.
  • Interpretation: scales and decorrelates features using their covariance. Accounts for correlated features and anisotropic variance.
  • Use when features are correlated and you want distances in units of standard deviation along principal axes.
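A NumPy sketch of that decorrelation effect, on made-up data where the second feature is a noisy copy of the first:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=500)
# Two strongly correlated features
X = np.column_stack([z, z + 0.1 * rng.normal(size=500)])

S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix

def mahalanobis(x, y, S_inv):
    diff = x - y
    return float(np.sqrt(diff @ S_inv @ diff))

# Two points at the SAME Euclidean distance from the origin:
along = mahalanobis(np.zeros(2), np.array([1.0, 1.0]), S_inv)    # along the correlation
across = mahalanobis(np.zeros(2), np.array([1.0, -1.0]), S_inv)  # against it
```

Moving with the correlated cloud is "cheap"; cutting across it is "far". That is exactly the structure plain Euclidean distance misses.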

Scaling options and their effects (the megaphone problem solved)

Why scale? Because many metrics (especially Euclidean) are sensitive to units and variance. If one feature ranges 0-10 and another 0-10,000, the second dominates distances.

Common scalers:

  • Standard scaler (z-score): (x - mean) / std. Centers to mean 0 and variance 1. Good default.
  • Min-max scaler: (x - min) / (max - min) maps into [0,1]. Keeps relative spacing but changes variance.
  • Robust scaler: center with median and scale by IQR. Resistant to outliers.
  • Unit vector normalization: scale each sample to unit norm. Often used before cosine similarity or when direction matters.
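The four scalers side by side in scikit-learn (tiny made-up matrix; the last row plants an outlier in the second column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, normalize

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 10_000.0]])  # outlier in the second column

Xz = StandardScaler().fit_transform(X)   # per column: mean 0, variance 1
Xmm = MinMaxScaler().fit_transform(X)    # per column: mapped into [0, 1]
Xr = RobustScaler().fit_transform(X)     # per column: median 0, scaled by IQR
Xn = normalize(X)                        # per ROW: unit L2 norm (for cosine)
```

Note that `RobustScaler` leaves the outlier visible but stops it from stretching every other value, which is why it pairs well with distance-based models on messy data.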

Practical guidance:

  1. If using Euclidean or RBF kernels, scale features to comparable variance (z-score or robust for outliers).
  2. If features have meaningful zero and bounded ranges, min-max can help preserve interpretability.
  3. For sparse high-dimensional data with counts (text), use tf-idf + unit-normalize before cosine similarity.
  4. For categorical features use appropriate encoding (one-hot or embeddings) and treat their scale carefully; sometimes weighting them by domain knowledge is needed.

How scaling affects k-NN predictions and calibration

  • Distance-weighted k-NN: predictions often use weights such as w = 1 / (d + eps) or w = exp(-d^2 / (2 sigma^2)). Scaling changes the magnitude of d, which changes weights dramatically.
  • Kernels (RBF, Laplacian) are functions of distance. Their bandwidth parameter interacts with feature scaling. A tiny scale makes all points seem far (or near), breaking the kernel.
  • Calibration: if scaling inflates distances, your model may produce overconfident or underconfident probability estimates. That breaks thresholding choices you made earlier when being cost-aware.

Quick example: suppose feature A ranges [0,1], feature B ranges [0,1000]. Without scaling, B controls neighbors, so any threshold to minimize false positives will mostly reflect B. After scaling, both features influence the calibrated probabilities.
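That flip can be demonstrated in a few lines (hypothetical points, chosen so the effect is easy to see): the query's nearest neighbor changes once both features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature A lives in [0, 1], feature B in [0, 1000]
X = np.array([
    [0.90, 506.0],   # far in A, very close in B
    [0.12, 600.0],   # very close in A, far in B
    [0.50, 100.0],
    [0.30, 900.0],
])
q = np.array([[0.11, 505.0]])  # query point

nn_raw = int(np.argmin(np.linalg.norm(X - q, axis=1)))       # B alone decides: row 0

scaler = StandardScaler().fit(X)
Xs, qs = scaler.transform(X), scaler.transform(q)
nn_scaled = int(np.argmin(np.linalg.norm(Xs - qs, axis=1)))  # A gets a vote: row 1
```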


Correlation, PCA, and Mahalanobis — when features play twin roles

If features are highly correlated, Euclidean distance double-counts information. Two correlated features that move together amplify the same signal. Mahalanobis distance fixes this by using the inverse covariance to decorrelate feature axes.

  • Use Mahalanobis if you know and trust your covariance estimate (beware small sample sizes producing noisy covariance matrices).
  • Alternatively, use PCA to rotate and reduce dimensions before distance computation. PCA on standardized features often yields better neighbor structure in high-dim spaces.

Note: curse of dimensionality still lurks. As dimensions grow, distances concentrate and nearest neighbors get less meaningful. Dimensionality reduction or feature selection becomes essential.
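The PCA-before-distance idea can be sketched as a scikit-learn pipeline (the synthetic data and the 95% variance cutoff are illustrative choices, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, rotate/reduce with PCA, then let k-NN measure distances
# in the decorrelated, lower-dimensional space
knn_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),   # keep components explaining 95% of variance
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)

X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=0)
knn_pca.fit(X, y)
acc = knn_pca.score(X, y)
```

Putting the scaler and PCA inside the pipeline also guarantees they are refit on each cross-validation fold, avoiding leakage.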


Practical checklist and recipes

  • Always examine feature scales and distributions. Plot histograms and pairwise scatterplots.
  • Choose metric by feature type: Euclidean/Lp for continuous, Hamming/Jaccard for binary/categorical, cosine for directional data.
  • Standardize continuous features before L2-based methods. Use robust scaling if outliers dominate.
  • For kernels, tune bandwidth after scaling. Try the median heuristic (bandwidth ~ median pairwise distance) as a starting point.
  • If features are correlated, test Mahalanobis or PCA-based distances.
  • When combining categorical and continuous features, consider mixed metrics (Gower distance) or transform categorical features thoughtfully and weight features.
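The median heuristic from the checklist, sketched in NumPy (random data stands in for your already-scaled features):

```python
import numpy as np

def median_heuristic_bandwidth(X):
    """Median pairwise Euclidean distance: a common RBF-bandwidth starting point."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(len(X), k=1)        # each distinct pair exactly once
    return float(np.median(dists[iu]))

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))                # pretend: already-scaled features
sigma = median_heuristic_bandwidth(X)
gamma = 1.0 / (2.0 * sigma ** 2)             # sklearn-style RBF gamma, if you need it
```

Because the heuristic is itself a function of pairwise distances, run it after scaling, never before.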

A safe k-NN pipeline (runnable scikit-learn version of the recipe; cont_cols and cat_cols are the column lists you define):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

# Scale continuous columns robustly; one-hot encode categoricals
preprocess = ColumnTransformer([
    ("cont", RobustScaler(), cont_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([
    ("prep", preprocess),
    ("knn", KNeighborsClassifier(metric="minkowski", p=2, weights="distance")),
])
# tune n_neighbors (k) and any bandwidth-like parameters via cross-validation

Quick thought experiments (to make this stick)

  • If you record the same room temperatures in Celsius instead of Fahrenheit, does neighbor-ness change? It should not, but it will if your model sees raw numbers and you forgot scaling.
  • Imagine text vectors: cosine similarity finds similar topics even if one document is longer. Euclidean would be fooled by length differences.

Questions to ask yourself when debugging a weird k-NN model:

  • Are a few features dominating predictions?
  • Do nearest neighbors look sensible visually in a 2D PCA plot?
  • How do probabilities change if I standardize? If they change a lot, scaling mattered.

Final takeaways — the mic drop

  • Distance is not holy; it is mutable, and you shape it by choosing a metric and scaling. Treat that choice like a hyperparameter.
  • Scaling changes both label predictions and the probabilities you may later calibrate and threshold for cost-sensitive decisions.
  • Use the right metric for your data type, standardize sensibly, consider correlations via Mahalanobis or PCA, and always tune kernel bandwidths after scaling.

If your model is ignoring useful features, it is probably because those features are whispering while one loud feature screams. Give them equal volume — or intentionally weight them based on domain knowledge — but at least make the choice consciously.

Now go scale something responsibly. Your k will thank you.
