Distance- and Kernel-Based Methods
Leverage neighborhood and kernel ideas with kNN and SVM for nonlinear decision boundaries.
Distance Metrics and Scaling Effects — The Secret Sauce Behind kNN and Kernels
Imagine your features are a band performing together. If one of them brings a megaphone and the others whisper, the megaphone will dictate the concert. That, my friend, is what happens when you fail to scale your features.
This lesson builds on your earlier adventures with k-NN for regression and classification and the careful art of thresholding and calibration. There we learned how to turn neighbor votes or distance-weighted sums into class predictions and probabilities. Now we answer the more foundational question: how do we measure neighbor-ness in the first place, and how does scaling warp those measurements like a funhouse mirror?
Why this matters (quick recap and motivation)
- k-NN and many kernel methods depend entirely on distances. If distances are broken, predictions are broken.
- Scaling affects not just classification labels but also distance-weighted probabilities, and therefore calibration and threshold decisions you learned earlier.
- Choosing the wrong metric or failing to scale is like using a ruler measured in pianos — confusing and unhelpful.
Ask yourself: do you want your model to listen to the signal in each feature, or just the loudest one? Scaling chooses which.
Common distance metrics (and when to use them)
Euclidean distance (L2)
- Formula: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
- Interpretation: straight-line distance in feature space; sensitive to large differences and scale.
- Use when features are continuous and comparably scaled.
Manhattan distance (L1)
- Formula: d(x, y) = sum_i |x_i - y_i|
- Interpretation: sum of absolute differences; more robust to one large coordinate than L2.
- Use when outliers in single features exist or when you want sparse influence.
Minkowski distance
- Generalization of L1 and L2 with parameter p. p=1 -> L1, p=2 -> L2.
Cosine distance / similarity
- Formula (similarity): cos(theta) = (x dot y) / (||x|| ||y||). Distance = 1 - similarity.
- Interpretation: measures angle, not magnitude. Great when direction matters more than scale.
- Use when features represent profiles, term frequencies, or any vector where magnitude is uninformative.
Hamming distance
- Counts mismatches for categorical or binary features.
- Use when features are nominal or binary and equal-weight mismatches matter.
Jaccard distance
- Formula: 1 - (intersection / union) for sparse binary sets.
- Use when presence/absence of attributes matters and overlap is more meaningful than raw mismatches.
Mahalanobis distance
- Formula: d(x, y) = sqrt((x - y)^T S^{-1} (x - y)), where S is the covariance matrix.
- Interpretation: scales and decorrelates features using their covariance. Accounts for correlated features and anisotropic variance.
- Use when features are correlated and you want distances in units of standard deviation along principal axes.
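Most of these metrics are available off the shelf in scipy.spatial.distance. A quick sketch with made-up points (the values are arbitrary; note the cosine result for two parallel vectors):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # y = 2x: same direction, different magnitude

print(distance.euclidean(x, y))       # sqrt(1 + 4 + 9) ≈ 3.742
print(distance.cityblock(x, y))       # |1| + |2| + |3| = 6.0 (Manhattan / L1)
print(distance.minkowski(x, y, p=3))  # Minkowski with p = 3
print(distance.cosine(x, y))          # ≈ 0: angle is zero, magnitude ignored

# Mahalanobis needs an inverse covariance estimated from data
X = np.random.default_rng(0).normal(size=(200, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```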
Scaling options and their effects (the megaphone problem solved)
Why scale? Because many metrics (especially Euclidean) are sensitive to units and variance. If one feature ranges 0-10 and another 0-10,000, the second dominates distances.
Common scalers:
- Standard scaler (z-score): (x - mean) / std. Centers to mean 0 and variance 1. Good default.
- Min-max scaler: (x - min) / (max - min). Maps into [0, 1]. Keeps relative spacing but changes variance.
- Robust scaler: center with median and scale by IQR. Resistant to outliers.
- Unit vector normalization: scale each sample to unit norm. Often used before cosine similarity or when direction matters.
Practical guidance:
- If using Euclidean or RBF kernels, scale features to comparable variance (z-score or robust for outliers).
- If features have meaningful zero and bounded ranges, min-max can help preserve interpretability.
- For sparse high-dimensional data with counts (text), use tf-idf + unit-normalize before cosine similarity.
- For categorical features use appropriate encoding (one-hot or embeddings) and treat their scale carefully; sometimes weighting them by domain knowledge is needed.
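To see the megaphone problem concretely, here is a toy sketch (the points and query are invented) in which the nearest neighbor flips once both features are z-scored:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# feature A lives in [0, 1]; feature B lives around 100
X = np.array([[0.00, 100.0],
              [1.00, 105.0]])
query = np.array([0.05, 104.0])

def nearest(points, q):
    """Index of the Euclidean nearest neighbor of q among points."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

print(nearest(X, query))  # 1: feature B's big numbers dominate the distance

scaler = StandardScaler().fit(X)
Xs = scaler.transform(X)
qs = scaler.transform(query.reshape(1, -1))[0]
print(nearest(Xs, qs))    # 0: after z-scoring, feature A finally gets a vote
```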
How scaling affects k-NN predictions and calibration
- Distance-weighted k-NN: predictions often use weights such as w = 1 / (d + eps) or w = exp(-d^2 / (2 sigma^2)). Scaling changes the magnitude of d, which changes weights dramatically.
- Kernels (RBF, Laplacian) are functions of distance. Their bandwidth parameter interacts with feature scaling. A tiny scale makes all points seem far (or near), breaking the kernel.
- Calibration: if scaling inflates distances, your model may produce overconfident or underconfident probability estimates. That breaks thresholding choices you made earlier when being cost-aware.
Quick example: suppose feature A ranges [0,1], feature B ranges [0,1000]. Without scaling, B controls neighbors, so any threshold to minimize false positives will mostly reflect B. After scaling, both features influence the calibrated probabilities.
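The weight formula above makes this easy to demonstrate. In this sketch the distances d_raw and d_scaled are hypothetical, standing in for the same neighbor pair measured before and after z-scoring:

```python
import math

def rbf_weight(d, sigma=1.0):
    """RBF-style neighbor weight w = exp(-d^2 / (2 * sigma^2))."""
    return math.exp(-d**2 / (2 * sigma**2))

d_raw = 885.0     # hypothetical: dominated by an unscaled 0-1000 feature
d_scaled = 1.2    # hypothetical: same pair after both features are z-scored

print(rbf_weight(d_raw))     # underflows to 0.0: the neighbor is silenced
print(rbf_weight(d_scaled))  # ≈ 0.487: the neighbor has real influence
```

Any probabilities built from such weights inherit this distortion, which is why calibration and thresholds must be revisited after any change of scaling.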
Correlation, PCA, and Mahalanobis — when features play twin roles
If features are highly correlated, Euclidean distance double-counts information. Two correlated features that move together amplify the same signal. Mahalanobis distance fixes this by using the inverse covariance to decorrelate feature axes.
- Use Mahalanobis if you know and trust your covariance estimate (beware small sample sizes producing noisy covariance matrices).
- Alternatively, use PCA to rotate and reduce dimensions before distance computation. PCA on standardized features often yields better neighbor structure in high-dim spaces.
Note: curse of dimensionality still lurks. As dimensions grow, distances concentrate and nearest neighbors get less meaningful. Dimensionality reduction or feature selection becomes essential.
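A small sketch of the decorrelation effect (synthetic data; the covariance is estimated from the sample): two displacements with identical Euclidean length get very different Mahalanobis lengths, because one runs along the correlation direction and the other against it.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500)])  # strongly correlated pair

S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # noisy for small samples!

def mahalanobis(x, y, S_inv):
    diff = x - y
    return float(np.sqrt(diff @ S_inv @ diff))

origin = np.zeros(2)
along = np.array([1.0, 1.0])    # moves with the correlation: "expected"
across = np.array([1.0, -1.0])  # moves against it: "surprising"

print(mahalanobis(along, origin, S_inv))   # small
print(mahalanobis(across, origin, S_inv))  # much larger, same Euclidean length
```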
Practical checklist and recipes
- Always examine feature scales and distributions. Plot histograms and pairwise scatterplots.
- Choose metric by feature type: Euclidean/Lp for continuous, Hamming/Jaccard for binary/categorical, cosine for directional data.
- Standardize continuous features before L2-based methods. Use robust scaling if outliers dominate.
- For kernels, tune bandwidth after scaling. Try the median heuristic (bandwidth ~ median pairwise distance) as a starting point.
- If features are correlated, test Mahalanobis or PCA-based distances.
- When combining categorical and continuous features, consider mixed metrics (Gower distance) or transform categorical features thoughtfully and weight features.
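The median heuristic from the checklist takes only a few lines. This sketch uses synthetic data with wildly unequal feature scales, and converts sigma to scikit-learn's gamma parameterization of the RBF kernel, exp(-gamma * d^2):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 100.0], size=(300, 2))  # unequal scales

Xs = StandardScaler().fit_transform(X)  # scale FIRST, then pick the bandwidth
sigma = np.median(pdist(Xs))            # bandwidth ~ median pairwise distance
gamma = 1.0 / (2 * sigma**2)            # equivalent RBF gamma
print(sigma, gamma)
```

It is a starting point for cross-validation, not a final answer.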
A runnable version of a safe k-NN pipeline (scikit-learn; continuous_cols and categorical_cols are your own column lists):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

prep = ColumnTransformer([
    ("cont", RobustScaler(), continuous_cols),  # robust to outliers
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = make_pipeline(prep, KNeighborsClassifier(
    metric="minkowski", p=2, weights="distance"))
# tune n_neighbors (and any bandwidth-like parameters) via cross-validation
Quick thought experiments (to make this stick)
- If you convert a room's temperature readings from Celsius to Fahrenheit, does neighbor-ness change? Only if your model sees temperature as raw numbers and you forgot scaling.
- Imagine text vectors: cosine similarity finds similar topics even if one document is longer. Euclidean would be fooled by length differences.
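That text-vector intuition is easy to verify with toy count vectors (made up here): repeating a document doubles its length but leaves its direction, and hence cosine distance, unchanged, while Euclidean distance grows with length.

```python
import numpy as np
from scipy.spatial import distance

doc = np.array([3.0, 0.0, 2.0, 1.0])  # term counts for a short document
doc_twice = 2 * doc                    # the same text, repeated twice

print(distance.cosine(doc, doc_twice))     # ≈ 0: same topic, same direction
print(distance.euclidean(doc, doc_twice))  # = ||doc|| ≈ 3.742: fooled by length
```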
Questions to ask yourself when debugging a weird k-NN model:
- Are a few features dominating predictions?
- Do nearest neighbors look sensible visually in a 2D PCA plot?
- How do probabilities change if I standardize? If they change a lot, scaling mattered.
Final takeaways — the mic drop
- Distance is not holy; it is mutable, and you shape it by choosing a metric and scaling. Treat that choice like a hyperparameter.
- Scaling changes both label predictions and the probabilities you may later calibrate and threshold for cost-sensitive decisions.
- Use the right metric for your data type, standardize sensibly, consider correlations via Mahalanobis or PCA, and always tune kernel bandwidths after scaling.
If your model is ignoring useful features, it is probably because those features are whispering while one loud feature screams. Give them equal volume — or intentionally weight them based on domain knowledge — but at least make the choice consciously.
Now go scale something responsibly. Your k will thank you.