Distance- and Kernel-Based Methods
Leverage neighborhood and kernel ideas with kNN and SVM for nonlinear decision boundaries.
Content
k-Nearest Neighbors for Classification
Versions:
Watch & Learn
AI-discovered learning video
Sign in to watch the learning video for this topic.
k-Nearest Neighbors for Classification — The Neighborhood Watch of Machine Learning
“Tell me who your neighbors are and I’ll tell you what class you are.” — Probably not an ancient proverb, but definitely k-NN.
You already met k-NN in regression (we smoothed numeric outcomes by averaging neighbors). Now we move to classification: same neighborhood, different party rules. Instead of averaging house prices, we decide who gets invited to the label party. This lesson plugs directly into the calibration and thresholding tools you learned in Classification II — because k-NN gives you a natural probability estimate (fraction of neighbors) that might need calibration and cost-aware thresholding.
Quick intuition (no fluff): how k-NN classifies
Given a query point x0:
- Find the k closest training points to x0 (according to some distance metric).
- For binary classification (labels 0/1), compute the fraction of neighbors with label 1: p̂(1 | x0) = (1/k) ∑ I(y_i = 1).
- Predict class 1 if p̂ ≥ 0.5 (or some other threshold t). For multiclass, pick the mode (largest fraction).
This is the raw version. We can spice it up with distance-weighting, kernel-based weights, and probability calibration.
What’s new vs. k-NN regression (and why it matters)
- Output type: Regression produced a numeric average; classification produces discrete classes and empirical class probabilities.
- Decision thresholding matters: Unlike regression where we’re done with the predicted number, in classification we must choose thresholds based on costs — remember decision curves and net benefit? k-NN gives you p̂ to feed into that machinery.
- Calibration: p̂ = fraction-of-neighbors is an empirical probability. It can be miscalibrated (especially with small k or class imbalance), so calibration plots and Platt scaling / isotonic regression are still relevant.
Formal formulas (so your brain can do math too)
- Unweighted k-NN class probability (binary):
p̂(1 | x0) = (1/k) * sum_{i in N_k(x0)} I(y_i = 1)
- Distance-weighted k-NN:
p̂(1 | x0) = sum_{i in N_k(x0)} w_i * I(y_i = 1) / sum_{i in N_k(x0)} w_i
where w_i = K(d(x0, x_i) / h) (e.g., K(u) = exp(-u^2/2) or K(u)=1/(u+ε))
Weighted voting makes nearer neighbors more influential — good when some neighbors are much closer than others.
Kernel choices vs. k-selection — the dynamic duo
k-NN can be seen as a local kernel smoother with an adaptive neighborhood (k fixed, radius varies). You can also use kernel functions to weight neighbors. Here’s a quick taste:
| Method | Intuition | When to use |
|---|---|---|
| Uniform (plain k-NN) | Each neighbor gets equal vote | Simple, robust if k chosen well |
| Distance-weighted (e.g., 1/d) | Closer neighbors matter more | When neighbor distances vary a lot |
| Gaussian kernel | Smoothly decays influence with distance | Preferable when features are continuous and noise exists |
Choice of k: small k → low bias, high variance; large k → high bias, low variance. Select k by cross-validation using a metric aligned with your goal (AUC, F1, or even net benefit at a given threshold!).
Calibration & thresholding — the sequel you asked for
k-NN gives p̂ directly, but: is that probability well-calibrated? Not always.
- Small k = p̂ is discrete (0, 1/k, 2/k, ...), noisy, generally overconfident locally.
- Class imbalance can skew neighborhood compositions so raw p̂ underestimates/overestimates true class probabilities.
What to do:
- Use calibration plots (reliability diagrams) to inspect p̂ vs. observed frequency. Link this to your earlier module.
- If miscalibrated, apply Platt scaling (sigmoid) or isotonic regression on hold-out data.
- Choose your decision threshold based on costs: compute expected utility or net benefit curve and pick threshold t maximizing net benefit — exactly as we practiced.
Practical tips & gotchas (the spicy, real-world part)
- Scale your features. k-NN is distance-based — unscaled features dominate. Standardize or use robust scaling.
- Pick the right distance metric. Euclidean by default, but Manhattan or Mahalanobis can help. For categorical features, consider Hamming or mixed distances.
- Beware the curse of dimensionality. Distances concentrate in high dimensions → neighbors become meaningless. Dimensionality reduction (PCA, feature selection) or metric learning helps.
- Class imbalance. Oversample/minority weighting or use class-specific thresholds. k-NN’s probability can be biased toward the majority unless corrected.
- Tie-breaking. With even k, ties can happen. Use odd k for binary or break ties by distance-weighted vote.
- Compute & memory cost. k-NN is lazy: O(n) per prediction. Use KD-trees / Ball-trees for low-dim, or approximate nearest-neighbor libraries (FAISS, Annoy) for high-dim / large n.
How to choose k (a short recipe)
- Define metric aligned with your goal: AUC? F1? Net benefit at threshold t?
- Do stratified cross-validation.
- For each k in a reasonable grid (e.g., 1, 3, 5, 10, 20, sqrt(n), n/10), evaluate your metric.
- If you need calibrated probabilities, evaluate calibration error (e.g., Brier score) and maybe calibrate on a hold-out fold.
- Select k that balances predictive performance and calibration/utility.
Pseudocode: classification with optional weighting and cross-validated k
# Pseudocode
for fold in CV:
for k in candidate_k:
for x_val in val_set:
neighbors = find_k_neighbors(x_val, k)
if weighted:
weights = [kernel(distance(x_val, xi)) for xi in neighbors]
p_hat = sum(wi * I(yi==1) for wi, yi in zip(weights, y_neighbors)) / sum(weights)
else:
p_hat = sum(I(yi==1) for yi in y_neighbors) / k
predict = p_hat >= threshold # threshold could be 0.5 or cost-derived
compute_metric_over_fold
choose_k_with_best_avg_metric
A few memorable metaphors
- Small k is like asking the nearest 1–3 friends for dinner advice: very local, maybe noisy, but personal.
- Large k is like polling your whole apartment building: more stable, but you might get bland consensus.
- Distance-weighting is like giving louder mic to the friend standing right next to you.
Final checklist before you ship k-NN to production
- Features scaled appropriately
- Distance metric chosen/validated
- k tuned with cross-validation using the correct metric (or net benefit)
- Probabilities inspected with calibration plots
- Threshold chosen with cost-aware decision rules
- Performance tested with approximate NN if production needs speed
Closing mic drop
k-NN is elegant, intuitive, and maddeningly simple. It gives you immediate probability estimates — which means you can plug it straight into the calibration and thresholding ecosystem you already mastered. But remember: in high dimensions or with skewed classes, those cozy neighbor assumptions can betray you. Treat k-NN like a neighborly tip: great for local wisdom, but sometimes you need the mayor (a parametric model) for city-level policy.
Go forth: tune k like it’s a vinyl record, check your calibration like you check your oil, and always scale your features like you moisturize — consistently and religiously.
Comments (0)
Please sign in to leave a comment.
No comments yet. Be the first to comment!