jypi
  • Explore
ChatWays to LearnMind mapAbout

jypi

  • About Us
  • Our Mission
  • Team
  • Careers

Resources

  • Ways to Learn
  • Mind map
  • Blog
  • Help Center
  • Community Guidelines
  • Contributor Guide

Legal

  • Terms of Service
  • Privacy Policy
  • Cookie Policy
  • Content Policy

Connect

  • Twitter
  • Discord
  • Instagram
  • Contact Us
jypi

© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification
Chapters

1Foundations of Supervised Learning

2Data Wrangling and Feature Engineering

3Exploratory Data Analysis for Predictive Modeling

4Train/Validation/Test and Cross-Validation Strategies

5Regression I: Linear Models

6Regression II: Regularization and Advanced Techniques

7Classification I: Logistic Regression and Probabilistic View

8Classification II: Thresholding, Calibration, and Metrics

9Distance- and Kernel-Based Methods

k-Nearest Neighbors for Regressionk-Nearest Neighbors for ClassificationDistance Metrics and Scaling EffectsCurse of DimensionalityEfficient Neighbor Search StructuresKernel Trick IntuitionSVM for ClassificationSVM for Regression (SVR)Linear vs Nonlinear KernelsHyperparameters C, gamma, epsilonMargin Maximization and Slack VariablesFeature Mapping vs Implicit KernelsHandling Nonseparable DataClass Weights in SVMsProbabilistic Outputs for SVM

10Tree-Based Models and Ensembles

11Handling Real-World Data Issues

12Dimensionality Reduction and Feature Selection

13Model Tuning, Pipelines, and Experiment Tracking

14Model Interpretability and Responsible AI

15Deployment, Monitoring, and Capstone Project

Courses/Supervised Machine Learning: Regression and Classification/Distance- and Kernel-Based Methods

Distance- and Kernel-Based Methods

25789 views

Leverage neighborhood and kernel ideas with kNN and SVM for nonlinear decision boundaries.

Content

2 of 15

k-Nearest Neighbors for Classification

k-NN Classification — Chaotic TA Edition
2127 views
intermediate
humorous
science
gpt-5-mini
2127 views

Versions:

k-NN Classification — Chaotic TA Edition

Watch & Learn

AI-discovered learning video

Sign in to watch the learning video for this topic.

Sign inSign up free

Start learning for free

Sign up to save progress, unlock study materials, and track your learning.

  • Bookmark content and pick up later
  • AI-generated study materials
  • Flashcards, timelines, and more
  • Progress tracking and certificates

Free to join · No credit card required

k-Nearest Neighbors for Classification — The Neighborhood Watch of Machine Learning

“Tell me who your neighbors are and I’ll tell you what class you are.” — Probably not an ancient proverb, but definitely k-NN.

You already met k-NN in regression (we smoothed numeric outcomes by averaging neighbors). Now we move to classification: same neighborhood, different party rules. Instead of averaging house prices, we decide who gets invited to the label party. This lesson plugs directly into the calibration and thresholding tools you learned in Classification II — because k-NN gives you a natural probability estimate (fraction of neighbors) that might need calibration and cost-aware thresholding.


Quick intuition (no fluff): how k-NN classifies

Given a query point x0:

  1. Find the k closest training points to x0 (according to some distance metric).
  2. For binary classification (labels 0/1), compute the fraction of neighbors with label 1: p̂(1 | x0) = (1/k) ∑ I(y_i = 1).
  3. Predict class 1 if p̂ ≥ 0.5 (or some other threshold t). For multiclass, pick the mode (largest fraction).

This is the raw version. We can spice it up with distance-weighting, kernel-based weights, and probability calibration.


What’s new vs. k-NN regression (and why it matters)

  • Output type: Regression produced a numeric average; classification produces discrete classes and empirical class probabilities.
  • Decision thresholding matters: Unlike regression where we’re done with the predicted number, in classification we must choose thresholds based on costs — remember decision curves and net benefit? k-NN gives you p̂ to feed into that machinery.
  • Calibration: p̂ = fraction-of-neighbors is an empirical probability. It can be miscalibrated (especially with small k or class imbalance), so calibration plots and Platt scaling / isotonic regression are still relevant.

Formal formulas (so your brain can do math too)

  • Unweighted k-NN class probability (binary):
p̂(1 | x0) = (1/k) * sum_{i in N_k(x0)} I(y_i = 1)
  • Distance-weighted k-NN:
p̂(1 | x0) = sum_{i in N_k(x0)} w_i * I(y_i = 1) / sum_{i in N_k(x0)} w_i
where w_i = K(d(x0, x_i) / h)  (e.g., K(u) = exp(-u^2/2) or K(u)=1/(u+ε))

Weighted voting makes nearer neighbors more influential — good when some neighbors are much closer than others.


Kernel choices vs. k-selection — the dynamic duo

k-NN can be seen as a local kernel smoother with an adaptive neighborhood (k fixed, radius varies). You can also use kernel functions to weight neighbors. Here’s a quick taste:

Method Intuition When to use
Uniform (plain k-NN) Each neighbor gets equal vote Simple, robust if k chosen well
Distance-weighted (e.g., 1/d) Closer neighbors matter more When neighbor distances vary a lot
Gaussian kernel Smoothly decays influence with distance Preferable when features are continuous and noise exists

Choice of k: small k → low bias, high variance; large k → high bias, low variance. Select k by cross-validation using a metric aligned with your goal (AUC, F1, or even net benefit at a given threshold!).


Calibration & thresholding — the sequel you asked for

k-NN gives p̂ directly, but: is that probability well-calibrated? Not always.

  • Small k = p̂ is discrete (0, 1/k, 2/k, ...), noisy, generally overconfident locally.
  • Class imbalance can skew neighborhood compositions so raw p̂ underestimates/overestimates true class probabilities.

What to do:

  • Use calibration plots (reliability diagrams) to inspect p̂ vs. observed frequency. Link this to your earlier module.
  • If miscalibrated, apply Platt scaling (sigmoid) or isotonic regression on hold-out data.
  • Choose your decision threshold based on costs: compute expected utility or net benefit curve and pick threshold t maximizing net benefit — exactly as we practiced.

Practical tips & gotchas (the spicy, real-world part)

  1. Scale your features. k-NN is distance-based — unscaled features dominate. Standardize or use robust scaling.
  2. Pick the right distance metric. Euclidean by default, but Manhattan or Mahalanobis can help. For categorical features, consider Hamming or mixed distances.
  3. Beware the curse of dimensionality. Distances concentrate in high dimensions → neighbors become meaningless. Dimensionality reduction (PCA, feature selection) or metric learning helps.
  4. Class imbalance. Oversample/minority weighting or use class-specific thresholds. k-NN’s probability can be biased toward the majority unless corrected.
  5. Tie-breaking. With even k, ties can happen. Use odd k for binary or break ties by distance-weighted vote.
  6. Compute & memory cost. k-NN is lazy: O(n) per prediction. Use KD-trees / Ball-trees for low-dim, or approximate nearest-neighbor libraries (FAISS, Annoy) for high-dim / large n.

How to choose k (a short recipe)

  1. Define metric aligned with your goal: AUC? F1? Net benefit at threshold t?
  2. Do stratified cross-validation.
  3. For each k in a reasonable grid (e.g., 1, 3, 5, 10, 20, sqrt(n), n/10), evaluate your metric.
  4. If you need calibrated probabilities, evaluate calibration error (e.g., Brier score) and maybe calibrate on a hold-out fold.
  5. Select k that balances predictive performance and calibration/utility.

Pseudocode: classification with optional weighting and cross-validated k

# Pseudocode
for fold in CV:
    for k in candidate_k:
        for x_val in val_set:
            neighbors = find_k_neighbors(x_val, k)
            if weighted:
                weights = [kernel(distance(x_val, xi)) for xi in neighbors]
                p_hat = sum(wi * I(yi==1) for wi, yi in zip(weights, y_neighbors)) / sum(weights)
            else:
                p_hat = sum(I(yi==1) for yi in y_neighbors) / k
            predict = p_hat >= threshold  # threshold could be 0.5 or cost-derived
        compute_metric_over_fold
choose_k_with_best_avg_metric

A few memorable metaphors

  • Small k is like asking the nearest 1–3 friends for dinner advice: very local, maybe noisy, but personal.
  • Large k is like polling your whole apartment building: more stable, but you might get bland consensus.
  • Distance-weighting is like giving louder mic to the friend standing right next to you.

Final checklist before you ship k-NN to production

  • Features scaled appropriately
  • Distance metric chosen/validated
  • k tuned with cross-validation using the correct metric (or net benefit)
  • Probabilities inspected with calibration plots
  • Threshold chosen with cost-aware decision rules
  • Performance tested with approximate NN if production needs speed

Closing mic drop

k-NN is elegant, intuitive, and maddeningly simple. It gives you immediate probability estimates — which means you can plug it straight into the calibration and thresholding ecosystem you already mastered. But remember: in high dimensions or with skewed classes, those cozy neighbor assumptions can betray you. Treat k-NN like a neighborly tip: great for local wisdom, but sometimes you need the mayor (a parametric model) for city-level policy.

Go forth: tune k like it’s a vinyl record, check your calibration like you check your oil, and always scale your features like you moisturize — consistently and religiously.

Flashcards
Mind Map
Speed Challenge

Comments (0)

Please sign in to leave a comment.

No comments yet. Be the first to comment!

Ready to practice?

Sign up now to study with flashcards, practice questions, and more — and track your progress on this topic.

Study with flashcards, timelines, and more
Earn certificates for completed courses
Bookmark content for later reference
Track your progress across all topics