Supervised Machine Learning: Regression and Classification

Chapters

1Foundations of Supervised Learning

2Data Wrangling and Feature Engineering

3Exploratory Data Analysis for Predictive Modeling

4Train/Validation/Test and Cross-Validation Strategies

5Regression I: Linear Models

6Regression II: Regularization and Advanced Techniques

7Classification I: Logistic Regression and Probabilistic View

8Classification II: Thresholding, Calibration, and Metrics

9Distance- and Kernel-Based Methods

k-Nearest Neighbors for Regression k-Nearest Neighbors for Classification Distance Metrics and Scaling Effects Curse of Dimensionality Efficient Neighbor Search Structures Kernel Trick Intuition SVM for Classification SVM for Regression (SVR)Linear vs Nonlinear Kernels Hyperparameters C, gamma, epsilon Margin Maximization and Slack Variables Feature Mapping vs Implicit Kernels Handling Nonseparable Data Class Weights in SVMs Probabilistic Outputs for SVM

10Tree-Based Models and Ensembles

11Handling Real-World Data Issues

12Dimensionality Reduction and Feature Selection

13Model Tuning, Pipelines, and Experiment Tracking

14Model Interpretability and Responsible AI

15Deployment, Monitoring, and Capstone Project

Courses/Supervised Machine Learning: Regression and Classification/Distance- and Kernel-Based Methods

Distance- and Kernel-Based Methods

25789 views

Leverage neighborhood and kernel ideas with kNN and SVM for nonlinear decision boundaries.

Content

2 of 15

k-Nearest Neighbors for Classification

k-NN Classification — Chaotic TA Edition

2127 views

intermediate

humorous

science

gpt-5-mini

2127 views

Versions:

k-NN Classification — Chaotic TA Edition

Watch & Learn

AI-discovered learning video

Start learning for free

Bookmark content and pick up later
AI-generated study materials
Flashcards, timelines, and more
Progress tracking and certificates

Free to join · No credit card required

k-Nearest Neighbors for Classification — The Neighborhood Watch of Machine Learning

“Tell me who your neighbors are and I’ll tell you what class you are.” — Probably not an ancient proverb, but definitely k-NN.

You already met k-NN in regression (we smoothed numeric outcomes by averaging neighbors). Now we move to classification: same neighborhood, different party rules. Instead of averaging house prices, we decide who gets invited to the label party. This lesson plugs directly into the calibration and thresholding tools you learned in Classification II — because k-NN gives you a natural probability estimate (fraction of neighbors) that might need calibration and cost-aware thresholding.

Quick intuition (no fluff): how k-NN classifies

Given a query point x0:

Find the k closest training points to x0 (according to some distance metric).
For binary classification (labels 0/1), compute the fraction of neighbors with label 1: p̂(1 | x0) = (1/k) ∑ I(y_i = 1).
Predict class 1 if p̂ ≥ 0.5 (or some other threshold t). For multiclass, pick the mode (largest fraction).

This is the raw version. We can spice it up with distance-weighting, kernel-based weights, and probability calibration.

What’s new vs. k-NN regression (and why it matters)

Output type: Regression produced a numeric average; classification produces discrete classes and empirical class probabilities.
Decision thresholding matters: Unlike regression where we’re done with the predicted number, in classification we must choose thresholds based on costs — remember decision curves and net benefit? k-NN gives you p̂ to feed into that machinery.
Calibration: p̂ = fraction-of-neighbors is an empirical probability. It can be miscalibrated (especially with small k or class imbalance), so calibration plots and Platt scaling / isotonic regression are still relevant.

Formal formulas (so your brain can do math too)

Unweighted k-NN class probability (binary):

p̂(1 | x0) = (1/k) * sum_{i in N_k(x0)} I(y_i = 1)

Distance-weighted k-NN:

p̂(1 | x0) = sum_{i in N_k(x0)} w_i * I(y_i = 1) / sum_{i in N_k(x0)} w_i
where w_i = K(d(x0, x_i) / h)  (e.g., K(u) = exp(-u^2/2) or K(u)=1/(u+ε))

Weighted voting makes nearer neighbors more influential — good when some neighbors are much closer than others.

Kernel choices vs. k-selection — the dynamic duo

k-NN can be seen as a local kernel smoother with an adaptive neighborhood (k fixed, radius varies). You can also use kernel functions to weight neighbors. Here’s a quick taste:

Method	Intuition	When to use
Uniform (plain k-NN)	Each neighbor gets equal vote	Simple, robust if k chosen well
Distance-weighted (e.g., 1/d)	Closer neighbors matter more	When neighbor distances vary a lot
Gaussian kernel	Smoothly decays influence with distance	Preferable when features are continuous and noise exists

Choice of k: small k → low bias, high variance; large k → high bias, low variance. Select k by cross-validation using a metric aligned with your goal (AUC, F1, or even net benefit at a given threshold!).

Calibration & thresholding — the sequel you asked for

k-NN gives p̂ directly, but: is that probability well-calibrated? Not always.

Small k = p̂ is discrete (0, 1/k, 2/k, ...), noisy, generally overconfident locally.
Class imbalance can skew neighborhood compositions so raw p̂ underestimates/overestimates true class probabilities.

What to do:

Use calibration plots (reliability diagrams) to inspect p̂ vs. observed frequency. Link this to your earlier module.
If miscalibrated, apply Platt scaling (sigmoid) or isotonic regression on hold-out data.
Choose your decision threshold based on costs: compute expected utility or net benefit curve and pick threshold t maximizing net benefit — exactly as we practiced.

Practical tips & gotchas (the spicy, real-world part)

Scale your features. k-NN is distance-based — unscaled features dominate. Standardize or use robust scaling.
Pick the right distance metric. Euclidean by default, but Manhattan or Mahalanobis can help. For categorical features, consider Hamming or mixed distances.
Beware the curse of dimensionality. Distances concentrate in high dimensions → neighbors become meaningless. Dimensionality reduction (PCA, feature selection) or metric learning helps.
Class imbalance. Oversample/minority weighting or use class-specific thresholds. k-NN’s probability can be biased toward the majority unless corrected.
Tie-breaking. With even k, ties can happen. Use odd k for binary or break ties by distance-weighted vote.
Compute & memory cost. k-NN is lazy: O(n) per prediction. Use KD-trees / Ball-trees for low-dim, or approximate nearest-neighbor libraries (FAISS, Annoy) for high-dim / large n.

How to choose k (a short recipe)

Define metric aligned with your goal: AUC? F1? Net benefit at threshold t?
Do stratified cross-validation.
For each k in a reasonable grid (e.g., 1, 3, 5, 10, 20, sqrt(n), n/10), evaluate your metric.
If you need calibrated probabilities, evaluate calibration error (e.g., Brier score) and maybe calibrate on a hold-out fold.
Select k that balances predictive performance and calibration/utility.

Pseudocode: classification with optional weighting and cross-validated k

# Pseudocode
for fold in CV:
    for k in candidate_k:
        for x_val in val_set:
            neighbors = find_k_neighbors(x_val, k)
            if weighted:
                weights = [kernel(distance(x_val, xi)) for xi in neighbors]
                p_hat = sum(wi * I(yi==1) for wi, yi in zip(weights, y_neighbors)) / sum(weights)
            else:
                p_hat = sum(I(yi==1) for yi in y_neighbors) / k
            predict = p_hat >= threshold  # threshold could be 0.5 or cost-derived
        compute_metric_over_fold
choose_k_with_best_avg_metric

A few memorable metaphors

Small k is like asking the nearest 1–3 friends for dinner advice: very local, maybe noisy, but personal.
Large k is like polling your whole apartment building: more stable, but you might get bland consensus.
Distance-weighting is like giving louder mic to the friend standing right next to you.

Final checklist before you ship k-NN to production

Features scaled appropriately
Distance metric chosen/validated
k tuned with cross-validation using the correct metric (or net benefit)
Probabilities inspected with calibration plots
Threshold chosen with cost-aware decision rules
Performance tested with approximate NN if production needs speed

Closing mic drop

k-NN is elegant, intuitive, and maddeningly simple. It gives you immediate probability estimates — which means you can plug it straight into the calibration and thresholding ecosystem you already mastered. But remember: in high dimensions or with skewed classes, those cozy neighbor assumptions can betray you. Treat k-NN like a neighborly tip: great for local wisdom, but sometimes you need the mayor (a parametric model) for city-level policy.

Go forth: tune k like it’s a vinyl record, check your calibration like you check your oil, and always scale your features like you moisturize — consistently and religiously.

Flashcards

Mind Map

Speed Challenge

Comments (0)

Please sign in to leave a comment.

No comments yet. Be the first to comment!

Ready to practice?

Study with flashcards, timelines, and more

Earn certificates for completed courses

Bookmark content for later reference

Track your progress across all topics