Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Threshold Selection Strategies — Where Probabilities Go to Wear Costumes
"A probability isn't a decision. A threshold is the wardrobe change."
You already know how to get a probability out of a model (hello, logistic regression and friends). You also know how to visualize classifier behavior across thresholds using ROC and PR curves. Great — those were the rehearsals. Now we pick the outfit for opening night: the threshold. This note walks through principled, practical, and delightfully pragmatic ways to choose a threshold for binary classification.
Why thresholding deserves a moment of existential thought
- Your model spits out p = P(y=1 | x). That’s a probability, not a verdict. A threshold turns p into a yes/no call.
- Different thresholds change precision, recall, specificity, F1, and business outcomes. ROC/PR curves showed you the landscape — threshold selection chooses the vantage point.
- Bad thresholds = wasted effort, false alarms, missed opportunities, possibly regulatory trouble. Choose carefully.
The core decision-theory rule (aka the adult way to set a threshold)
If false positives cost c_fp and false negatives cost c_fn, minimize expected cost by predicting positive when:
p > c_fp / (c_fp + c_fn)
Equivalently, using odds:
p/(1-p) > c_fp / c_fn
Why this works: predicting positive risks a false positive with probability (1-p), for an expected cost of (1-p)·c_fp; predicting negative risks a false negative with probability p, for an expected cost of p·c_fn. Predict positive whenever the first expected cost is smaller; rearranging that comparison gives the rule above.
This is Bayes-style decision making. It depends on your costs, not on some arbitrary 0.5.
Practical tip: always convert business penalties into relative costs (c_fp vs c_fn). If a missed hospital readmission is catastrophic and a false alarm is cheap, c_fn >> c_fp, so set the threshold low.
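As a minimal sketch of the rule, with invented example costs (not from any real deployment):

```python
def cost_threshold(c_fp: float, c_fn: float) -> float:
    """Bayes decision rule: predict positive when p exceeds this value."""
    return c_fp / (c_fp + c_fn)

# Readmission-style example with hypothetical costs: a missed case is 9x
# worse than a false alarm, so the threshold drops far below 0.5.
print(cost_threshold(c_fp=1.0, c_fn=9.0))  # 0.1
```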
Simple strategies you’ll actually use in the wild
- Fixed default (0.5)
- Pros: simple. Cons: assumes calibrated probabilities and balanced costs/classes. Often wrong.
- Maximize a metric on validation set (F1, accuracy, MCC)
- Compute metric for many thresholds; pick argmax. Works if metric reflects business goal.
- Youden's J (ROC-based)
- Choose threshold maximizing Sensitivity + Specificity - 1 (TPR - FPR). Good when you treat errors symmetrically.
- Minimize distance to top-left on ROC
- Choose threshold minimizing sqrt((1-TPR)^2 + FPR^2). Geometric heuristic.
- PR-based selection
- If classes are imbalanced and precision matters, use PR curve to find threshold giving required precision or recall.
- Cost-based threshold (see decision-theory above)
- Use when you can quantify costs.
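As one concrete example from the list above, Youden's J drops straight out of scikit-learn's `roc_curve`; the toy labels and scores below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy validation data (invented): 5 positives, 5 negatives.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.85, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
j = tpr - fpr                     # Youden's J at each candidate threshold
best = int(np.argmax(j))
print(thresholds[best], j[best])  # chosen threshold and its J statistic
```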
When to use ROC vs PR for picking thresholds
- ROC-based rules (Youden, min-distance) assume roughly equal class importance and are insensitive to class imbalance.
- PR-based selection is better when positive class is rare and you care about precision/recall trade-offs. A high ROC AUC can hide bad precision at relevant recall levels.
Think: ROC tells you the ability to rank positives above negatives; PR tells you how many of the things you call positive are actually positive. Use PR when false-positives are painful or positives are rare.
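A sketch of PR-based selection, assuming you need a minimum precision (the data and the 0.8 target are invented for illustration): among thresholds that meet the precision floor, take the one with the highest recall.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy imbalanced data (invented): 3 positives among 10 examples.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
target_precision = 0.8
ok = precision[:-1] >= target_precision   # final (1, 0) point has no threshold
best_t = thresholds[ok][np.argmax(recall[:-1][ok])]
print(best_t)  # threshold meeting the precision floor with max recall
```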
Calibration matters — don’t threshold on lies
If your model is miscalibrated, a predicted probability of 0.6 does not necessarily correspond to a 60% chance of the positive class. Thresholds that rely on the absolute value of p (like cost-based thresholds) require calibration.
Common calibration fixes:
- Platt scaling (sigmoid / parametric) — fits a logistic to model scores on validation data
- Isotonic regression — non-parametric, more flexible but needs more data
Always calibrate on a held-out set, then pick thresholds on another held-out set (or use nested CV). Otherwise you’ll overfit the threshold to noise.
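A minimal sketch of Platt scaling, assuming you already have raw validation scores and labels (both invented below): fit a one-feature logistic regression that maps scores to calibrated probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented validation data: informative but over-squashed scores.
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 2, size=200)
val_scores = 0.5 + 0.15 * (val_labels - 0.5) + 0.1 * rng.standard_normal(200)

# Platt scaling: a logistic regression on the 1-D score.
platt = LogisticRegression()
platt.fit(val_scores.reshape(-1, 1), val_labels)
calibrated = platt.predict_proba(val_scores.reshape(-1, 1))[:, 1]
```

In practice, scikit-learn's `CalibratedClassifierCV` packages both Platt (`method='sigmoid'`) and isotonic (`method='isotonic'`) calibration with the cross-validation handled for you.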
Algorithmic recipe (pseudocode) — pick threshold by maximizing F1
# inputs: val_probs (shape (N,)), val_labels (0/1 array of shape (N,))
import numpy as np
from sklearn.metrics import f1_score

thresholds = np.linspace(0, 1, 1000)
best_t, best_f1 = 0.0, -np.inf
for t in thresholds:
    preds = val_probs >= t
    f1 = f1_score(val_labels, preds, zero_division=0)
    if f1 > best_f1:
        best_f1 = f1
        best_t = t
# use best_t on test/production
Notes: repeat with cross-validation to estimate variability and avoid overfitting to a single validation split.
Advanced/robust approaches
- Cross-validated thresholding: pick thresholds in each fold then average or pick most frequent threshold
- Cost curves & decision curves: plot net benefit over thresholds to pick based on utility rather than metrics
- Per-group thresholds: different thresholds for subpopulations when base rates differ (be careful with fairness implications)
- Reject option (abstain): allow the model to say "I don't know" when p is near the threshold; route to human review
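The reject option in particular is easy to sketch; the band width of 0.1 below is an invented example value, not a recommendation:

```python
import numpy as np

def decide_with_reject(probs, threshold=0.5, band=0.1):
    """Return 1/0 decisions; -1 means abstain and route to human review."""
    probs = np.asarray(probs, dtype=float)
    decisions = np.where(probs >= threshold, 1, 0)
    # Abstain when the probability sits inside the uncertainty band.
    return np.where(np.abs(probs - threshold) < band, -1, decisions)

print(decide_with_reject([0.05, 0.45, 0.55, 0.9]))  # 0, -1, -1, 1
```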
Table — quick compare of selection strategies
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Default 0.5 | Quick prototypes | Simple | Often wrong with imbalance/costs |
| Max F1 / MCC | Metric-driven goals | Directly optimizes your metric | Overfits if no holdout; metric-dependent |
| Youden's J | Symmetric errors | Simple ROC-based | Ignores prevalence |
| Min-distance (ROC) | General tradeoff | Intuitive geometry | Not cost-aware |
| PR-based | Rare positives | Focuses on precision/recall | Can be noisy with few positives |
| Cost-based (Bayes) | Known costs | Decision-theory optimal | Needs quantifiable costs & calibration |
A practical checklist before you deploy
- Is the probability well-calibrated? If not, calibrate.
- Do you know the relative costs of FP and FN? If yes, use cost-based thresholding.
- If you must optimize a metric (F1, MCC), pick threshold on a held-out set or via CV.
- If positives are rare, prefer PR-guided thresholds over naive ROC heuristics.
- Compute confidence intervals for performance at the chosen threshold.
- Consider a reject option if misclassifications are costly.
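For the confidence-interval item in the checklist, a bootstrap over the test set is one common approach; the data, threshold, and replicate count below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented test set: probabilities loosely correlated with labels.
rng = np.random.default_rng(42)
n = 500
test_labels = rng.integers(0, 2, size=n)
test_probs = np.clip(
    0.5 + 0.3 * (test_labels - 0.5) + 0.2 * rng.standard_normal(n), 0, 1
)
threshold = 0.5  # assume this was chosen on a separate validation set

# Bootstrap the F1 score at the fixed threshold.
stats = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)   # resample with replacement
    preds = test_probs[idx] >= threshold
    stats.append(f1_score(test_labels[idx], preds))
lo, hi = np.percentile(stats, [2.5, 97.5])
print(f"F1 95% CI: [{lo:.3f}, {hi:.3f}]")
```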
Parting shot (a tiny rant and a tiny wisdom)
Choosing a threshold is the most human part of modeling: it requires values, priorities, and trade-offs. Your model gives you probabilities; your organization gives you consequences. Bring both to the table.
"Calibration gets you honesty. Thresholding gets you judgment. You need both."
Key takeaways:
- Use decision-theory (costs) when possible — it’s principled.
- Use PR-based thresholds for rare-event problems.
- Calibrate first, then threshold.
- Validate thresholds with held-out or cross-validated data to avoid overfitting.
Now go pick a threshold like you mean it — and if anyone says “just use 0.5,” ask them about their utility matrix and whether they enjoy false alarms.