Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Threshold Selection Strategies — Where Probabilities Go to Wear Costumes
"A probability isn't a decision. A threshold is the wardrobe change."
You already know how to get a probability out of a model (hello, logistic regression and friends). You also know how to visualize classifier behavior across thresholds using ROC and PR curves. Great — those were the rehearsals. Now we pick the outfit for opening night: the threshold. This note walks through principled, practical, and delightfully pragmatic ways to choose a threshold for binary classification.
Why thresholding deserves a moment of existential thought
- Your model spits out p = P(y=1 | x). That’s a probability, not a verdict. A threshold turns p into a yes/no call.
- Different thresholds change precision, recall, specificity, F1, and business outcomes. ROC/PR curves showed you the landscape — threshold selection chooses the vantage point.
- Bad thresholds = wasted effort, false alarms, missed opportunities, possibly regulatory trouble. Choose carefully.
The core decision-theory rule (aka the adult way to set a threshold)
If false positives cost c_fp and false negatives cost c_fn, minimize expected cost by predicting positive when:
p > c_fp / (c_fp + c_fn)
Equivalently, using odds:
p/(1-p) > c_fp / c_fn
Why this works: predicting positive risks a false positive with probability (1-p), for an expected cost of (1-p)·c_fp; predicting negative risks a false negative with probability p, for an expected cost of p·c_fn. Predict positive whenever the first expected cost is smaller; rearranging that comparison gives the rule above.
This is Bayes-style decision making. It depends on your costs, not on some arbitrary 0.5.
Practical tip: always convert business penalties into relative costs (c_fp vs c_fn). If a missed hospital readmission is catastrophic and a false alarm is cheap, c_fn >> c_fp, so set the threshold low.
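As a minimal sketch of the rule, with invented example costs (not from any real deployment):

```python
def cost_threshold(c_fp: float, c_fn: float) -> float:
    """Bayes decision rule: predict positive when p exceeds this value."""
    return c_fp / (c_fp + c_fn)

# Readmission-style example with hypothetical costs: a missed case is 9x
# worse than a false alarm, so the threshold drops far below 0.5.
print(cost_threshold(c_fp=1.0, c_fn=9.0))  # 0.1
```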
Simple strategies you’ll actually use in the wild
- Fixed default (0.5)
- Pros: simple. Cons: assumes calibrated probabilities and balanced costs/classes. Often wrong.
- Maximize a metric on validation set (F1, accuracy, MCC)
- Compute metric for many thresholds; pick argmax. Works if metric reflects business goal.
- Youden's J (ROC-based)
- Choose threshold maximizing Sensitivity + Specificity - 1 (TPR - FPR). Good when you treat errors symmetrically.
- Minimize distance to top-left on ROC
- Choose threshold minimizing sqrt((1-TPR)^2 + FPR^2). Geometric heuristic.
- PR-based selection
- If classes are imbalanced and precision matters, use PR curve to find threshold giving required precision or recall.
- Cost-based threshold (see decision-theory above)
- Use when you can quantify costs.
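As one concrete example from the list above, Youden's J drops straight out of scikit-learn's `roc_curve`; the toy labels and scores below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy validation data (invented): 5 positives, 5 negatives.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.85, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
j = tpr - fpr                     # Youden's J at each candidate threshold
best = int(np.argmax(j))
print(thresholds[best], j[best])  # chosen threshold and its J statistic
```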
When to use ROC vs PR for picking thresholds
- ROC-based rules (Youden, min-distance) assume roughly equal class importance and are insensitive to class imbalance.
- PR-based selection is better when positive class is rare and you care about precision/recall trade-offs. A high ROC AUC can hide bad precision at relevant recall levels.
Think: ROC tells you the ability to rank positives above negatives; PR tells you how many of the things you call positive are actually positive. Use PR when false-positives are painful or positives are rare.
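A sketch of PR-based selection, assuming you need a minimum precision (the data and the 0.8 target are invented for illustration): among thresholds that meet the precision floor, take the one with the highest recall.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy imbalanced data (invented): 3 positives among 10 examples.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
target_precision = 0.8
ok = precision[:-1] >= target_precision   # final (1, 0) point has no threshold
best_t = thresholds[ok][np.argmax(recall[:-1][ok])]
print(best_t)  # threshold meeting the precision floor with max recall
```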
Calibration matters — don’t threshold on lies
If your model is miscalibrated, a predicted probability of 0.6 does not necessarily correspond to a 60% chance of the positive class. Thresholds that rely on the absolute value of p (like cost-based thresholds) require calibration.
Common calibration fixes:
- Platt scaling (sigmoid / parametric) — fits a logistic to model scores on validation data
- Isotonic regression — non-parametric, more flexible but needs more data
Always calibrate on a held-out set, then pick thresholds on another held-out set (or use nested CV). Otherwise you’ll overfit the threshold to noise.
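A minimal sketch of Platt scaling, assuming you already have raw validation scores and labels (both invented below): fit a one-feature logistic regression that maps scores to calibrated probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented validation data: informative but over-squashed scores.
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 2, size=200)
val_scores = 0.5 + 0.15 * (val_labels - 0.5) + 0.1 * rng.standard_normal(200)

# Platt scaling: a logistic regression on the 1-D score.
platt = LogisticRegression()
platt.fit(val_scores.reshape(-1, 1), val_labels)
calibrated = platt.predict_proba(val_scores.reshape(-1, 1))[:, 1]
```

In practice, scikit-learn's `CalibratedClassifierCV` packages both Platt (`method='sigmoid'`) and isotonic (`method='isotonic'`) calibration with the cross-validation handled for you.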
Algorithmic recipe (pseudocode) — pick threshold by maximizing F1
# inputs: val_probs (shape (N,)), val_labels (0/1 array of shape (N,))
import numpy as np
from sklearn.metrics import f1_score

thresholds = np.linspace(0, 1, 1000)
best_t, best_f1 = 0.0, -np.inf
for t in thresholds:
    preds = val_probs >= t
    f1 = f1_score(val_labels, preds, zero_division=0)
    if f1 > best_f1:
        best_f1 = f1
        best_t = t
# use best_t on test/production
Notes: repeat with cross-validation to estimate variability and avoid overfitting to a single validation split.
Advanced/robust approaches
- Cross-validated thresholding: pick thresholds in each fold then average or pick most frequent threshold
- Cost curves & decision curves: plot net benefit over thresholds to pick based on utility rather than metrics
- Per-group thresholds: different thresholds for subpopulations when base rates differ (be careful with fairness implications)
- Reject option (abstain): allow the model to say "I don't know" when p is near the threshold; route to human review
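The reject option in particular is easy to sketch; the band width of 0.1 below is an invented example value, not a recommendation:

```python
import numpy as np

def decide_with_reject(probs, threshold=0.5, band=0.1):
    """Return 1/0 decisions; -1 means abstain and route to human review."""
    probs = np.asarray(probs, dtype=float)
    decisions = np.where(probs >= threshold, 1, 0)
    # Abstain when the probability sits inside the uncertainty band.
    return np.where(np.abs(probs - threshold) < band, -1, decisions)

print(decide_with_reject([0.05, 0.45, 0.55, 0.9]))  # 0, -1, -1, 1
```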
Table — quick compare of selection strategies
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Default 0.5 | Quick prototypes | Simple | Often wrong with imbalance/costs |
| Max F1 / MCC | Metric-driven goals | Directly optimizes your metric | Overfits if no holdout; metric-dependent |
| Youden's J | Symmetric errors | Simple ROC-based | Ignores prevalence |
| Min-distance (ROC) | General tradeoff | Intuitive geometry | Not cost-aware |
| PR-based | Rare positives | Focuses on precision/recall | Can be noisy with few positives |
| Cost-based (Bayes) | Known costs | Decision-theory optimal | Needs quantifiable costs & calibration |
A practical checklist before you deploy
- Is the probability well-calibrated? If not, calibrate.
- Do you know the relative costs of FP and FN? If yes, use cost-based thresholding.
- If you must optimize a metric (F1, MCC), pick threshold on a held-out set or via CV.
- If positives are rare, prefer PR-guided thresholds over naive ROC heuristics.
- Compute confidence intervals for performance at the chosen threshold.
- Consider a reject option if misclassifications are costly.
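For the confidence-interval item in the checklist, a bootstrap over the test set is one common approach; the data, threshold, and replicate count below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented test set: probabilities loosely correlated with labels.
rng = np.random.default_rng(42)
n = 500
test_labels = rng.integers(0, 2, size=n)
test_probs = np.clip(
    0.5 + 0.3 * (test_labels - 0.5) + 0.2 * rng.standard_normal(n), 0, 1
)
threshold = 0.5  # assume this was chosen on a separate validation set

# Bootstrap the F1 score at the fixed threshold.
stats = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)   # resample with replacement
    preds = test_probs[idx] >= threshold
    stats.append(f1_score(test_labels[idx], preds))
lo, hi = np.percentile(stats, [2.5, 97.5])
print(f"F1 95% CI: [{lo:.3f}, {hi:.3f}]")
```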
Parting shot (a tiny rant and a tiny wisdom)
Choosing a threshold is the most human part of modeling: it requires values, priorities, and trade-offs. Your model gives you probabilities; your organization gives you consequences. Bring both to the table.
"Calibration gets you honesty. Thresholding gets you judgment. You need both."
Key takeaways:
- Use decision-theory (costs) when possible — it’s principled.
- Use PR-based thresholds for rare-event problems.
- Calibrate first, then threshold.
- Validate thresholds with held-out or cross-validated data to avoid overfitting.
Now go pick a threshold like you mean it — and if anyone says “just use 0.5,” ask them about their utility matrix and whether they enjoy false alarms.