Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
ROC Curves and AUC — The Art of Ranking, Not Guessing
You already know how to build a probabilistic classifier (thanks, logistic regression) and how to read a confusion matrix. Now we're going to sweep thresholds like a detective and measure the model's ranking power with style.
If you remember from previous sections: logistic regression gives you probabilities, the confusion matrix gives you counts at a fixed threshold, and precision/recall/F1 describe performance at that threshold. The ROC curve lifts you out of single-threshold handcuffs and asks: how good is my model at ranking positives above negatives as I slide the threshold from 1 to 0?
1) Quick refresher (so we can build rockets instead of repeating triangles)
- True Positive Rate (TPR) = TP / (TP + FN) — also called sensitivity or recall.
- False Positive Rate (FPR) = FP / (FP + TN) — proportion of negatives the model incorrectly calls positive.
ROC stands for Receiver Operating Characteristic and is a plot of TPR (y-axis) vs FPR (x-axis) as you vary the decision threshold over all possible values. Think of it as the path your model takes as it becomes greedier for positives.
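To make the two rates concrete, here's a minimal sketch computing TPR and FPR from hypothetical confusion-matrix counts (all numbers invented for illustration):

```python
# TPR and FPR from raw confusion-matrix counts (hypothetical numbers).
tp, fn = 80, 20    # positives: caught vs missed
fp, tn = 30, 170   # negatives: falsely flagged vs correctly rejected

tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)  # fraction of negatives flagged positive

print(tpr, fpr)  # 0.8 0.15
```

Each threshold you try produces one such (FPR, TPR) pair; the ROC curve is just all of those pairs connected.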
2) Building intuition: the party bouncer metaphor
Imagine your model is a bouncer at a club, and scores are how attractive someone looks on paper. You set a threshold: above it, you let people in (predict positive). If the bouncer is strict (high threshold), few people get in — low FPR, maybe also low TPR. If the bouncer is lax (low threshold), many get in — high TPR, but also high FPR.
The ROC curve traces how TPR increases as FPR increases while you relax the bouncer's standards. A perfect bouncer sits at (0,1): no false positives and all true positives. A random bouncer waddles along the diagonal from (0,0) to (1,1).
3) What AUC actually measures (and why it's elegant)
- AUC = area under the ROC curve. Numerically between 0 and 1.
- AUC = 1 means a perfect ranking. AUC = 0.5 means random ranking. AUC < 0.5 means your model is worse than random (or you can flip its predictions).
Important interpretation: AUC is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative instance. This is mathematically identical to the Mann-Whitney U statistic. So AUC cares about ordering, not calibrated probabilities.
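You can verify this pairwise interpretation by brute force — a small sketch with made-up scores, counting ties as half a win (the Mann-Whitney convention):

```python
# AUC as P(score of a random positive > score of a random negative),
# computed by brute force over all positive-negative pairs (ties count 0.5).
pos_scores = [0.95, 0.7, 0.4]  # hypothetical positive-class scores
neg_scores = [0.8, 0.5, 0.1]   # hypothetical negative-class scores

wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos_scores for n in neg_scores)
auc = wins / (len(pos_scores) * len(neg_scores))
print(auc)  # 6 of 9 pairs correctly ordered -> 0.666...
```

Monotonically rescaling the scores (say, squaring them) leaves every pairwise comparison — and therefore the AUC — unchanged, which is exactly why AUC ignores calibration.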
4) How to compute it (conceptually and in code)
Conceptually: sweep all thresholds, compute (FPR, TPR) pairs, then integrate the curve (trapezoidal rule). Practically: use trusted libraries.
Code snippet (Python/sklearn):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: {0,1} labels; y_scores: model.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
```

Note: for SVMs, use `decision_function` scores instead of probabilities — ROC only needs a ranking, not calibrated probabilities.
5) Choosing thresholds vs using AUC
- The ROC curve helps you choose a threshold based on trade-offs between FPR and TPR: maybe you want high sensitivity, or maybe low false alarms.
- A popular single-number threshold heuristic: Youden's J = TPR - FPR (or sensitivity + specificity - 1). Choose the threshold that maximizes J.
Caveat: Youden's J ignores class prevalence and unequal costs. If false positives are way worse than false negatives (or vice versa), weight accordingly.
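A sketch of threshold selection via Youden's J, using sklearn's `roc_curve` on invented labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical validation labels and model scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.35, 0.4, 0.5, 0.65, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
j = tpr - fpr                  # Youden's J at each candidate threshold
best = np.argmax(j)            # first threshold achieving the maximum J
print(thresholds[best], j[best])
```

If false positives and false negatives have different costs, replace `tpr - fpr` with a cost-weighted objective (e.g. `tpr - w * fpr`) before taking the argmax.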
6) When ROC/AUC is awesome — and when it's misleading
Why ROC/AUC is great:
- Threshold-agnostic: summarizes performance across thresholds.
- Ranking-focused: indifferent to calibration; good when ranking is the goal (e.g., information retrieval, prioritization).
- Comparative: useful for comparing models when the task is to rank.
When to be cautious:
- Heavy class imbalance: a model can get a decent AUC while being useless for identifying positives in practice. For very sparse positives, Precision-Recall curves often tell a more realistic story.
- Calibration ignorance: a model with perfect rank order (high AUC) may still give overconfident probabilities — if you need accurate probabilities, inspect a calibration plot and recalibrate with Platt scaling or isotonic regression.
- Business costs: equal weighting of TPR and FPR might not reflect real-world costs.
Quick rule: use ROC/AUC when you care about ranking; use PR curves when you care about actual positive-class precision, especially with imbalance.
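For the PR side, a quick sketch using sklearn's `precision_recall_curve` and `average_precision_score` on a small imbalanced toy set (all numbers invented):

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

# Imbalanced toy data: only 2 positives among 8 samples (hypothetical).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.8, 0.7, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)  # single-number PR summary
print(ap)
```

Average precision plays the same summarizing role for the PR curve that AUC plays for the ROC curve, but it is anchored to the positive class, so it drops sharply when rare positives are buried among high-scoring negatives.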
7) ROC vs Precision-Recall (cheat table)
| Aspect | ROC | Precision-Recall |
|---|---|---|
| Best for | Ranking performance across thresholds | Focus on positive class performance (precision) |
| Sensitive to class imbalance? | Less sensitive | Very sensitive (and realistic) |
| Y-axis | TPR (recall) | Precision |
| X-axis | FPR | Recall |
Short takeaway: when positives are rare, PR curves show whether your positives are actually correct.
8) Advanced notes & practical tips
- AUC confidence intervals: use bootstrapping or DeLong test if you need to know whether differences between models are statistically significant.
- Multiclass ROC: use one-vs-rest per class and compute macro/micro averaged AUCs (micro average aggregates contributions of all classes; macro average treats classes equally).
- If you only have binary decisions (no scores), ROC degenerates to a few points — not very informative.
- AUC = 0.5 is baseline; flip the model if AUC < 0.5 and you suddenly have AUC' = 1 - AUC.
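For the multiclass case, a sketch of one-vs-rest AUC on the iris dataset — macro averaging via `multi_class="ovr"`, and micro averaging by binarizing the labels first (training-set scores shown purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# Macro: average the per-class one-vs-rest AUCs, each class weighted equally.
macro = roc_auc_score(y, proba, multi_class="ovr", average="macro")

# Micro: pool all (class, sample) decisions by binarizing the labels.
Y = label_binarize(y, classes=[0, 1, 2])
micro = roc_auc_score(Y, proba, average="micro")
print(macro, micro)
```

In practice, compute these on a held-out set; training-set AUCs like these are optimistically high.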
9) Quick worked example (conceptual sweep)
Imagine 3 positives with scores 0.9, 0.6, 0.2 and 3 negatives with scores 0.8, 0.4, 0.1.
- Rank scores: 0.9(P), 0.8(N), 0.6(P), 0.4(N), 0.2(P), 0.1(N).
- Compute the probability a random P beats a random N: count favorable pairs / total pairs. Here, 0.9 beats all three negatives (3 pairs), 0.6 beats 0.4 and 0.1 (2 pairs), and 0.2 beats 0.1 (1 pair), so favorable pairs = 6/9 ≈ 0.67 => AUC ≈ 0.67 (better than random, but far from great). The ROC curve built from thresholds will reflect this.
This demonstrates AUC literally counts how often positives outrank negatives.
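You can check this count against sklearn directly — a sketch using the same six toy scores:

```python
from sklearn.metrics import roc_auc_score

# Same toy example: 3 positives (0.9, 0.6, 0.2), 3 negatives (0.8, 0.4, 0.1).
y_true = [1, 0, 1, 0, 1, 0]
y_scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]

auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.666... = 6 favorable pairs out of 9
```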
Closing: Quick checklist before you ship a model
- Are you using predicted scores, not hard labels, to compute ROC/AUC? Good.
- Do you also inspect PR curves if the positive class is rare? Do that.
- Do you need calibrated probabilities? Check calibration plots and consider Platt scaling or isotonic regression.
- When choosing a threshold, base it on costs (business impact), not just on Youden's J.
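For the calibration item on the checklist, a minimal sketch with sklearn's `calibration_curve` on invented validation probabilities:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical validation labels and predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
y_proba = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.7, 0.4, 0.6, 0.95, 0.35])

# Bin predictions and compare predicted probability to observed frequency.
frac_pos, mean_pred = calibration_curve(y_true, y_proba, n_bins=2)
# Points near the diagonal (frac_pos close to mean_pred) mean good calibration.
print(frac_pos, mean_pred)
```

If the curve bows away from the diagonal, wrap the model in `CalibratedClassifierCV` (sigmoid for Platt scaling, isotonic for isotonic regression) before trusting its probabilities.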
Final mic drop:
AUC measures ranking elegance, not moral goodness. It tells you who your model prefers, not whether that preference is honest or well-priced. Use ROC/AUC for ranking power, PR for positive precision, and calibration methods when you need your probabilities to mean something.
Recommended next steps: run sklearn's roc_curve and roc_auc_score on your validation set, plot the ROC and PR curves side-by-side, and then pick a threshold with explicit cost-aware reasoning.