Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Precision–Recall Curves and AUC-PR — The Romance of Rare Positives
"ROC is like the town square — everyone shows up. Precision–Recall is the speakeasy for the rare, interesting crowd."
You already met ROC curves in the previous lesson and you remember precision/recall/F1 like old friends. Now we dive into the guts of Precision–Recall (PR) curves and AUC-PR (Average Precision) — the metrics that actually care when the positive class is rare and costly (think fraud, disease, kidnapped kittens). This builds on the probabilistic outputs from logistic regression and on the thresholding ideas we've already covered.
Quick refresher (tight, not repetitive)
- From logistic regression we get probability estimates p(y=1|x). Good.
- Choose a threshold t to get class labels; varying t traces curves (ROC or PR).
- ROC plots True Positive Rate (Recall) vs False Positive Rate (FPR). PR plots Precision vs Recall.
Why PR instead of ROC? Because ROC can look shiny even when the model is useless on rare positives. PR zeroes in on the positive class performance — which is usually what you care about.
What is a PR curve, really?
- Recall (a.k.a. sensitivity, TPR): fraction of true positives we catch. Formula: TP / (TP + FN).
- Precision: fraction of predicted positives that are actually positive. Formula: TP / (TP + FP).
A Precision–Recall curve is a parametric curve obtained by sweeping the decision threshold t from 1 → 0 and plotting precision(t) vs recall(t). Each t gives a (precision, recall) point.
Think of it like: you lower the bar for saying 'this is positive' and you trade off: you catch more positives (higher recall) but also accept more false alarms (lower precision).
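A tiny hand-rolled sweep (made-up labels and scores) makes the trade-off concrete:

```python
import numpy as np

# Made-up labels and scores for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10])

results = {}
for t in (0.9, 0.5, 0.1):
    y_pred = (y_scores >= t).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    results[t] = (precision, recall)
    print(f"t={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering t from 0.9 to 0.1 here walks precision from 1.00 down to 0.40 while recall climbs from 0.25 to 1.00: exactly the trade described above.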
Visual intuition
- High recall, low precision: you're shouting "EVERYTHING IS A POSITIVE!" and you catch nearly every real positive, but most of your alarms are false.
- High precision, low recall: you're whispering "this is definitely positive" very selectively — fewer catches, but more trustworthy ones.
AUC-PR vs Average Precision vs AP smoothing
There are two things people often muddle:
- Area under the PR Curve (AUC-PR): geometric integral of the curve. If you compute the curve and take the area under it, you get AUC-PR. It depends on how you interpolate between points.
- Average Precision (AP) (scikit-learn’s default summary): weights the precision at each threshold by the increase in recall, AP = sum over n of (R_n − R_{n−1}) × P_n. This equals the area under the raw step-wise PR curve with no interpolation (scikit-learn deliberately avoids the optimistic interpolated variants used in some older benchmarks). AP emphasizes performance at the top of the ranking.
Important: AP has a useful connection to ranked outputs: if your classifier sorts instances by score, AP is the weighted mean of precisions at each true positive’s position.
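You can verify that ranked-output connection by hand and compare against scikit-learn (toy labels and scores):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy data: AP equals the mean of the precision values taken
# at each true positive's rank in the score-sorted list.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10])

order = np.argsort(-y_scores)                      # rank by descending score
labels = y_true[order]
hits = np.cumsum(labels)                           # true positives among the top k
precision_at_k = hits / np.arange(1, len(labels) + 1)
ap_manual = precision_at_k[labels == 1].mean()     # mean precision at TP ranks

ap_sklearn = average_precision_score(y_true, y_scores)
print(ap_manual, ap_sklearn)
```

With distinct scores the two values agree exactly; ties can merge thresholds and make the step-wise sum differ slightly from the per-rank mean.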
Expert take: AP is often more informative than a naive trapezoidal AUC estimate because it accounts for the discrete, ranked nature of predictions.
Baselines and why prevalence matters
Big warning label: the baseline AUC-PR is the prevalence of positives (the fraction of positives in your dataset). So if positives are 1%, a dumb random classifier has expected precision ~1% — your PR curve must beat that to be useful.
Contrast with ROC: a random classifier has AUC-ROC = 0.5 regardless of class balance. That’s why ROC can be misleading on imbalanced problems.
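A quick simulation illustrates both baselines (purely synthetic data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 1% prevalence, scores that carry no information at all
rng = np.random.default_rng(0)
n, prevalence = 100_000, 0.01
y_true = (rng.random(n) < prevalence).astype(int)
y_random = rng.random(n)

ap = average_precision_score(y_true, y_random)   # expected ~ prevalence (~0.01)
auc = roc_auc_score(y_true, y_random)            # expected ~ 0.5
print(f"AP ~ {ap:.3f} (baseline ~ prevalence), ROC AUC ~ {auc:.3f} (baseline = 0.5)")
```

An uninformative model scores roughly 0.5 on ROC AUC but only about 0.01 on AP here, which is why AP is the honest baseline for rare positives.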
Table: quick comparison
| Property | ROC AUC | PR AUC / AP |
|---|---|---|
| Sensitive to class imbalance? | Not much | Yes — baseline = prevalence |
| Focus | overall rank (pos vs neg) | performance on positives |
| Best used when | classes balanced OR costs symmetric | positive class rare or expensive |
Practical example (imagine): disease screening
You have 10,000 patients, only 100 have the disease (1%). Two models both get AUC-ROC = 0.95. One model's PR curve has AP=0.15, the other's AP=0.50. Which do you trust? The second — it yields more useful positive predictions. ROC lied to you about real-world utility.
Ask yourself: "Do I care about catching as many sick people as possible, or minimizing false alarms?" Your business answer chooses a point on the PR curve (a threshold).
How to compute (scikit-learn snippet)
```python
from sklearn.metrics import precision_recall_curve, average_precision_score

# y_true: binary labels (0/1)
# y_scores: continuous scores (e.g. probabilities from logistic regression)
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)
print('Average Precision (AP):', ap)
# Plot precision vs recall with your favorite plotting lib
```

Note: precision_recall_curve returns a precision array that is not necessarily monotonic; average_precision_score summarizes the raw step-wise curve without interpolation, so it does not inflate the score the way interpolated variants can.
Threshold selection using the PR curve
Want a single operating point? Options:
- Maximize F1 on the validation set (harmonic mean of precision and recall). Good default when precision and recall are equally important.
- Choose threshold where precision meets a business requirement (e.g., "precision must be >= 90%"), then maximize recall under that constraint.
- Optimize expected cost: assign costs to FP and FN and pick threshold minimizing expected cost.
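All three options can be read directly off the validation-set PR curve; here is a sketch with hypothetical data (the costs in option 3 are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels and scores (stand-ins for your data)
rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, 1000)
scores = np.clip(0.35 * y_val + 0.65 * rng.random(1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_val, scores)
p, r = precision[:-1], recall[:-1]   # align with thresholds (one extra endpoint)

# Option 1: maximize F1
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
t_f1 = thresholds[np.argmax(f1)]

# Option 2: best recall subject to precision >= 0.90
ok = p >= 0.90
t_constrained = thresholds[ok][np.argmax(r[ok])] if ok.any() else None

# Option 3: minimize expected cost (illustrative costs: FP = 1, FN = 10)
cost = [(scores >= t)[y_val == 0].sum() * 1 + (scores < t)[y_val == 1].sum() * 10
        for t in thresholds]
t_cost = thresholds[np.argmin(cost)]

print(t_f1, t_constrained, t_cost)
```

Note the three criteria generally pick three different thresholds; the business question decides which one ships.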
Probing question: what happens to the PR curve when we calibrate probabilities? Calibration (Platt/sigmoid scaling, isotonic regression) makes the probabilities more faithful. The PR curve depends only on the ordering of the scores, and these maps are monotonic, so the curve and AP are essentially unchanged (isotonic regression can introduce ties, which may merge a few points). What does change is the meaning of any fixed threshold and how much you can trust the probabilities themselves.
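A quick demonstration that a monotonic rescaling leaves AP untouched (synthetic margins; the sigmoid here stands in for Platt scaling):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Synthetic uncalibrated scores: real-valued margins, not probabilities
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 2000)
raw = rng.normal(loc=y.astype(float), scale=1.0)

# A strictly monotonic map (a sigmoid, as in Platt scaling) turns margins
# into probability-like values but preserves the ordering, so AP is unchanged.
calibrated = 1.0 / (1.0 + np.exp(-raw))

ap_raw = average_precision_score(y, raw)
ap_cal = average_precision_score(y, calibrated)
print(ap_raw, ap_cal)
```

Same ranking, same AP: calibration buys you trustworthy thresholds and probabilities, not a better PR curve.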
Pitfalls and gotchas
- Precision is undefined (0/0) when no positives are predicted; handle this gracefully (scikit-learn's precision_recall_curve pins the zero-recall endpoint at precision = 1 by convention).
- PR curves can be jagged for small sample sizes; use smoothing or bootstrapped confidence bands for reliability.
- AP and AUC-PR are dataset-dependent; report the prevalence or use cross-validation.
- Comparing models: differences in AUC-PR are meaningful only if sampling and prevalence are consistent.
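For the confidence-band point above, a percentile bootstrap over the evaluation set is a common sketch (function and data names here are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_ap_ci(y_true, y_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for Average Precision."""
    rng = np.random.default_rng(seed)
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    aps = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].sum() == 0:   # resample drew no positives: AP undefined, skip
            continue
        aps.append(average_precision_score(y_true[idx], y_scores[idx]))
    return np.quantile(aps, [alpha / 2, 1 - alpha / 2])

# Demo on synthetic imbalanced data (10% positives)
rng = np.random.default_rng(0)
y_demo = (rng.random(500) < 0.10).astype(int)
s_demo = np.clip(0.5 * y_demo + 0.7 * rng.random(500), 0, 1)
lo, hi = bootstrap_ap_ci(y_demo, s_demo, n_boot=300)
print(f"AP 95% CI: [{lo:.3f}, {hi:.3f}]")
```

On small or heavily imbalanced test sets the interval is often uncomfortably wide, which is exactly the point of reporting it.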
Multiclass PR
You can extend PR to multiclass via one-vs-rest and compute macro or micro averages:
- micro-average: aggregate contributions of all classes; good when each instance is equally important.
- macro-average: average class-wise APs; treats classes equally regardless of prevalence.
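A minimal one-vs-rest sketch with scikit-learn (toy data; label_binarize builds the indicator matrix):

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Toy 3-class problem with made-up per-class scores
rng = np.random.default_rng(3)
y = rng.integers(0, 3, 600)
Y = label_binarize(y, classes=[0, 1, 2])     # one-vs-rest indicator matrix

# Scores need not be probabilities; here the true class gets a +0.5 bump
scores = rng.random((600, 3)) + 0.5 * Y

micro_ap = average_precision_score(Y, scores, average='micro')
macro_ap = average_precision_score(Y, scores, average='macro')
print(micro_ap, macro_ap)
```

If one class dominates, micro and macro AP can diverge sharply; reporting both is cheap and informative.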
Quick checklist (before you ship your metric)
- Is the positive class rare or high-cost? Use PR/AP.
- Report prevalence along with AP.
- Use cross-validated AP and bootstrapped CIs if possible.
- Pick thresholds based on operational constraints (precision target or cost function), not just F1.
Closing: TL;DR (with attitude)
- PR curves focus on what matters when positives are rare: how precise are your positive predictions as you increase recall?
- AUC-PR / AP summarize that curve — but remember the baseline is prevalence. Don't be fooled by a shiny ROC when your business will drown in false positives.
- Use PR curves to choose thresholds that align with business needs, and always validate with bootstrapping or cross-validation.
Final thought: ROC is the generalist; PR is the specialist. When the stakes are catching the few precious positives, take the specialist — and remember to calibrate your confidence and pick thresholds like you actually care about outcomes.
Version notes: This piece builds directly on our ROC and basic-metrics discussion and assumes you get probabilistic scores from logistic regression or similar models.