Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Precision–Recall Curves and AUC-PR — The Romance of Rare Positives
"ROC is like the town square — everyone shows up. Precision–Recall is the speakeasy for the rare, interesting crowd."
You already met ROC curves in the previous lesson and you remember precision/recall/F1 like old friends. Now we dive into the guts of Precision–Recall (PR) curves and AUC-PR (Average Precision) — the metrics that actually care when the positive class is rare and costly (think fraud, disease, kidnapped kittens). This builds on the probabilistic outputs from logistic regression and on the thresholding ideas we've already covered.
Quick refresher (tight, not repetitive)
- From logistic regression we get probability estimates p(y=1|x). Good.
- Choose a threshold t to get class labels; varying t traces curves (ROC or PR).
- ROC plots True Positive Rate (Recall) vs False Positive Rate (FPR). PR plots Precision vs Recall.
Why PR instead of ROC? Because ROC can look shiny even when the model is useless on rare positives. PR zeroes in on the positive class performance — which is usually what you care about.
What is a PR curve, really?
- Recall (a.k.a. sensitivity, TPR): fraction of true positives we catch. Formula: TP / (TP + FN).
- Precision: fraction of predicted positives that are actually positive. Formula: TP / (TP + FP).
A Precision–Recall curve is a parametric curve obtained by sweeping the decision threshold t from 1 → 0 and plotting precision(t) vs recall(t). Each t gives a (precision, recall) point.
Think of it like: you lower the bar for saying 'this is positive' and you trade off: you catch more positives (higher recall) but also accept more false alarms (lower precision).
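A tiny hand-rolled sweep (made-up labels and scores) makes the trade-off concrete:

```python
import numpy as np

# Made-up labels and scores for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10])

results = {}
for t in (0.9, 0.5, 0.1):
    y_pred = (y_scores >= t).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    results[t] = (precision, recall)
    print(f"t={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering t from 0.9 to 0.1 here walks precision from 1.00 down to 0.40 while recall climbs from 0.25 to 1.00: exactly the trade described above.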
Visual intuition
- High recall, low precision: you're shouting "EVERYTHING IS A POSITIVE!" and you catch nearly every real positive, but most of your alarms are false.
- High precision, low recall: you're whispering "this is definitely positive" very selectively — fewer catches, but more trustworthy ones.
AUC-PR vs Average Precision vs AP smoothing
There are two things people often muddle:
- Area under the PR Curve (AUC-PR): geometric integral of the curve. If you compute the curve and take the area under it, you get AUC-PR. It depends on how you interpolate between points.
- Average Precision (AP) (scikit-learn’s default summary): weights the precision at each threshold by the increase in recall, AP = sum over n of (R_n − R_{n−1}) × P_n. This equals the area under the raw step-wise PR curve with no interpolation (scikit-learn deliberately avoids the optimistic interpolated variants used in some older benchmarks). AP emphasizes performance at the top of the ranking.
Important: AP has a useful connection to ranked outputs: if your classifier sorts instances by score, AP is the weighted mean of precisions at each true positive’s position.
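You can verify that ranked-output connection by hand and compare against scikit-learn (toy labels and scores):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy data: AP equals the mean of the precision values taken
# at each true positive's rank in the score-sorted list.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10])

order = np.argsort(-y_scores)                      # rank by descending score
labels = y_true[order]
hits = np.cumsum(labels)                           # true positives among the top k
precision_at_k = hits / np.arange(1, len(labels) + 1)
ap_manual = precision_at_k[labels == 1].mean()     # mean precision at TP ranks

ap_sklearn = average_precision_score(y_true, y_scores)
print(ap_manual, ap_sklearn)
```

With distinct scores the two values agree exactly; ties can merge thresholds and make the step-wise sum differ slightly from the per-rank mean.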
Expert take: AP is often more informative than a naive trapezoidal AUC estimate because it accounts for the discrete, ranked nature of predictions.
Baselines and why prevalence matters
Big warning label: the baseline AUC-PR is the prevalence of positives (the fraction of positives in your dataset). So if positives are 1%, a dumb random classifier has expected precision ~1% — your PR curve must beat that to be useful.
Contrast with ROC: a random classifier has AUC-ROC = 0.5 regardless of class balance. That’s why ROC can be misleading on imbalanced problems.
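A quick simulation illustrates both baselines (purely synthetic data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 1% prevalence, scores that carry no information at all
rng = np.random.default_rng(0)
n, prevalence = 100_000, 0.01
y_true = (rng.random(n) < prevalence).astype(int)
y_random = rng.random(n)

ap = average_precision_score(y_true, y_random)   # expected ~ prevalence (~0.01)
auc = roc_auc_score(y_true, y_random)            # expected ~ 0.5
print(f"AP ~ {ap:.3f} (baseline ~ prevalence), ROC AUC ~ {auc:.3f} (baseline = 0.5)")
```

An uninformative model scores roughly 0.5 on ROC AUC but only about 0.01 on AP here, which is why AP is the honest baseline for rare positives.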
Table: quick comparison
| Property | ROC AUC | PR AUC / AP |
|---|---|---|
| Sensitive to class imbalance? | Not much | Yes — baseline = prevalence |
| Focus | overall rank (pos vs neg) | performance on positives |
| Best used when | classes balanced OR costs symmetric | positive class rare or expensive |
Practical example (imagine): disease screening
You have 10,000 patients, only 100 have the disease (1%). Two models both get AUC-ROC = 0.95. One model's PR curve has AP=0.15, the other's AP=0.50. Which do you trust? The second — it yields more useful positive predictions. ROC lied to you about real-world utility.
Ask yourself: "Do I care about catching as many sick people as possible, or minimizing false alarms?" Your business answer chooses a point on the PR curve (a threshold).
How to compute (scikit-learn snippet)
```python
from sklearn.metrics import precision_recall_curve, average_precision_score

# y_true: binary labels (0/1)
# y_scores: continuous scores (e.g. probabilities from logistic regression)
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)
print('Average Precision (AP):', ap)
# Plot precision vs recall with your favorite plotting lib
```

Note: precision_recall_curve returns a precision array that is not necessarily monotonic; average_precision_score summarizes the raw step-wise curve without interpolation, so it does not inflate the score the way interpolated variants can.
Threshold selection using the PR curve
Want a single operating point? Options:
- Maximize F1 on the validation set (harmonic mean of precision and recall). Good default when precision and recall are equally important.
- Choose threshold where precision meets a business requirement (e.g., "precision must be >= 90%"), then maximize recall under that constraint.
- Optimize expected cost: assign costs to FP and FN and pick threshold minimizing expected cost.
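All three options can be read directly off the validation-set PR curve; here is a sketch with hypothetical data (the costs in option 3 are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels and scores (stand-ins for your data)
rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, 1000)
scores = np.clip(0.35 * y_val + 0.65 * rng.random(1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_val, scores)
p, r = precision[:-1], recall[:-1]   # align with thresholds (one extra endpoint)

# Option 1: maximize F1
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
t_f1 = thresholds[np.argmax(f1)]

# Option 2: best recall subject to precision >= 0.90
ok = p >= 0.90
t_constrained = thresholds[ok][np.argmax(r[ok])] if ok.any() else None

# Option 3: minimize expected cost (illustrative costs: FP = 1, FN = 10)
cost = [(scores >= t)[y_val == 0].sum() * 1 + (scores < t)[y_val == 1].sum() * 10
        for t in thresholds]
t_cost = thresholds[np.argmin(cost)]

print(t_f1, t_constrained, t_cost)
```

Note the three criteria generally pick three different thresholds; the business question decides which one ships.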
Probing question: what happens to the PR curve when we calibrate probabilities? Calibration (Platt/sigmoid scaling, isotonic regression) makes the probabilities more faithful. The PR curve depends only on the ordering of the scores, and these maps are monotonic, so the curve and AP are essentially unchanged (isotonic regression can introduce ties, which may merge a few points). What does change is the meaning of any fixed threshold and how much you can trust the probabilities themselves.
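A quick demonstration that a monotonic rescaling leaves AP untouched (synthetic margins; the sigmoid here stands in for Platt scaling):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Synthetic uncalibrated scores: real-valued margins, not probabilities
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 2000)
raw = rng.normal(loc=y.astype(float), scale=1.0)

# A strictly monotonic map (a sigmoid, as in Platt scaling) turns margins
# into probability-like values but preserves the ordering, so AP is unchanged.
calibrated = 1.0 / (1.0 + np.exp(-raw))

ap_raw = average_precision_score(y, raw)
ap_cal = average_precision_score(y, calibrated)
print(ap_raw, ap_cal)
```

Same ranking, same AP: calibration buys you trustworthy thresholds and probabilities, not a better PR curve.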
Pitfalls and gotchas
- Precision is undefined (0/0) when no positives are predicted; handle this gracefully (scikit-learn's precision_recall_curve pins the zero-recall endpoint at precision = 1 by convention).
- PR curves can be jagged for small sample sizes; use smoothing or bootstrapped confidence bands for reliability.
- AP and AUC-PR are dataset-dependent; report the prevalence or use cross-validation.
- Comparing models: differences in AUC-PR are meaningful only if sampling and prevalence are consistent.
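For the confidence-band point above, a percentile bootstrap over the evaluation set is a common sketch (function and data names here are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_ap_ci(y_true, y_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for Average Precision."""
    rng = np.random.default_rng(seed)
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    aps = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].sum() == 0:   # resample drew no positives: AP undefined, skip
            continue
        aps.append(average_precision_score(y_true[idx], y_scores[idx]))
    return np.quantile(aps, [alpha / 2, 1 - alpha / 2])

# Demo on synthetic imbalanced data (10% positives)
rng = np.random.default_rng(0)
y_demo = (rng.random(500) < 0.10).astype(int)
s_demo = np.clip(0.5 * y_demo + 0.7 * rng.random(500), 0, 1)
lo, hi = bootstrap_ap_ci(y_demo, s_demo, n_boot=300)
print(f"AP 95% CI: [{lo:.3f}, {hi:.3f}]")
```

On small or heavily imbalanced test sets the interval is often uncomfortably wide, which is exactly the point of reporting it.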
Multiclass PR
You can extend PR to multiclass via one-vs-rest and compute macro or micro averages:
- micro-average: aggregate contributions of all classes; good when each instance is equally important.
- macro-average: average class-wise APs; treats classes equally regardless of prevalence.
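A minimal one-vs-rest sketch with scikit-learn (toy data; label_binarize builds the indicator matrix):

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Toy 3-class problem with made-up per-class scores
rng = np.random.default_rng(3)
y = rng.integers(0, 3, 600)
Y = label_binarize(y, classes=[0, 1, 2])     # one-vs-rest indicator matrix

# Scores need not be probabilities; here the true class gets a +0.5 bump
scores = rng.random((600, 3)) + 0.5 * Y

micro_ap = average_precision_score(Y, scores, average='micro')
macro_ap = average_precision_score(Y, scores, average='macro')
print(micro_ap, macro_ap)
```

If one class dominates, micro and macro AP can diverge sharply; reporting both is cheap and informative.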
Quick checklist (before you ship your metric)
- Is the positive class rare or high-cost? Use PR/AP.
- Report prevalence along with AP.
- Use cross-validated AP and bootstrapped CIs if possible.
- Pick thresholds based on operational constraints (precision target or cost function), not just F1.
Closing: TL;DR (with attitude)
- PR curves focus on what matters when positives are rare: how precise are your positive predictions as you increase recall?
- AUC-PR / AP summarize that curve — but remember the baseline is prevalence. Don't be fooled by a shiny ROC when your business will drown in false positives.
- Use PR curves to choose thresholds that align with business needs, and always validate with bootstrapping or cross-validation.
Final thought: ROC is the generalist; PR is the specialist. When the stakes are catching the few precious positives, take the specialist — and remember to calibrate your confidence and pick thresholds like you actually care about outcomes.
Version notes: This piece builds directly on our ROC and basic-metrics discussion and assumes you get probabilistic scores from logistic regression or similar models.