Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Classification II — Accuracy, Precision, Recall, F1
You already know how logistic regression gives you probabilities and how a confusion matrix lays out our sins and virtues. Now let's stop pretending a single number tells the whole truth.
Hook: The Diet Coke Thought Experiment
Imagine your classifier is a friend who promises you snacks. It says, with 80% confidence, "I will bring you a Diet Coke." Sometimes it does, sometimes it brings you chips, sometimes nothing. You must decide: do you trust that 80% and wait, or do you go buy your own drinks? Metrics are the baked-in disappointment detector and the trust-meter rolled into one.
You learned earlier that logistic regression gives probabilistic outputs. Those probabilities need to be turned into decisions. How good are those decisions? That's where accuracy, precision, recall, and F1 live.
Quick refresher (no rehashing the entire confusion-matrix lecture)
Here's the minimal reminder — the confusion matrix counts:
| Predicted \ Actual | Positive (P) | Negative (N) |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
You saw this in Classification II: Confusion Matrix Anatomy. Now let's build the metrics from those four counts.
The Metrics: formulas and plain English
- Accuracy: fraction correct.
accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision (also called positive predictive value): "Of the things we predicted positive, how many actually were positive?"
precision = TP / (TP + FP)
- Recall (a.k.a. sensitivity, true positive rate): "Of all actual positives, how many did we catch?"
recall = TP / (TP + FN)
- F1 score: harmonic mean of precision and recall — punishes lopsidedness.
F1 = 2 * (precision * recall) / (precision + recall)
Why harmonic mean? Because if precision is 0.99 and recall is 0.01, the arithmetic mean would lie to you. The harmonic mean forces the score to reflect the bottleneck.
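To make the formulas concrete, here is a minimal Python sketch that computes all four from the confusion-matrix counts (the function name and the guards against empty denominators are mine, not from any particular library):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```

The same arithmetic shows up again in the worked example below.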
When accuracy lies (and why it will seduce you)
Imagine a fraud detection dataset where only 1% of transactions are fraudulent. A model that always predicts "not fraud" gets 99% accuracy. Congratulations, it's a master at being useless.
Accuracy is a blunt instrument when classes are imbalanced. It rewards majority-class guessing and hides the model's inability to find the rare, important class.
Ask yourself: what do I actually care about? False alarms or missed alarms? That determines whether you look at precision or recall.
Precision vs Recall: the legal drama
- High precision, low recall = you only shout 'guilty' when you're sure. Few false positives, many false negatives. Useful when false alarms are costly (e.g., banning users wrongly).
- High recall, low precision = you shout 'guilty' often, capturing more real criminals but also scaring many innocents. Useful when missing a positive is costly (e.g., disease screening).
Think of it as a courtroom: precision is the prosecutor's accuracy when they accuse; recall is the court's ability to find all guilty people.
Thresholding — where the rubber meets the road
Logistic regression gives you P(y=1|x). To turn probability into a label you pick a threshold t (default 0.5). But this is your lever.
- Lower t => more positives => recall goes up, precision may go down.
- Higher t => fewer positives => precision goes up, recall may go down.
This trade-off is continuous: sweep t from 0 to 1 and trace precision and recall. That's the precision-recall curve. If classes are imbalanced, PR curves give more informative views than ROC curves.
Quick code sketch (a runnable version of the pseudocode, assuming scikit-learn; `probs` and `y_true` stand in for your predicted probabilities and true labels):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

thresholds = np.linspace(0, 1, 100)
preds = [(probs >= t).astype(int) for t in thresholds]
precisions = [precision_score(y_true, p, zero_division=0) for p in preds]
recalls = [recall_score(y_true, p) for p in preds]
# plot recalls (x) against precisions (y) to trace the precision-recall curve
```
Question: what threshold gives you the best F1? Often you choose the t that maximizes F1 on validation data. But remember: F1 implicitly treats precision and recall as equally important — which isn't always your world.
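Here is one way to do that search on a validation set, sketched with scikit-learn (`y_val` and `val_probs` are placeholder names for your validation labels and predicted probabilities):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thr = precision_recall_curve(y_val, val_probs)
f1 = 2 * prec * rec / (prec + rec + 1e-12)  # small epsilon avoids 0/0
best_t = thr[np.argmax(f1[:-1])]            # last (prec, rec) pair has no threshold
```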
Calibration: probabilities you can trust
Calibration answers: when you say 0.8, is it actually 80%? Logistic regression is often fairly well-calibrated, but not always — especially in high-dimensional sparse settings or if you've overfit (remember Classification I: Sparse High-Dimensional and Overfitting issues). Overfitting ruins calibration: probabilities become overconfident.
Calibration matters because thresholding decisions assume the predicted probabilities are meaningful. If your model is miscalibrated, picking t based on those probabilities is like trusting a broken thermostat.
Tools:
- Reliability diagrams / calibration plots
- Platt scaling or isotonic regression to recalibrate (see the sketch below)
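A minimal scikit-learn sketch of both tools, assuming you have validation labels `y_val`, predicted probabilities `probs`, and training data `X_train`, `y_train` (all placeholder names):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.linear_model import LogisticRegression

# Reliability diagram data: observed frequency vs. mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_val, probs, n_bins=10)

# Recalibration: method="sigmoid" is Platt scaling, method="isotonic" is isotonic regression
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
calibrated_probs = calibrated.predict_proba(X_val)[:, 1]
```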
Worked example (small, satisfying math)
Suppose 1000 samples, 50 positives (disease). Your model at threshold 0.5 yields: TP=30, FN=20, FP=40, TN=910.
Compute:
- Accuracy = (30 + 910) / 1000 = 0.94
- Precision = 30 / (30 + 40) = 0.4286
- Recall = 30 / (30 + 20) = 0.6
- F1 = 2 * 0.4286 * 0.6 / (0.4286 + 0.6) ≈ 0.5
So 94% accuracy sounds great; a precision of about 43% tells another story. If positive means disease, more than half of the people flagged as sick are actually healthy, which may be unacceptable.
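If you want to sanity-check that arithmetic with scikit-learn, you can rebuild label arrays matching those counts (purely illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 30 TP, 20 FN, 40 FP, 910 TN
y_true = np.array([1] * 30 + [1] * 20 + [0] * 40 + [0] * 910)
y_pred = np.array([1] * 30 + [0] * 20 + [1] * 40 + [0] * 910)

print(accuracy_score(y_true, y_pred))   # 0.94
print(precision_score(y_true, y_pred))  # ~0.4286
print(recall_score(y_true, y_pred))     # 0.6
print(f1_score(y_true, y_pred))         # 0.5
```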
Practical checklist (what to do in your model pipeline)
- Always inspect confusion matrix and class balance before worshiping accuracy.
- Choose metrics based on cost of FP vs FN (precision vs recall).
- Sweep thresholds and plot precision-recall curves; choose threshold on validation set, not test.
- Check calibration. If probabilities are off, recalibrate before thresholding.
- Use F1 when you care about a balance; use F-beta if you prefer recall (beta>1) or precision (beta<1).
- For imbalanced data, prefer PR curves and average precision over ROC AUC (see the sketch after this list).
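For the last two items, the relevant scikit-learn calls look roughly like this (`y_true`, `y_pred`, and `probs` are placeholders):

```python
from sklearn.metrics import fbeta_score, average_precision_score

f2 = fbeta_score(y_true, y_pred, beta=2)        # beta > 1 weights recall more heavily
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # beta < 1 favors precision
ap = average_precision_score(y_true, probs)     # PR-curve summary (average precision)
```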
Expert take: metrics are not properties of models alone; they are dialogues between your model, your threshold, your data distribution, and the business cost function. Treat them like conversation, not commandments.
Closing — the tiny, dramatic moral
Metrics are maps, not territories. Accuracy is a lazy map that shows the highway but hides the cliffs. Precision and recall are the binoculars that let you spot what matters. F1 tries to be diplomatic when you're torn.
Remember your training in probabilistic modeling: make the probabilities honest (calibration), then pick a threshold that aligns with your real-world costs. If you ignore that, your 94% accuracy will crash the party — and you'll be the only one applauding.
Key takeaways:
- Don't trust accuracy on imbalanced data.
- Precision and recall answer different operational questions — pick the one that maps to your cost of mistakes.
- Thresholds and calibration change everything; validate them.
- Use F1 (or F-beta) when you want a single-scalar summary that balances concerns.
Go forth, calibrate those probabilities, and may your TP be plentiful and your FN few.