Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Confusion Matrix Anatomy — The Diagnostic Table That Tells You If Your Model Is a Hero or a Villain
"Probabilities are great. But when you pull the emergency brake and pick a threshold, everything either lives or dies — welcome to the confusion matrix."
You already learned how logistic regression hands you probabilities (see: Classification I). Nice, soft numbers between 0 and 1. But real-world decisions usually want actions: accept, reject, alarm, send to manual review. The confusion matrix is the little table that records what happens when probabilistic models make decisions — and it’s the backbone of nearly every classification metric you’ll meet.
What is a confusion matrix? Quick sketch
A confusion matrix is a 2×2 contingency table for binary classification that counts outcomes when predictions meet reality.
| Actual \ Predicted | Positive (1) | Negative (0) |
|---|---|---|
| Positive (1) | True Positive (TP) | False Negative (FN) |
| Negative (0) | False Positive (FP) | True Negative (TN) |
- TP: Model said "yes" and reality was "yes".
- TN: Model said "no" and reality was "no".
- FP: Model cried wolf — predicted "yes" but it was "no".
- FN: Model missed it — predicted "no" but it was actually "yes".
Imagine a COVID test or a fraud detector. A false negative might be someone infected who gets sent home; a false positive might be an innocent person flagged for quarantine.
Why the confusion matrix matters (more than accuracy alone)
If your dataset has 99% negatives and your model predicts "negative" always, you get 99% accuracy. Party time? No. This is the accuracy paradox. The confusion matrix reveals the imbalance hidden by accuracy.
Consider these derived quantities (the usual suspects):
- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Precision (Positive Predictive Value) = TP / (TP + FP)
- Recall (Sensitivity, True Positive Rate) = TP / (TP + FN)
- Specificity (True Negative Rate) = TN / (TN + FP)
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Each of these is just a function of the four cells. Think of them as different drugs you can take depending on whether you want to be conservative (precision) or inclusive (recall).
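The formulas above can be sketched directly in Python. The counts here are illustrative round numbers, not from any real model:

```python
# Four cells of a confusion matrix (illustrative counts)
TP, FP, TN, FN = 30, 10, 50, 10

accuracy = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)      # how trustworthy a "yes" is
recall = TP / (TP + FN)         # how many real positives we catch
specificity = TN / (TN + FP)    # how many real negatives we keep
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

Notice that every metric is a ratio of sums of the same four integers; nothing more exotic is happening under the hood.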
Anatomy walk-through — cell by cell (with analogies)
- TP — The model correctly detects a fraud. Gold star. But maximizing TP alone isn't the goal: you can inflate TP by flagging everything.
- TN — Correctly ignores a normal transaction. Silent victory.
- FP — False alarm. It's the "crying wolf" tax. Cost depends on context: UX annoyance vs. expensive manual review.
- FN — The true silent killer. Missed fraud, missed disease. Often the most costly.
Ask yourself: which is worse in your problem — FP or FN? That drives thresholds and metrics.
From probabilities to cells: thresholding
Your logistic regression gives p(y=1 | x). To make the confusion matrix you pick a threshold t and predict positive if p >= t.
In Python (scikit-learn does the counting for you):

```python
from sklearn.metrics import confusion_matrix

# probs: NumPy array of model probabilities; y_true: array of true labels (0/1)
threshold = 0.5  # arbitrary default; tune it to your costs
y_pred = (probs >= threshold).astype(int)
confusion = confusion_matrix(y_true, y_pred)
# scikit-learn returns [[TN, FP], [FN, TP]] for labels ordered [0, 1]
```
Vary t from 0 to 1 and the (TP, FP, TN, FN) move like tectonic plates — that's the essence of ROC and Precision-Recall curves.
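You can watch that movement on a toy example. The scores and labels below are made up for illustration; the point is how raising the threshold trades false positives for false negatives:

```python
import numpy as np

# Toy labels and model scores (illustrative, not from a trained model)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs  = np.array([0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9])

for t in (0.3, 0.5, 0.7):
    y_pred = (probs >= t).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    print(f"t={t}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

As t rises, FP falls and FN rises; each threshold is one point on the ROC or Precision-Recall curve.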
Prevalence matters — class imbalance and base rates
Prevalence (also called base rate) = (TP + FN) / N = proportion of positives in population.
High/low prevalence changes interpretation:
- With very low prevalence, precision tends to suffer even when recall is decent: the negatives vastly outnumber the positives, so even a small false-positive rate on the negatives produces a flood of false alarms.
- Balanced accuracy or Matthews Correlation Coefficient (MCC) are useful when classes are imbalanced because they combine sensitivity and specificity in a more symmetric way.
Quick formulas:
- Balanced Accuracy = (Sensitivity + Specificity) / 2
- MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
MCC ranges from -1 (total disagreement) to +1 (perfect prediction). Use it when you want a single number that respects imbalance.
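Both formulas are a few lines of Python. The counts below are illustrative:

```python
import math

# Four cells of a confusion matrix (illustrative counts)
TP, FP, TN, FN = 40, 200, 750, 10

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
balanced_accuracy = (sensitivity + specificity) / 2

mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)
print(f"balanced_accuracy={balanced_accuracy:.3f} mcc={mcc:.3f}")
```

Here plain accuracy would be 79%, but MCC comes out around 0.30, a much more sober verdict on an imbalanced problem.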
Example confusions: a tiny table that tells stories
Suppose N=1000, prevalence 5% (50 positives). Two models:
Model A: Always predict negative -> TP=0, FN=50, TN=950, FP=0 -> Accuracy = 95% (still terrible at the job)
Model B: Catches 40 of the 50 positives but also flags 200 negatives as positive -> TP=40, FN=10, FP=200, TN=750 ->
- Precision = 40 / (40+200) = 0.167
- Recall = 40 / 50 = 0.8
Which is better? Depends on cost. If missing a positive is catastrophic, Model B (high recall) is better despite low precision. If every false alarm costs money, maybe not.
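One way to settle "which is better?" is to put explicit prices on each error type. The cost figures below are made up for illustration; plug in your own:

```python
# Hypothetical costs: a missed positive (FN) costs 100, a false alarm (FP) costs 5
cost_fn, cost_fp = 100, 5

model_a = {"FN": 50, "FP": 0}     # always predicts negative
model_b = {"FN": 10, "FP": 200}   # catches 40 of 50 positives

for name, m in (("A", model_a), ("B", model_b)):
    total = m["FN"] * cost_fn + m["FP"] * cost_fp
    print(f"Model {name}: expected cost = {total}")
```

Under these costs Model B is far cheaper (2000 vs 5000); flip the cost ratio and the verdict flips with it.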
Calibration meets the confusion matrix
You learned about models that output probabilities (Classification I). Calibration asks: do those probabilities mean anything? If a model outputs 0.8 for a hundred cases, do ~80 of them truly belong to the positive class?
A calibrated model makes thresholding meaningful: you can set a threshold using expected costs. An uncalibrated model might rank examples OK (so ROC AUC is decent) but its absolute probabilities are misleading for decision thresholds.
Practical note: calibrate when you need reliable probabilities (Platt scaling, isotonic regression). If you just need ranking, calibration isn't strictly necessary.
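A minimal sketch of calibration with scikit-learn, on synthetic imbalanced data (isotonic regression is one of the methods mentioned above; the dataset and parameters here are arbitrary choices for illustration):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (~10% positives)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the base classifier so its predict_proba outputs are recalibrated
base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)[:, 1]
print("mean predicted positive rate:", probs.mean())
print("actual positive rate:", y_test.mean())
```

If calibration worked, the mean predicted probability should sit close to the actual positive rate, which is exactly what makes cost-based thresholds trustworthy.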
Quick decision guide: what metric to use
- You care about catching positives (e.g., disease screening): Maximize recall; monitor precision; use F1 if you want balance.
- You care about avoiding false alarms (e.g., expensive manual review): Maximize precision; track recall.
- Data heavily imbalanced: Use AUC-ROC for ranking; PR-AUC or MCC or Balanced Accuracy for thresholds.
Closing: The moral of the matrix
The confusion matrix is small but mercilessly honest. It forces you to name the costs of errors and to choose thresholds intentionally. Probabilities from logistic regression are powerful — but turning them into actions without consulting the matrix is like sending troops into battle with a coin flip.
"Metrics are not just numbers — they're moral choices in disguise. Which errors are you willing to live with?"
Key takeaways:
- Build the confusion matrix after you threshold probabilities; it's the source for all classification metrics.
- Accuracy can lie — always inspect TP, FP, TN, FN.
- Thresholding links probabilities to action; calibration makes those probabilities trustworthy.
- Choose metrics based on domain cost: precision vs recall vs a balanced view (F1, MCC).
Go forth: pick a threshold, compute the matrix, and ask the hard question — what happens if I get this wrong? Your model (and stakeholders) will thank you — or sue you. Either way, you'll know why.