Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Cost Curves and Expected Utility — The Glorious Economics of Decisions
"Metrics are cute, but dollars (or lives, or server time) pay the bills." — Your friendly decision-theory TA
You're already fresh off learning how to pick thresholds and read precision–recall curves, and you know how logistic regression gives you probabilities instead of just binary verdicts. Now we ask: how do we turn those probabilities into decisions that maximize what actually matters — utility (or, equivalently, minimize cost)? Welcome to cost curves and expected utility: the place where math meets money and moral dilemmas (false positives vs false negatives).
What's the point (quick)?
If you can estimate P(y=1 | x) (hello, logistic regression), the optimal decision depends not just on that probability but on the relative costs of mistakes and the class prevalence. Cost curves are a way to visualize how a classifier performs across all possible trade-offs between those costs and prevalence — and expected utility tells you which threshold to pick once you've specified costs.
The setup: costs, errors, and expected cost
Imagine a binary classifier. There are two mistakes:
- False Positive (FP): predict 1 when true label = 0. Cost: C_FP
- False Negative (FN): predict 0 when true label = 1. Cost: C_FN
(Yes, you can call them "annoying consequences" instead — costs can be monetary, reputational, or life-or-death.)
Given a threshold t on the model's score s(x) (or on P(y=1|x)), define:
- FPR_t = P(pred=1 | y=0) at threshold t
- FNR_t = P(pred=0 | y=1) at threshold t
Then the expected cost (EC) for prior p = P(y=1) is:
EC(t; p) = C_FN * p * FNR_t + C_FP * (1 - p) * FPR_t
That's it. Two error rates weighted by class prevalence and the cost of each type of error.
Interpretation: Think of p * C_FN as the total "risk mass" assigned to positive-class errors, and (1-p) * C_FP to negative-class errors. The classifier splits those masses according to its FNR and FPR.
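To make the formula concrete, here's a minimal sketch in Python (the function name and the numbers are illustrative, not from any particular library):

```python
def expected_cost(fnr, fpr, p, cost_fn, cost_fp):
    """EC(t; p) = C_FN * p * FNR_t + C_FP * (1 - p) * FPR_t."""
    return cost_fn * p * fnr + cost_fp * (1 - p) * fpr

# Rare positives (p = 0.05) with misses 100x more expensive than false alarms.
ec = expected_cost(fnr=0.10, fpr=0.20, p=0.05, cost_fn=100.0, cost_fp=1.0)
print(round(ec, 4))  # 0.69: the 10% miss rate contributes 0.5, false alarms only 0.19
```

Even with only 5% positives, the expensive misses dominate the total — exactly the "risk mass" intuition above.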
Bayes decision rule (aka pick the threshold like a grown-up)
For a probabilistic classifier that gives p_hat = P(y=1 | x), compare the expected costs of predicting 1 vs predicting 0 for this single example:
- If you predict 1: expected cost = C_FP * (1 - p_hat)
- If you predict 0: expected cost = C_FN * p_hat
Predict 1 when:
C_FP * (1 - p_hat) <= C_FN * p_hat
Rearrange:
p_hat >= C_FP / (C_FP + C_FN)
So the optimal threshold (for this cost pair) is t* = C_FP / (C_FP + C_FN).
Nice consequences:
- It depends on the ratio of costs, not their absolute scale.
- If C_FP = C_FN, threshold = 0.5 (as you'd expect).
- If false negatives are very expensive (C_FN >> C_FP), threshold gets small — be generous calling positives.
Key point: this neat thresholding requires well-calibrated probabilities. Garbage probabilities → garbage decisions.
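The rule itself is one line of code (a sketch; the function name is mine, and it assumes calibrated probabilities as just noted):

```python
def bayes_threshold(cost_fp, cost_fn):
    """Bayes-optimal threshold on calibrated P(y=1|x): t* = C_FP / (C_FP + C_FN)."""
    return cost_fp / (cost_fp + cost_fn)

print(bayes_threshold(1.0, 1.0))  # 0.5: symmetric costs
print(bayes_threshold(1.0, 9.0))  # 0.1: expensive misses lower the bar for calling positives
```

Note that scaling both costs by the same factor leaves the threshold unchanged, matching the "ratio, not absolute scale" point above.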
Cost Curves (Drummond & Holte style) — visualize all operating points
A big pain: real-world costs and class prevalence vary. You might deploy the same model in two countries (different p) or suddenly the cost of an FP spikes (regulation). Instead of committing to one (p, costs) pair, we can look at performance across the whole spectrum.
Construct two transformations:
- Probability–Cost Function (PCF):
PCF = (p * C_FN) / (p * C_FN + (1 - p) * C_FP)
This compresses class prior and costs into a single axis variable between 0 and 1. Intuitively, PCF is the relative weight placed on positive-class errors.
- Normalized Expected Cost (NEC):
NEC(t; PCF) = FNR_t * PCF + FPR_t * (1 - PCF)
Now plot NEC on the y-axis vs PCF on the x-axis for your classifier (often you do this for a family of thresholds, forming a piecewise-linear curve). Each point tells you the normalized expected cost for that operating point (a blend of prevalence and cost ratio).
Why normalized? NEC avoids absolute cost scales so curves from different datasets or cost-schemes are comparable.
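Both transformations are one-liners in code (a sketch; the function names are mine):

```python
def pcf(p, cost_fn, cost_fp):
    """Collapse class prior and costs into one axis variable in [0, 1]."""
    return (p * cost_fn) / (p * cost_fn + (1 - p) * cost_fp)

def nec(fnr, fpr, pcf_value):
    """Normalized expected cost at one operating point."""
    return fnr * pcf_value + fpr * (1 - pcf_value)

x = pcf(p=0.5, cost_fn=4.0, cost_fp=1.0)  # positives carry 80% of the risk mass
print(round(nec(fnr=0.10, fpr=0.30, pcf_value=x), 4))  # 0.14
```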
How to read a cost curve (the meme version)
- If classifier A's curve lies below B's for a range of PCF, A dominates there — lower normalized expected cost for those cost/prior mixes.
- The lower envelope of these lines (the cost-space twin of the ROC convex hull) tells you the best achievable cost if you can change thresholds post-hoc.
- Crossing curves = pick-your-poison: one classifier better when false negatives costly, the other when false positives costly.
Question to ask yourself: "What PCF region is my deployment in?" If you care about very rare positives and huge cost of missing them (medical screening), you're in a corner of the x-axis and you can pick accordingly.
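The "lower curve wins" reading can be checked numerically. This sketch takes a few hypothetical (FNR, FPR) operating points from one classifier and computes the best achievable NEC at every PCF, i.e. the envelope you would trace out by retuning the threshold:

```python
import numpy as np

# Hypothetical (FNR, FPR) pairs for three thresholds of the same classifier.
operating_points = [(0.40, 0.05), (0.20, 0.15), (0.05, 0.45)]
pcf_grid = np.linspace(0.0, 1.0, 101)

# Each operating point is a straight line: NEC(PCF) = FNR*PCF + FPR*(1 - PCF).
lines = np.array([fnr * pcf_grid + fpr * (1 - pcf_grid)
                  for fnr, fpr in operating_points])
envelope = lines.min(axis=0)  # best NEC per PCF if thresholds are adjustable

# At PCF=0 only FPR matters; at PCF=1 only FNR matters.
print(float(envelope[0]), float(envelope[-1]))
```

Notice how different thresholds win in different PCF regions — that is the crossing-curves story in miniature.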
From theory to practice: how to compute expected cost (pseudocode)
```python
import numpy as np

# Given: y_true, p_hat (arrays), cost_fp, cost_fn, thresholds T, priors p_grid
EC = {}
n_neg = np.sum(y_true == 0)
n_pos = np.sum(y_true == 1)
for t in T:
    pred = p_hat >= t
    FPR = np.sum(pred & (y_true == 0)) / n_neg
    FNR = np.sum(~pred & (y_true == 1)) / n_pos
    for p in p_grid:
        EC[t, p] = cost_fn * p * FNR + cost_fp * (1 - p) * FPR
# Or transform p and costs to PCF and compute normalized expected cost
```
(Use cross-validation or a separate validation set to estimate FPR/FNR — do not cheat with test labels when picking thresholds.)
Practical tips and trade-offs
- Calibration matters. If your probabilities are miscalibrated, thresholds from Bayes rule will be wrong. Use Platt scaling / isotonic regression.
- AUC is not enough. AUC summarizes ranking, but cost curves capture where ranking errors actually cost you. Two models with similar AUC can have very different expected costs in realistic PCF ranges.
- If you know costs, optimize them directly. If you can assign monetary utility, pick the threshold that maximizes expected utility on validation data (or train cost-sensitive models).
- When costs are uncertain, use cost curves. They show robustness across assumptions.
- Don't forget that class priors shift. Even if costs are fixed, deployment prevalence p can move; cost curves let you see the sensitivity.
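Putting the last few tips together, here's a sketch of picking the cheapest threshold on held-out validation data. The data is synthetic and the helper name is mine; the point is the pattern, not the numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic validation set: scores roughly separate the two classes.
y_val = rng.integers(0, 2, size=1000)
p_val = np.clip(0.5 * y_val + rng.normal(0.25, 0.2, size=1000), 0.0, 1.0)

def empirical_cost(t, y, p_hat, cost_fp, cost_fn):
    """Total realized cost on labeled data at threshold t."""
    pred = p_hat >= t
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    return cost_fp * fp + cost_fn * fn

# Sweep thresholds; misses cost 5x false alarms.
thresholds = np.linspace(0.01, 0.99, 99)
costs = [empirical_cost(t, y_val, p_val, cost_fp=1.0, cost_fn=5.0) for t in thresholds]
best_t = float(thresholds[int(np.argmin(costs))])
print(best_t)
```

With misses five times as expensive, the chosen threshold should land below 0.5, in the spirit of the Bayes rule earlier — and since it was tuned on validation labels, the test set stays untouched.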
Quick comparison table
| Concept | What it shows | When to use |
|---|---|---|
| AUC-ROC / AUC-PR | Ranking performance across thresholds | General model selection; ranking-heavy tasks |
| Precision–Recall curves | Behavior on positive class (sensitive to class imbalance) | Rare positive detection |
| Cost curves / NEC | Expected (normalized) cost over all cost/prior mixes | When costs/priors matter or vary |
Final flourish — key takeaways
- Expected cost = weighted sum of FPR and FNR; weights come from class prior and misclassification costs.
- With calibrated probabilities, the Bayes optimal threshold is t* = C_FP / (C_FP + C_FN).
- Cost curves compress prior+cost into a PCF axis and let you visualize performance across operating conditions — use them when costs or prevalence are uncertain.
- Calibration + cost-sensitive thinking = decisions that actually improve utility, not just metrics.
Parting thought: metrics tell you how your model behaves; cost curves tell you how much its misbehavior will hurt. Optimize the latter if you care about consequences — which you should.