Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Classification II — Accuracy, Precision, Recall, F1
You already know how logistic regression gives you probabilities and how a confusion matrix lays out our sins and virtues. Now let's stop pretending a single number tells the whole truth.
Hook: The Diet Coke Thought Experiment
Imagine your classifier is a friend who promises you snacks. It says, with 80% confidence, "I will bring you a Diet Coke." Sometimes it does, sometimes it brings you chips, sometimes nothing. You must decide: do you trust that 80% and wait, or do you go buy your own drinks? Metrics are the baked-in disappointment detector and the trust-meter rolled into one.
You learned earlier that logistic regression gives probabilistic outputs. Those probabilities need to be turned into decisions. How good are those decisions? That's where accuracy, precision, recall, and F1 live.
Quick refresher (no rehashing the entire confusion-matrix lecture)
Here's the minimal reminder — the confusion matrix counts:
| Predicted \ Actual | Positive (P) | Negative (N) |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
You saw this in Classification II: Confusion Matrix Anatomy. Now let's build the metrics from those four counts.
The Metrics: formulas and plain English
- Accuracy: fraction correct.
accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision (also called positive predictive value): "Of the things we predicted positive, how many actually were positive?"
precision = TP / (TP + FP)
- Recall (a.k.a. sensitivity, true positive rate): "Of all actual positives, how many did we catch?"
recall = TP / (TP + FN)
- F1 score: harmonic mean of precision and recall — punishes lopsidedness.
F1 = 2 * (precision * recall) / (precision + recall)
Why harmonic mean? Because if precision is 0.99 and recall is 0.01, the arithmetic mean would lie to you. The harmonic mean forces the score to reflect the bottleneck.
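To make the formulas concrete, here is a minimal Python sketch that computes all four from the confusion-matrix counts (the function name and the guards against empty denominators are mine, not from any particular library):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```

The same arithmetic shows up again in the worked example below.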
When accuracy lies (and why it will seduce you)
Imagine a fraud detection dataset where only 1% of transactions are fraudulent. A model that always predicts "not fraud" gets 99% accuracy. Congratulations, it's a master at being useless.
Accuracy is a blunt instrument when classes are imbalanced. It rewards majority-class guessing and hides the model's inability to find the rare, important class.
Ask yourself: what do I actually care about? False alarms or missed alarms? That determines whether you look at precision or recall.
Precision vs Recall: the legal drama
- High precision, low recall = you only shout 'guilty' when you're sure. Few false positives, many false negatives. Useful when false alarms are costly (e.g., banning users wrongly).
- High recall, low precision = you shout 'guilty' often, capturing more real criminals but also scaring many innocents. Useful when missing a positive is costly (e.g., disease screening).
Think of it as a courtroom: precision is the prosecutor's accuracy when they accuse; recall is the court's ability to find all guilty people.
Thresholding — where the rubber meets the road
Logistic regression gives you P(y=1|x). To turn probability into a label you pick a threshold t (default 0.5). But this is your lever.
- Lower t => more positives => recall goes up, precision may go down.
- Higher t => fewer positives => precision goes up, recall may go down.
This trade-off is continuous: sweep t from 0 to 1 and trace precision and recall. That's the precision-recall curve. If classes are imbalanced, PR curves give more informative views than ROC curves.
Quick code sketch (a runnable version of the pseudocode, assuming scikit-learn; `probs` and `y_true` stand in for your predicted probabilities and true labels):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

thresholds = np.linspace(0, 1, 100)
preds = [(probs >= t).astype(int) for t in thresholds]
precisions = [precision_score(y_true, p, zero_division=0) for p in preds]
recalls = [recall_score(y_true, p) for p in preds]
# plot recalls (x) against precisions (y) to trace the precision-recall curve
```
Question: what threshold gives you the best F1? Often you choose the t that maximizes F1 on validation data. But remember: F1 implicitly treats precision and recall as equally important — which isn't always your world.
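Here is one way to do that search on a validation set, sketched with scikit-learn (`y_val` and `val_probs` are placeholder names for your validation labels and predicted probabilities):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thr = precision_recall_curve(y_val, val_probs)
f1 = 2 * prec * rec / (prec + rec + 1e-12)  # small epsilon avoids 0/0
best_t = thr[np.argmax(f1[:-1])]            # last (prec, rec) pair has no threshold
```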
Calibration: probabilities you can trust
Calibration answers: when you say 0.8, is it actually 80%? Logistic regression is often fairly well-calibrated, but not always — especially in high-dimensional sparse settings or if you've overfit (remember Classification I: Sparse High-Dimensional and Overfitting issues). Overfitting ruins calibration: probabilities become overconfident.
Calibration matters because thresholding decisions assume the predicted probabilities are meaningful. If your model is miscalibrated, picking t based on those probabilities is like trusting a broken thermostat.
Tools:
- Reliability diagrams / calibration plots
- Platt scaling or isotonic regression to recalibrate (see the sketch below)
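A minimal scikit-learn sketch of both tools, assuming you have validation labels `y_val`, predicted probabilities `probs`, and training data `X_train`, `y_train` (all placeholder names):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.linear_model import LogisticRegression

# Reliability diagram data: observed frequency vs. mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_val, probs, n_bins=10)

# Recalibration: method="sigmoid" is Platt scaling, method="isotonic" is isotonic regression
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
calibrated_probs = calibrated.predict_proba(X_val)[:, 1]
```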
Worked example (small, satisfying math)
Suppose 1000 samples, 50 positives (disease). Your model at threshold 0.5 yields: TP=30, FN=20, FP=40, TN=910.
Compute:
- Accuracy = (30 + 910) / 1000 = 0.94
- Precision = 30 / (30 + 40) = 0.4286
- Recall = 30 / (30 + 20) = 0.6
- F1 = 2 * 0.4286 * 0.6 / (0.4286 + 0.6) ≈ 0.5
So 94% accuracy sounds great; a precision of about 43% tells another story. If positive means disease, more than half of the people flagged as sick are actually healthy, which may be unacceptable.
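If you want to sanity-check that arithmetic with scikit-learn, you can rebuild label arrays matching those counts (purely illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 30 TP, 20 FN, 40 FP, 910 TN
y_true = np.array([1] * 30 + [1] * 20 + [0] * 40 + [0] * 910)
y_pred = np.array([1] * 30 + [0] * 20 + [1] * 40 + [0] * 910)

print(accuracy_score(y_true, y_pred))   # 0.94
print(precision_score(y_true, y_pred))  # ~0.4286
print(recall_score(y_true, y_pred))     # 0.6
print(f1_score(y_true, y_pred))         # 0.5
```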
Practical checklist (what to do in your model pipeline)
- Always inspect confusion matrix and class balance before worshiping accuracy.
- Choose metrics based on cost of FP vs FN (precision vs recall).
- Sweep thresholds and plot precision-recall curves; choose threshold on validation set, not test.
- Check calibration. If probabilities are off, recalibrate before thresholding.
- Use F1 when you care about a balance; use F-beta if you prefer recall (beta>1) or precision (beta<1).
- For imbalanced data, prefer PR curves and average precision over ROC AUC (see the sketch after this list).
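For the last two items, the relevant scikit-learn calls look roughly like this (`y_true`, `y_pred`, and `probs` are placeholders):

```python
from sklearn.metrics import fbeta_score, average_precision_score

f2 = fbeta_score(y_true, y_pred, beta=2)        # beta > 1 weights recall more heavily
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # beta < 1 favors precision
ap = average_precision_score(y_true, probs)     # PR-curve summary (average precision)
```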
Expert take: metrics are not properties of models alone; they are dialogues between your model, your threshold, your data distribution, and the business cost function. Treat them like conversation, not commandments.
Closing — the tiny, dramatic moral
Metrics are maps, not territories. Accuracy is a lazy map that shows the highway but hides the cliffs. Precision and recall are the binoculars that let you spot what matters. F1 tries to be diplomatic when you're torn.
Remember your training in probabilistic modeling: make the probabilities honest (calibration), then pick a threshold that aligns with your real-world costs. If you ignore that, your 94% accuracy will crash the party — and you'll be the only one applauding.
Key takeaways:
- Don't trust accuracy on imbalanced data.
- Precision and recall answer different operational questions — pick the one that maps to your cost of mistakes.
- Thresholds and calibration change everything; validate them.
- Use F1 (or F-beta) when you want a single-scalar summary that balances concerns.
Go forth, calibrate those probabilities, and may your TP be plentiful and your FN few.