© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification
Chapters

1. Foundations of Supervised Learning
2. Data Wrangling and Feature Engineering
3. Exploratory Data Analysis for Predictive Modeling
4. Train/Validation/Test and Cross-Validation Strategies
5. Regression I: Linear Models
6. Regression II: Regularization and Advanced Techniques
7. Classification I: Logistic Regression and Probabilistic View
8. Classification II: Thresholding, Calibration, and Metrics
   • Confusion Matrix Anatomy
   • Accuracy, Precision, Recall, F1
   • ROC Curves and AUC
   • Precision–Recall Curves and AUC-PR
   • Threshold Selection Strategies
   • Cost Curves and Expected Utility
   • Probability Calibration Methods
   • Brier Score and Log Loss
   • Multiclass Metrics and Averaging
   • Ranking Metrics for Imbalanced Data
   • Top-k and Coverage Metrics
   • Macro vs Micro vs Weighted Scores
   • Cumulative Gain and Lift Charts
   • Calibration Plots and Reliability
   • Decision Curves and Net Benefit
9. Distance- and Kernel-Based Methods
10. Tree-Based Models and Ensembles
11. Handling Real-World Data Issues
12. Dimensionality Reduction and Feature Selection
13. Model Tuning, Pipelines, and Experiment Tracking
14. Model Interpretability and Responsible AI
15. Deployment, Monitoring, and Capstone Project


Classification II: Thresholding, Calibration, and Metrics


Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.


Accuracy, Precision, Recall, F1

Metrics with Sass and Substance

Classification II — Accuracy, Precision, Recall, F1

You already know how logistic regression gives you probabilities and how a confusion matrix lays out our sins and virtues. Now let's stop pretending a single number tells the whole truth.


Hook: The Diet Coke Thought Experiment

Imagine your classifier is a friend who promises you snacks. It says, with 80% confidence, "I will bring you a Diet Coke." Sometimes it does, sometimes it brings you chips, sometimes nothing. You must decide: do you trust that 80% and wait, or do you go buy your own drinks? Metrics are the baked-in disappointment detector and the trust-meter rolled into one.

You learned earlier that logistic regression gives probabilistic outputs. Those probabilities need to be turned into decisions. How good are those decisions? That's where accuracy, precision, recall, and F1 live.


Quick refresher (no rehashing the entire confusion-matrix lecture)

Here's the minimal reminder — the confusion matrix counts:

| Predicted \ Actual | Positive (P)        | Negative (N)        |
|--------------------|---------------------|---------------------|
| Predicted Positive | True Positive (TP)  | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN)  |

You saw this in Classification II: Confusion Matrix Anatomy. Every metric below is just a ratio built from these four counts.


The Metrics: formulas and plain-English

  • Accuracy: fraction correct.
accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision (also called positive predictive value): "Of the things we predicted positive, how many actually were positive?"
precision = TP / (TP + FP)
  • Recall (a.k.a. sensitivity, true positive rate): "Of all actual positives, how many did we catch?"
recall = TP / (TP + FN)
  • F1 score: harmonic mean of precision and recall — punishes lopsidedness.
F1 = 2 * (precision * recall) / (precision + recall)

Why harmonic mean? Because if precision is 0.99 and recall is 0.01, the arithmetic mean would lie to you. The harmonic mean forces the score to reflect the bottleneck.
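All four formulas fit in a few lines of Python. A minimal sketch (the helper name `classification_metrics` is ours, not from any library), with zero-denominator guards because a model that predicts no positives has undefined precision:

```python
def classification_metrics(tp, fp, fn, tn):
    """The four basic metrics, straight from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention: 0.0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Counts from the worked example later in this lesson
acc, prec, rec, f1 = classification_metrics(tp=30, fp=40, fn=20, tn=910)
print(f"acc={acc:.4f}  prec={prec:.4f}  rec={rec:.4f}  f1={f1:.4f}")
```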


When accuracy lies (and why it will seduce you)

Imagine a fraud detection dataset where only 1% of transactions are fraudulent. A model that always predicts "not fraud" gets 99% accuracy. Congratulations, it's a master at being useless.

Accuracy is a blunt instrument when classes are imbalanced. It rewards majority-class guessing and hides the model's inability to find the rare, important class.

Ask yourself: what do I actually care about? False alarms or missed alarms? That determines whether you look at precision or recall.
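The fraud scenario above is easy to reproduce: on data with ~1% positives, a constant "not fraud" predictor scores ~99% accuracy with zero recall. A quick sketch with synthetic labels (the 1% rate and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% positives (fraud)
preds = np.zeros_like(y)                     # always predict "not fraud"

accuracy = (preds == y).mean()
tp = ((preds == 1) & (y == 1)).sum()
fn = ((preds == 0) & (y == 1)).sum()
recall = tp / (tp + fn)                      # none of the positives are caught
print(f"accuracy={accuracy:.3f}  recall={recall:.3f}")
```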


Precision vs Recall: the legal drama

  • High precision, low recall = you only shout 'guilty' when you're sure. Few false positives, many false negatives. Useful when false alarms are costly (e.g., banning users wrongly).
  • High recall, low precision = you shout 'guilty' often, capturing more real criminals but also scaring many innocents. Useful when missing a positive is costly (e.g., disease screening).

Think of it as a courtroom: precision is the prosecutor's accuracy when they accuse; recall is the court's ability to find all guilty people.


Thresholding — where the rubber meets the road

Logistic regression gives you P(y=1|x). To turn probability into a label you pick a threshold t (default 0.5). But this is your lever.

  • Lower t => more positives => recall goes up, precision may go down.
  • Higher t => fewer positives => precision goes up, recall may go down.

This trade-off is continuous: sweep t from 0 to 1 and trace precision and recall. That's the precision-recall curve. If classes are imbalanced, PR curves give more informative views than ROC curves.

Quick code sketch (assuming `probs` holds the model's scores and `y` the 0/1 labels):

import numpy as np

precisions, recalls = [], []
for t in np.linspace(0, 1, 100):
    preds = (probs >= t).astype(int)
    tp = ((preds == 1) & (y == 1)).sum()
    precisions.append(tp / max(preds.sum(), 1))  # guard the 0/0 case
    recalls.append(tp / (y == 1).sum())
# plot recall (x) vs precision (y) to trace the PR curve

Question: what threshold gives you the best F1? Often you choose the t that maximizes F1 on validation data. But remember: F1 implicitly treats precision and recall as equally important — which isn't always your world.
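Picking the F1-maximizing threshold can be sketched as a brute-force sweep; `best_f1_threshold` is our name, and the `probs`/`y` at the bottom are a toy stand-in for your validation scores and labels:

```python
import numpy as np

def best_f1_threshold(probs, y):
    """Sweep thresholds and return the one maximizing F1 (with its score)."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.01, 0.99, 99):
        preds = (probs >= t).astype(int)
        tp = ((preds == 1) & (y == 1)).sum()
        fp = ((preds == 1) & (y == 0)).sum()
        fn = ((preds == 0) & (y == 1)).sum()
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)  # F1 = 2TP / (2TP + FP + FN)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy check: well-separated scores admit a threshold between the two classes
probs = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
t, f1 = best_f1_threshold(probs, y)
```

Run this on validation data, not test data, exactly as the checklist below insists.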


Calibration: probabilities you can trust

Calibration answers: when you say 0.8, is it actually 80%? Logistic regression is often fairly well-calibrated, but not always — especially in high-dimensional sparse settings or if you've overfit (remember Classification I: Sparse High-Dimensional and Overfitting issues). Overfitting ruins calibration: probabilities become overconfident.

Calibration matters because thresholding decisions assume the predicted probabilities are meaningful. If your model is miscalibrated, picking t based on those probabilities is like trusting a broken thermostat.

Tools:

  • Reliability diagrams / calibration plots
  • Platt scaling, isotonic regression to recalibrate
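A reliability diagram boils down to a binning computation: group predictions into probability bins and compare each bin's mean predicted probability with its observed positive rate; a calibrated model makes the two columns track each other. A sketch (the helper `reliability_table` is ours, and the toy data is perfectly calibrated by construction):

```python
import numpy as np

def reliability_table(probs, y, n_bins=10):
    """Per bin: (mean predicted probability, observed positive rate)."""
    edges = np.linspace(0, 1, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi) if hi < 1 else (probs >= lo)
        if mask.any():
            rows.append((probs[mask].mean(), y[mask].mean()))
    return rows

# Perfectly calibrated toy data: P(y=1 | p) equals the predicted p
rng = np.random.default_rng(42)
probs = rng.random(50_000)
y = (rng.random(50_000) < probs).astype(int)
table = reliability_table(probs, y)
```

A miscalibrated (e.g., overconfident) model would show the observed rate pulled toward 0.5 relative to the predictions.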

Worked example (small, satisfying math)

Suppose 1000 samples, 50 positives (disease). Your model at threshold 0.5 yields: TP=30, FN=20, FP=40, TN=910.

Compute:

  • Accuracy = (30 + 910) / 1000 = 0.94
  • Precision = 30 / (30 + 40) = 0.4286
  • Recall = 30 / (30 + 20) = 0.6
  • F1 = 2 * 0.4286 * 0.6 / (0.4286 + 0.6) ≈ 0.5

So 94% accuracy sounds great; a precision of about 43% tells another story. If the positive class is disease, it means more than half of the people flagged as sick are actually fine — that may be unacceptable.


Practical checklist (what to do in your model pipeline)

  1. Always inspect confusion matrix and class balance before worshiping accuracy.
  2. Choose metrics based on cost of FP vs FN (precision vs recall).
  3. Sweep thresholds and plot precision-recall curves; choose threshold on validation set, not test.
  4. Check calibration. If probabilities are off, recalibrate before thresholding.
  5. Use F1 when you care about a balance; use F-beta if you prefer recall (beta>1) or precision (beta<1).
  6. For imbalanced data prefer PR curves and average precision over ROC AUC.
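Checklist item 5 deserves its formula: F-beta generalizes F1 by weighting recall beta times as much as precision, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). A minimal sketch:

```python
def f_beta(precision, recall, beta):
    """F-beta: beta > 1 favors recall, beta < 1 favors precision, beta = 1 is F1."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# With the worked example's precision (~0.43) and recall (0.6),
# F2 rewards the higher recall while F0.5 penalizes the low precision.
```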

Expert take: metrics are not properties of models alone; they are dialogues between your model, your threshold, your data distribution, and the business cost function. Treat them like conversation, not commandments.

Closing — the tiny, dramatic moral

Metrics are maps, not territories. Accuracy is a lazy map that shows the highway but hides the cliffs. Precision and recall are the binoculars that let you spot what matters. F1 tries to be diplomatic when you're torn.

Remember your training in probabilistic modeling: make the probabilities honest (calibration), then pick a threshold that aligns with your real-world costs. If you ignore that, your 94% accuracy will crash the party — and you'll be the only one applauding.

Key takeaways:

  • Don't trust accuracy on imbalanced data.
  • Precision and recall answer different operational questions — pick the one that maps to your cost of mistakes.
  • Thresholds and calibration change everything; validate them.
  • Use F1 (or F-beta) when you want a single-scalar summary that balances concerns.

Go forth, calibrate those probabilities, and may your TP be plentiful and your FN few.
