Classification II: Thresholding, Calibration, and Metrics
Make cost-aware decisions by selecting thresholds, calibrating probabilities, and using the right metrics.
Confusion Matrix Anatomy — The Diagnostic Table That Tells You If Your Model Is a Hero or a Villain
"Probabilities are great. But when you pull the emergency brake and pick a threshold, everything either lives or dies — welcome to the confusion matrix."
You already learned how logistic regression hands you probabilities (see: Classification I). Nice, soft numbers between 0 and 1. But real-world decisions usually want actions: accept, reject, alarm, send to manual review. The confusion matrix is the little table that records what happens when probabilistic models make decisions — and it’s the backbone of nearly every classification metric you’ll meet.
What is a confusion matrix? Quick sketch
A confusion matrix is a 2×2 contingency table for binary classification that counts outcomes when predictions meet reality.
| Actual \ Predicted | Positive (1) | Negative (0) |
|---|---|---|
| Positive (1) | True Positive (TP) | False Negative (FN) |
| Negative (0) | False Positive (FP) | True Negative (TN) |
- TP: Model said "yes" and reality was "yes".
- TN: Model said "no" and reality was "no".
- FP: Model cried wolf — predicted "yes" but it was "no".
- FN: Model missed it — predicted "no" but it was actually "yes".
Imagine a COVID test or a fraud detector. A false negative might be someone infected who gets sent home; a false positive might be an innocent person flagged for quarantine.
Why the confusion matrix matters (more than accuracy alone)
If your dataset has 99% negatives and your model predicts "negative" always, you get 99% accuracy. Party time? No. This is the accuracy paradox. The confusion matrix reveals the imbalance hidden by accuracy.
Consider these derived quantities (the usual suspects):
- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Precision (Positive Predictive Value) = TP / (TP + FP)
- Recall (Sensitivity, True Positive Rate) = TP / (TP + FN)
- Specificity (True Negative Rate) = TN / (TN + FP)
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Each of these is just a function of the four cells. Think of them as different drugs you can take depending on whether you want to be conservative (precision) or inclusive (recall).
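The formulas above can be sketched directly in Python. The counts here are illustrative round numbers, not from any real model:

```python
# Four cells of a confusion matrix (illustrative counts)
TP, FP, TN, FN = 30, 10, 50, 10

accuracy = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)      # how trustworthy a "yes" is
recall = TP / (TP + FN)         # how many real positives we catch
specificity = TN / (TN + FP)    # how many real negatives we keep
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

Notice that every metric is a ratio of sums of the same four integers; nothing more exotic is happening under the hood.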
Anatomy walk-through — cell by cell (with analogies)
- TP — The model correctly detects a fraud. Gold star. But maximizing TP alone isn't the goal: you can inflate TP by flagging everything.
- TN — Correctly ignores a normal transaction. Silent victory.
- FP — False alarm. It's the "crying wolf" tax. Cost depends on context: UX annoyance vs. expensive manual review.
- FN — The true silent killer. Missed fraud, missed disease. Often the most costly.
Ask yourself: which is worse in your problem — FP or FN? That drives thresholds and metrics.
From probabilities to cells: thresholding
Your logistic regression gives p(y=1 | x). To make the confusion matrix you pick a threshold t and predict positive if p >= t.
In Python (scikit-learn does the counting for you):

```python
from sklearn.metrics import confusion_matrix

# probs: NumPy array of model probabilities; y_true: array of true labels (0/1)
threshold = 0.5  # arbitrary default; tune it to your costs
y_pred = (probs >= threshold).astype(int)
confusion = confusion_matrix(y_true, y_pred)
# scikit-learn returns [[TN, FP], [FN, TP]] for labels ordered [0, 1]
```
Vary t from 0 to 1 and the (TP, FP, TN, FN) move like tectonic plates — that's the essence of ROC and Precision-Recall curves.
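You can watch that movement on a toy example. The scores and labels below are made up for illustration; the point is how raising the threshold trades false positives for false negatives:

```python
import numpy as np

# Toy labels and model scores (illustrative, not from a trained model)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs  = np.array([0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9])

for t in (0.3, 0.5, 0.7):
    y_pred = (probs >= t).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    print(f"t={t}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

As t rises, FP falls and FN rises; each threshold is one point on the ROC or Precision-Recall curve.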
Prevalence matters — class imbalance and base rates
Prevalence (also called base rate) = (TP + FN) / N = proportion of positives in population.
High/low prevalence changes interpretation:
- With very low prevalence, precision tends to suffer even when recall is decent: the negatives vastly outnumber the positives, so even a small false-positive rate on the negatives produces a flood of false alarms.
- Balanced accuracy or Matthews Correlation Coefficient (MCC) are useful when classes are imbalanced because they combine sensitivity and specificity in a more symmetric way.
Quick formulas:
- Balanced Accuracy = (Sensitivity + Specificity) / 2
- MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
MCC ranges from -1 (total disagreement) to +1 (perfect prediction). Use it when you want a single number that respects imbalance.
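Both formulas are a few lines of Python. The counts below are illustrative:

```python
import math

# Four cells of a confusion matrix (illustrative counts)
TP, FP, TN, FN = 40, 200, 750, 10

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
balanced_accuracy = (sensitivity + specificity) / 2

mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)
print(f"balanced_accuracy={balanced_accuracy:.3f} mcc={mcc:.3f}")
```

Here plain accuracy would be 79%, but MCC comes out around 0.30, a much more sober verdict on an imbalanced problem.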
Example confusions: a tiny table that tells stories
Suppose N=1000, prevalence 5% (50 positives). Two models:
Model A: Always predict negative -> TP=0, FN=50, TN=950, FP=0 -> Accuracy = 95% (still terrible at the job)
Model B: Catches 40 of the 50 positives but also flags 200 negatives as positive -> TP=40, FN=10, FP=200, TN=750 ->
- Precision = 40 / (40+200) = 0.167
- Recall = 40 / 50 = 0.8
Which is better? Depends on cost. If missing a positive is catastrophic, Model B (high recall) is better despite low precision. If every false alarm costs money, maybe not.
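One way to settle "which is better?" is to put explicit prices on each error type. The cost figures below are made up for illustration; plug in your own:

```python
# Hypothetical costs: a missed positive (FN) costs 100, a false alarm (FP) costs 5
cost_fn, cost_fp = 100, 5

model_a = {"FN": 50, "FP": 0}     # always predicts negative
model_b = {"FN": 10, "FP": 200}   # catches 40 of 50 positives

for name, m in (("A", model_a), ("B", model_b)):
    total = m["FN"] * cost_fn + m["FP"] * cost_fp
    print(f"Model {name}: expected cost = {total}")
```

Under these costs Model B is far cheaper (2000 vs 5000); flip the cost ratio and the verdict flips with it.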
Calibration meets the confusion matrix
You learned about models that output probabilities (Classification I). Calibration asks: do those probabilities mean anything? If a model outputs 0.8 for a hundred cases, do ~80 of them truly belong to the positive class?
A calibrated model makes thresholding meaningful: you can set a threshold using expected costs. An uncalibrated model might rank examples OK (so ROC AUC is decent) but its absolute probabilities are misleading for decision thresholds.
Practical note: calibrate when you need reliable probabilities (Platt scaling, isotonic regression). If you just need ranking, calibration isn't strictly necessary.
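A minimal sketch of calibration with scikit-learn, on synthetic imbalanced data (isotonic regression is one of the methods mentioned above; the dataset and parameters here are arbitrary choices for illustration):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (~10% positives)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the base classifier so its predict_proba outputs are recalibrated
base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)[:, 1]
print("mean predicted positive rate:", probs.mean())
print("actual positive rate:", y_test.mean())
```

If calibration worked, the mean predicted probability should sit close to the actual positive rate, which is exactly what makes cost-based thresholds trustworthy.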
Quick decision guide: what metric to use
- You care about catching positives (e.g., disease screening): Maximize recall; monitor precision; use F1 if you want balance.
- You care about avoiding false alarms (e.g., expensive manual review): Maximize precision; track recall.
- Data heavily imbalanced: Use AUC-ROC for ranking; PR-AUC or MCC or Balanced Accuracy for thresholds.
Closing: The moral of the matrix
The confusion matrix is small but mercilessly honest. It forces you to name the costs of errors and to choose thresholds intentionally. Probabilities from logistic regression are powerful — but turning them into actions without consulting the matrix is like sending troops into battle with a coin flip.
"Metrics are not just numbers — they're moral choices in disguise. Which errors are you willing to live with?"
Key takeaways:
- Build the confusion matrix after you threshold probabilities; it's the source for all classification metrics.
- Accuracy can lie — always inspect TP, FP, TN, FN.
- Thresholding links probabilities to action; calibration makes those probabilities trustworthy.
- Choose metrics based on domain cost: precision vs recall vs a balanced view (F1, MCC).
Go forth: pick a threshold, compute the matrix, and ask the hard question — what happens if I get this wrong? Your model (and stakeholders) will thank you — or sue you. Either way, you'll know why.