Machine Learning with scikit-learn
Build, tune, and evaluate models using scikit-learn pipelines with reproducible ML workflows.
Classification Metrics in scikit-learn — The Scoreboard That Actually Means Something
Remember when we split data and tuned pipelines? Good — because now we evaluate whether our model is a genius or a confused toaster. This is about classification metrics: the tools that translate predictions into decisions, business impact, and occasionally panic.
"A model's accuracy is its confidence; metrics are its conscience."
What this is (and why you care)
Classification metrics measure how well a model maps inputs to discrete labels. They matter because a single number (accuracy) often lies: in imbalanced datasets, trivial strategies (always guessing the majority class) can look great but be useless. You've seen data splits and CV strategies — now use the right metric when comparing models across cross-validation folds and pipelines.
Think back to Statistics and Probability for Data Science: false positives and false negatives are the practical offspring of Type I and Type II errors. Metrics turn those probabilities into numbers you can act on.
The essentials — quick tour
Confusion matrix (the origin story)
- True Positive (TP): predicted positive, actually positive
- False Positive (FP): predicted positive, actually negative (Type I)
- False Negative (FN): predicted negative, actually positive (Type II)
- True Negative (TN): predicted negative, actually negative
Micro explanation: the confusion matrix is just a 2x2 scoreboard (for binary problems; K classes give a KxK matrix). Everything else is derived from it.
Code (scikit-learn):
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
print(cm)
Accuracy
- (TP + TN) / total
- When it's useful: balanced classes, similar costs for mistakes.
- When it lies: class imbalance.
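To see accuracy lie in action, here is a minimal sketch; the 95/5 class split and the always-predict-majority "model" are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95% negative, 5% positive.
y_true = np.array([0] * 95 + [1] * 5)
# A "model" that always guesses the majority class.
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero positives
```

Ninety-five percent accuracy, zero usefulness: exactly the trap described above.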
Precision and Recall (and the social dance between them)
- Precision = TP / (TP + FP)
- "When I predict positive, how often am I right?"
- Recall (Sensitivity) = TP / (TP + FN)
- "Of all positives, how many did I catch?"
Micro explanation: Precision is about trusting positives; recall is about finding positives. In medical tests, recall is often king (missing a sick person hurts).
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
F1 score — the compromise
- Harmonic mean of precision and recall.
- Use when you want a single number and both precision & recall matter.
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
Specificity (True Negative Rate)
- TN / (TN + FP)
- Useful when negatives are important to detect correctly (e.g., avoiding false alarms).
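scikit-learn has no dedicated specificity function, but it falls straight out of the confusion matrix. A quick sketch (the toy labels are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels, illustrative only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print(specificity)  # 0.8
```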
Balanced accuracy
- Mean of recall (sensitivity) and specificity — handy for imbalanced datasets.
Matthews Correlation Coefficient (MCC)
- A correlation coefficient between true and predicted labels — robust under imbalance. Ranges from -1 (total disagreement) through 0 (no better than random) to 1 (perfect).
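Both metrics have direct scikit-learn functions. A quick sketch using the same kind of toy labels (made up for illustration):

```python
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

# Toy binary labels, illustrative only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

# Balanced accuracy = mean of recall (2/3 here) and specificity (0.8 here).
print(balanced_accuracy_score(y_true, y_pred))  # ~0.733
# MCC stays informative even when one class dominates.
print(matthews_corrcoef(y_true, y_pred))        # ~0.467
```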
Ranking metrics: when probabilities matter
Many models output probabilities rather than hard labels. Moving the decision threshold trades false positives against false negatives, so ranking metrics evaluate the probability ordering itself, not a single thresholded prediction.
ROC AUC (Area Under ROC Curve)
- Plots True Positive Rate (recall) vs False Positive Rate across thresholds.
- An AUC of 0.5 means random guessing and 1.0 means perfect ranking (values below 0.5 indicate worse-than-random ordering).
- Use when: class balance or costs unclear, and you care about ranking.
from sklearn.metrics import roc_auc_score, roc_curve
y_score = model.predict_proba(X)[:, 1]  # probability of the positive class
auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points for plotting the curve
Precision-Recall AUC (PR AUC)
- More informative than ROC AUC for heavily imbalanced datasets where positives are rare.
- Focuses on precision vs recall trade-off.
from sklearn.metrics import average_precision_score, precision_recall_curve
ap = average_precision_score(y_true, y_score)
precision, recall, thresholds = precision_recall_curve(y_true, y_score)  # points for plotting
Micro explanation: ROC cares about FPR which can be tiny in imbalanced datasets and hide poor precision; PR AUC zeroes in on the positive class performance.
Thresholds, calibration, and business decisions
The default threshold of 0.5 is arbitrary. Change it to manage FP vs FN costs.
- Use precision-recall curves to pick a threshold that yields acceptable precision at desired recall.
- Check calibration: does predicted probability match observed frequency? Use calibration_curve or CalibratedClassifierCV.
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
Practical tip: For production, bake thresholding into a pipeline step or use predict_proba -> custom decision rule. Keep thresholds tuned using cross-validation — remember CV strategies from earlier to avoid leakage.
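The predict_proba -> custom decision rule idea can be sketched as follows; the synthetic data (make_classification with a 90/10 split) and the 0.9 recall target are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 90% negatives.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_score = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, y_score)
# thresholds has one fewer entry than precision/recall; drop the last
# recall value so the arrays align, then keep thresholds meeting the target.
ok = recall[:-1] >= 0.9
# Highest threshold that still achieves 0.9 recall (best precision among them).
threshold = thresholds[ok][-1] if ok.any() else 0.5
y_pred = (y_score >= threshold).astype(int)
```

In practice you would pick the threshold on validation folds, then apply it as a fixed decision rule on new data.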
Practical workflow with scikit-learn (short recipe)
- Train using pipelines (preprocessing + estimator). You already learned how to build those. Pipeline keeps transformations consistent.
- Use cross_val_score or cross_validate with scoring that matches your objective (e.g., 'average_precision', 'f1', 'roc_auc').
- Inspect confusion matrix on held-out test set. Visualize ROC and PR curves.
- If probabilities are important, check calibration and consider threshold tuning with cross-validation.
Example: cross-validated PR AUC
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring='average_precision')
print(scores.mean())
Edge cases & rules of thumb
- Imbalanced data? Favor precision/recall or PR AUC over raw accuracy.
- Business prefers minimizing false alarms? Maximize precision or specificity.
- Safety-critical systems (e.g., medical): prioritize recall/sensitivity, and always report uncertainty and calibration.
- Use MCC or balanced accuracy when you want a single robust metric for imbalanced problems.
Closing — key takeaways
- Metric choice matters more than model choice when class distribution and costs are misaligned.
- Always connect metrics to business costs: FP vs FN is not abstract — it's dollars, safety, or reputation.
- Use ROC AUC for balanced ranking tasks; use PR AUC for imbalanced positive-focused tasks.
- Combine metrics with calibration and threshold tuning; use CV strategies to avoid overfitting your metric.
Final insight: metrics are your translation layer between statistical intuition (you got that from the stats module) and real-world decisions. Train models, but evaluate them like human consequences depend on it.
Summary checklist
- Confusion matrix: compute and inspect
- Pick metric aligned to business goal
- Use PR AUC for rare positives
- Tune threshold with CV
- Check calibration before trusting probabilities
Happy evaluating. Make metrics your friend — not just a number that looks pretty on a slide.