Foundations of AI and Data Science
Core concepts, roles, workflows, and ethics that frame end‑to‑end AI projects.
Metrics and evaluation basics
Metrics & Evaluation Basics: Stop Measuring With Vibes
If problem framing was the question and data types were the alphabet, metrics are the grading rubric. And you cannot complain about your grade if you never read the rubric.
We already learned how to name the assignment (problem framing) and what language the data speaks (data types and formats). Now we ask the question every model dreads: did you actually do the thing? Welcome to metrics and evaluation — the part where we quantify success so precisely that even your most overconfident model has to sit down and reflect.
Why Metrics Matter (aka, congratulations on your 99% accuracy spam filter that lets all the spam through)
A model can be:
- technically impressive
- fast
- 22 layers deep and wearing a cape
...and still be useless if you're optimizing the wrong number. Metrics align the model with the goal you set during problem framing. Choose badly, and you'll be optimizing vibe-based nonsense. Choose wisely, and you get progress that actually matters.
Golden rule: the metric must match the decision you care about, the data type you have, and the cost of being wrong.
From Problem Type to Metric: the Translator Table
Remember our data types: numeric, categorical, text, images, timestamps, etc. The label's type largely determines the metric family.
| Task type | Primary metrics | When they lie to you |
|---|---|---|
| Binary classification | accuracy, precision, recall, F1, ROC-AUC, PR-AUC, log loss, calibration error | class imbalance makes accuracy smug; ROC can look good with rare positives; threshold choices change everything |
| Multiclass classification | accuracy, macro/micro F1, top-k accuracy, log loss | rare classes get ignored with micro averages; top-k can hide confusion |
| Regression | MAE, MSE/RMSE, R^2, MAPE/sMAPE, quantile loss | MSE overreacts to outliers; MAPE explodes when actuals are at or near zero |
| Ranking/recommendation | Precision@k, Recall@k, MAP, NDCG | ignores long-tail utility; position bias matters |
| Clustering (no labels) | silhouette, Davies–Bouldin, Calinski–Harabasz | distance metrics assume meaningful geometry; scale/feature choice change everything |
| Clustering (with labels) | ARI, NMI | sensitive to label noise; not business-aligned |
| Time series forecasting | MAE, RMSE, MAPE/sMAPE, MASE | temporal leakage ruins lives; seasonality breaks naive baselines |
| Anomaly detection | PR-AUC, ROC-AUC, Precision@k | positives are rare; PR-AUC often more honest |
Classification: Welcome to the Confusion Family Drama
First, the confusion matrix, aka receipts:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | TP | FN |
| Actual: Negative | FP | TN |
Key metrics:
- Accuracy = (TP + TN) / all. Great when classes are balanced. A chaos gremlin when not.
- Precision = TP / (TP + FP). Of predicted positives, how many were right?
- Recall = TP / (TP + FN). Of actual positives, how many did we catch?
- F1 = harmonic mean of precision and recall. Your no-fights-at-the-dinner-table compromise.
- Specificity = TN / (TN + FP). Love this for fraud/risk when false alarms are expensive.
- Balanced accuracy = (Recall + Specificity)/2. Sanity for imbalanced data.
- Log loss (cross-entropy): punishes overconfident wrongness. Great for probabilistic models.
- Calibration error (ECE): measures whether predicted probabilities match reality. If you say 0.7, does it happen 70% of the time?
Quick formulas:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * (precision * recall) / (precision + recall)
balanced_accuracy = (TPR + TNR) / 2
log_loss = - mean( y*log(p) + (1-y)*log(1-p) )
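Here is a minimal sketch of those formulas in plain NumPy, assuming y_true and y_pred are 0/1 arrays and p holds predicted probabilities (the variable names are ours, not from any particular library):
# the confusion-matrix metrics, computed from scratch
import numpy as np
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
precision = tp / (tp + fp)
recall = tp / (tp + fn)                             # aka TPR, sensitivity
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                        # aka TNR
balanced_accuracy = (recall + specificity) / 2
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))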
ROC-AUC vs PR-AUC:
- ROC-AUC: ranks positives above negatives regardless of threshold. Can look great when positives are rare because false positives barely move the needle.
- PR-AUC: focuses on the quality of your positive predictions. In imbalanced settings, this is the metric that sees through your nonsense.
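If you use scikit-learn, both areas are one call away (average_precision_score is the standard stand-in for PR-AUC); same y_true and p assumptions as above:
from sklearn.metrics import roc_auc_score, average_precision_score
roc_auc = roc_auc_score(y_true, p)           # threshold-free ranking quality
pr_auc = average_precision_score(y_true, p)  # stays honest under heavy imbalance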
Thresholds change your precision/recall. Do not report a single accuracy without saying what threshold you used. Better: tune it to your costs.
# toy threshold tuning for binary classification
# assumes y_true (0/1 labels), p (probabilities), FP_cost, FN_cost are defined
import numpy as np
costs = {t: FP_cost * np.sum((p >= t) & (y_true == 0))   # false positives at t
          + FN_cost * np.sum((p < t) & (y_true == 1))    # false negatives at t
         for t in np.linspace(0, 1, 101)}
best_t = min(costs, key=costs.get)  # the threshold that minimizes expected cost
Macro vs Micro averaging (multiclass):
- Micro: pool all predictions; large classes dominate.
- Macro: average per-class metrics; small classes get a voice.
If your minority class matters (it usually does), macro metrics tell you if the model is failing the quiet kids in the back row.
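In scikit-learn the averaging mode is just an argument; a quick sketch, assuming multiclass label arrays y_true and y_pred:
from sklearn.metrics import f1_score
f1_micro = f1_score(y_true, y_pred, average="micro")  # large classes dominate
f1_macro = f1_score(y_true, y_pred, average="macro")  # every class gets one vote
# a big micro-vs-macro gap usually means the minority classes are being failed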
Regression: Choosing Your Flavor of Regret
When labels are continuous, your metric should reflect what hurts:
- MAE: average absolute error. Robust to outliers, reads like dollars or units. Chef's kiss for interpretability.
- MSE/RMSE: squares errors, so big mistakes scream louder. Great when large errors are catastrophic.
- R^2: fraction of variance explained (but can mislead with nonlinearity or no intercept).
- MAPE/sMAPE: percent error. Beware zeros; your metric will do a backflip into infinity.
- Quantile loss: optimize medians or other quantiles for asymmetric costs.
Formulas:
MAE = mean(|y - y_hat|)
MSE = mean((y - y_hat)^2)
RMSE = sqrt(MSE)
R2 = 1 - SS_res / SS_tot
MAPE = mean(|y - y_hat| / |y|) * 100%
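The same formulas in NumPy, as a minimal sketch (assuming float arrays y and y_hat, with no zeros in y for MAPE):
import numpy as np
mae  = np.mean(np.abs(y - y_hat))
mse  = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
r2   = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
mape = np.mean(np.abs(y - y_hat) / np.abs(y)) * 100  # blows up if any y == 0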
Pro tips:
- If your stakeholders think in percentages, sMAPE or MASE might be saner than MAPE.
- If outliers are real pain (miss a forecast by 10x and the warehouse cries), RMSE makes sense.
Unsupervised and Ranking: When Labels Ghost You
Clustering (no labels):
- Silhouette: how close points are to their own cluster vs others. Needs meaningful distance; scaling matters.
- Davies–Bouldin: lower is better; measures cluster separation and compactness.
If you secretly have labels: ARI/NMI compare clustering to ground truth; still may not match business value.
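scikit-learn ships all four of these; a sketch assuming a scaled feature matrix X, cluster assignments labels, and (if you have them) ground-truth labels y:
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             adjusted_rand_score, normalized_mutual_info_score)
sil = silhouette_score(X, labels)      # higher is better, range [-1, 1]
db  = davies_bouldin_score(X, labels)  # lower is better
ari = adjusted_rand_score(y, labels)   # only if you secretly have labels
nmi = normalized_mutual_info_score(y, labels)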
Ranking/Recommendation:
- Precision@k: of the top k items, how many were relevant?
- Recall@k: of all relevant items, how many made the top k?
- MAP: mean average precision across queries. Rewards good early ranking.
- NDCG: like MAP but weights early positions more and handles graded relevance.
For recommenders, utility lives at the top of the list. If you optimize global accuracy, you might be proudly mediocre at everything.
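Precision@k and Recall@k are simple enough to write by hand; a toy sketch with hypothetical names (ranked_items is the model's ordering, relevant is a set of ground-truth items):
def precision_at_k(ranked_items, relevant, k):
    # of the top k items, how many were relevant?
    return sum(item in relevant for item in ranked_items[:k]) / k

def recall_at_k(ranked_items, relevant, k):
    # of all relevant items, how many made the top k?
    return sum(item in relevant for item in ranked_items[:k]) / len(relevant)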
Evaluation Protocols: The Lab Safety Rules
- Train/validation/test split: the sacred trio. The test set is for the end. Do not peek.
- Stratification: keep class proportions consistent across splits.
- Time series split: train on the past, validate on the future. Random shuffling here is temporal heresy.
- Cross-validation: K-fold for small data, stratified for classification. Use nested CV for hyperparameter tuning to avoid optimism.
- Baselines: always compare against a trivial baseline (majority class, mean predictor, seasonal naive, popularity). Beat it or rethink life choices.
- Data leakage: any artifact of the future or of the target sneaking into features. Common culprits: target-encoded categories without proper CV, normalizing with statistics computed on the full dataset (test set included), peeking at the test set.
- Uncertainty: report confidence intervals. Bootstrap if needed.
# bootstrap a confidence interval for your metric
# assumes arrays y, y_hat, a metric(y, y_hat) function, and B resamples
import numpy as np
rng = np.random.default_rng(0)
resamples = (rng.integers(0, len(y), len(y)) for _ in range(B))  # indices, with replacement
metric_values = [metric(y[i], y_hat[i]) for i in resamples]      # metric on each resample
CI = np.percentile(metric_values, [2.5, 97.5])                   # 95% interval
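And for the time-series rule above, scikit-learn's TimeSeriesSplit keeps every validation fold strictly in the future; a minimal sketch assuming time-ordered arrays X and y, a model with fit/predict, and a metric function as before:
from sklearn.model_selection import TimeSeriesSplit
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    # each val_idx comes strictly after its train_idx: no temporal heresy
    model.fit(X[train_idx], y[train_idx])
    print(metric(y[val_idx], model.predict(X[val_idx])))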
Offline vs online:
- Offline metrics get you to launch-ready.
- Online A/B tests measure what actually matters (clicks, conversions, revenue, risk). Pre-register your metric. Stop peeking.
No metric is truly real until it survives production.
Cost, Utility, and Custom Metrics: Make the Math Match the Money
During problem framing, you listed what hurts: false positives, false negatives, delays, wasted compute. Convert that into a cost matrix and optimize for expected cost.
expected_cost = FP_cost * FP + FN_cost * FN + TP_cost * TP + TN_cost * TN
maximize_utility = - expected_cost
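As a toy sketch with made-up per-error costs (numbers are hypothetical, and correct predictions are treated as free), plugging confusion-matrix counts straight into that formula:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
expected_cost = 10.0 * fp + 50.0 * fn  # e.g. a missed positive hurts 5x a false alarm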
Examples:
- Medical screening: recall > precision; missing a case is expensive.
- Moderation: precision > recall; false alarms upset users.
- Fraud: both hurt; tune threshold by scenario or segment.
Calibration matters for decision thresholds. A calibrated model with modest AUC can beat a fancy but overconfident one when you care about expected value.
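scikit-learn's reliability curve gives a quick calibration check; a sketch assuming y_true and predicted probabilities p:
from sklearn.calibration import calibration_curve
frac_pos, mean_pred = calibration_curve(y_true, p, n_bins=10)
# well calibrated: frac_pos is close to mean_pred in every bin,
# i.e. when the model says 0.7, the event happens about 70% of the time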
Fairness, Robustness, and Other Ways Reality Fights Back
- Group metrics: compare accuracy, recall, or calibration across demographics. Gaps hint at bias.
- Fairness metrics (quick taste): demographic parity difference, equalized odds (equal TPR/FPR across groups), equal opportunity (equal TPR). These conflict; choose based on values and law.
- Robustness: test under distribution shift, missing data, or noise. Your evaluation should reflect the wild outdoors, not the lab terrarium.
- Data quality: label noise lowers max achievable metrics. Consider label audits or noise-robust losses.
A perfect metric on flawed data is just a very precise wrong answer.
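As a minimal sketch of the group-metrics comparison above (assuming arrays y_true and y_pred plus a group array of demographic labels):
import pandas as pd
df = pd.DataFrame({"y": y_true, "pred": y_pred, "group": group})
tpr_by_group = df[df.y == 1].groupby("group")["pred"].mean()  # recall per group
# large gaps between groups are a bias smell worth investigating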
Common Traps and How to Dodge Them
- Reporting accuracy on imbalanced data. Use PR-AUC, F1, or class-balanced metrics.
- Tuning on the test set. Congrats, you optimized to the final exam key.
- Ignoring variance. Report CIs; know when improvements are just noise.
- One-metric-itis. Use a dashboard: performance, calibration, fairness, and latency.
- Using MAPE with zeros. Please do not divide by zero; consider sMAPE or MAE.
- Comparing across incomparable splits. Keep splits consistent; fix a random seed.
Quick Recipe: From Problem to Metric (and Sanity)
- Restate the decision and cost from problem framing.
- Identify label/data types. That suggests the metric family.
- Pick primary and secondary metrics. Include calibration for probabilistic decisions.
- Define evaluation protocol (splits, CV, baselines). Lock it before training.
- Tune thresholds to minimize expected cost, not maximize vibes.
- Report metrics with confidence intervals and per-group breakdowns.
- Validate offline, then online. Keep monitoring.
TL;DR (Tattoo This on Your Dataset)
- Metrics are how models talk to business goals.
- Choose metrics that reflect costs, data types, and deployment reality.
- Protocols matter as much as formulas; leakage and bad splits will betray you.
- Baselines are your grounding wire; beat them convincingly.
- Calibration, fairness, and confidence intervals turn good models into trustworthy systems.
Final thought: The right metric is not the one that makes your model look best. It is the one that makes your decisions safer, cheaper, and kinder to your users.