Handling Real-World Data Issues
Tackle noise, drift, imbalance, and other practical dataset challenges in production-like settings.
Out-of-Distribution Detection — When Your Model Sees a Unicorn and Panics
Your classifier was trained on zebras and horses. Now it meets a unicorn. Does it say "horse" confidently? Or does it at least get suspicious?
We already learned how to tame trees and make ensembles sing (remember stacking, blending, and calibration?). Now we get to the paranoid but necessary sibling: out-of-distribution (OOD) detection — the art of telling your model to stop and say "I do not know this" before it confidently misbehaves in production.
Why this matters (practical elevator pitch):
- Models deployed in the wild face datapoints that differ from training data in subtle or dramatic ways.
- Bad OOD handling = wrong predictions + overconfident garbage = real-world harm.
- OOD detection complements calibration and ensemble strategies we covered earlier: calibrated probabilities are helpful, but calibration alone does not guarantee awareness of novel contexts.
What is OOD (and what it is not)
- Covariate shift (input distribution changes) and concept shift (labeling function changes) are cousins of OOD, but OOD focuses on inputs that do not resemble training points.
- OOD detection aims to assign a score s(x) such that higher s means "more likely OOD". Then we threshold: if s(x) > tau, abstain or route to human.
Ask yourself: why do people keep misunderstanding this? Because many assume softmax low confidence implies novelty. Spoiler: softmax lies.
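A toy numpy sketch makes the point. The two-class linear classifier below is hypothetical (hand-picked weights, not a trained model), but it shows the mechanism: logits grow with distance from the decision boundary, so a point far off the training manifold gets an extreme logit and a near-1.0 softmax.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical two-class linear classifier: rows are per-class weight vectors
W = np.array([[1.0, 0.0], [-1.0, 0.0]])

p_near = softmax(W @ np.array([0.5, 0.0]))   # in-distribution-ish point
p_far = softmax(W @ np.array([50.0, 0.0]))   # wildly off-manifold point
```

The off-manifold point comes out *more* confident (`p_far[0]` is essentially 1.0) than the nearby one, despite being the novel input. Low softmax confidence is therefore neither necessary nor sufficient for novelty.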
A quick taxonomy of detection strategies
- Density & distance in input space
  - Kernel density estimation, Gaussian mixture models, Mahalanobis distance, Local Outlier Factor (LOF), Isolation Forest
- Representation-based / feature-space distance
  - Use penultimate-layer features from a neural net, or leaf-activation vectors from tree ensembles
- Uncertainty-based methods
  - Monte Carlo dropout, deep ensembles, Bayesian neural nets
- Reconstruction-based
  - Autoencoders and PCA: high reconstruction error suggests novelty
- Post-hoc softmax tweaks
  - Temperature scaling + input perturbations (ODIN), energy-based scoring
- Meta / supervised OOD detection
  - Train a binary classifier on in-distribution vs proxy OOD examples; stacking/blending can combine detectors
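As a concrete instance of the density/distance family, here is a minimal scikit-learn sketch on synthetic tabular data (the means and sizes are illustrative, not a benchmark):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 4))  # in-distribution tabular data
X_ood = rng.normal(loc=8.0, scale=1.0, size=(50, 4))     # far-away novel points

iso = IsolationForest(random_state=0).fit(X_train)

# score_samples returns higher values for "normal" points, so negate it
# to get an OOD score where higher = more likely OOD
s_in = -iso.score_samples(X_train)
s_ood = -iso.score_samples(X_ood)
```

Note that Isolation Forest is unsupervised: it never sees labels, only the shape of the training distribution.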
How this plugs into tree-based models and noisy labels
- For tree ensembles (random forest, gradient boosting): you can use leaf index embeddings or the distribution of votes as features for an OOD detector. Single-tree probability estimates are poorly calibrated; recall our calibration discussion — calibrating ensembles improves confidence estimates, which helps, but calibration does not equal OOD detection.
- Ensembles help: deep ensembles or an ensemble of diverse detectors increases robustness. You can stack multiple OOD scores into a meta-detector — a neat place to reuse stacking/blending knowledge.
- Noisy labels and annotation quality: OOD datapoints often correspond to annotation disagreements or mislabeled items. If an example is flagged as OOD and also has low annotator agreement, route it to relabeling.
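One way to make the leaf-index idea concrete is random-forest proximity: two points that land in the same leaf in many trees are "close" in the forest's learned representation. The sketch below (synthetic task; the score definition is one illustrative choice, not a standard API) flags points with low proximity to every training point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary task

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
leaves_train = rf.apply(X)  # (n_samples, n_trees): leaf index per tree

def leaf_ood_score(x):
    """1 minus max forest proximity to any training point (higher = more OOD)."""
    leaves_x = rf.apply(x.reshape(1, -1))                # (1, n_trees)
    proximity = (leaves_x == leaves_train).mean(axis=1)  # shared-leaf fraction
    return 1.0 - proximity.max()
```

A training point scores exactly 0 (it shares every leaf with itself), and the score rises as a query point stops co-occurring with any training point in the trees.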
Practical methods and when to use them
| Method | Works well for | Pros | Cons |
|---|---|---|---|
| Mahalanobis distance in feature space | Models with meaningful embeddings (deep nets) | Simple, fast, interpretable | Needs class-conditional statistics; assumes Gaussianity |
| Isolation Forest / LOF | Tabular data with heterogeneous features | Unsupervised, no training labels needed | Sensitive to scaling, high-dim issues |
| Autoencoder reconstruction | High-dim continuous inputs (images) | Intuitive; unsupervised | Can reconstruct OOD if powerful; not always reliable |
| Deep ensembles / MC dropout | Any neural net | Good uncertainty estimates | Computationally heavier |
| Supervised OOD classifier | When you can collect proxy OOD | Often strong | Requires proxy OOD that matches real-world surprises |
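To make the reconstruction-based row concrete, here is a minimal PCA sketch. The setup is an illustrative assumption: in-distribution data is constructed to lie near a 2-D subspace of a 10-D space, so a 2-component PCA reconstructs it well but fails on unstructured points.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 10))
X_train = rng.normal(size=(500, 2)) @ basis   # lies on a 2-D subspace of R^10
X_ood = rng.normal(size=(50, 10)) * 3.0       # no low-dimensional structure

pca = PCA(n_components=2).fit(X_train)

def recon_error(X):
    """Squared reconstruction error per row; higher suggests novelty."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return ((X - X_hat) ** 2).sum(axis=1)
```

The same caveat from the table applies to powerful autoencoders: a model expressive enough to reconstruct anything stops being a useful novelty signal.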
A simple recipe: feature-space Mahalanobis OOD detector (pseudocode)
# Given: trained model f, dataset X_train with class labels y_train
# 1. Extract features z = penultimate_layer(f, x) for train set
# 2. For each class c compute mean mu_c and shared covariance Sigma
# 3. For new x: z_new = penultimate_layer(f, x)
# score = min_c ( (z_new - mu_c)^T Sigma^{-1} (z_new - mu_c) )
# higher score -> more likely OOD
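A runnable numpy version of the recipe. Here `Z` stands in for whatever feature extractor you use (penultimate-layer activations, leaf embeddings, etc.), and the small ridge added to the covariance is a stability assumption, not part of the recipe itself.

```python
import numpy as np

def fit_mahalanobis(Z, y):
    """Per-class means and shared (pooled) inverse covariance from train features."""
    classes = np.unique(y)
    mus = np.stack([Z[y == c].mean(axis=0) for c in classes])
    centered = np.vstack([Z[y == c] - mu for c, mu in zip(classes, mus)])
    Sigma = np.cov(centered, rowvar=False) + 1e-6 * np.eye(Z.shape[1])  # ridge
    return mus, np.linalg.inv(Sigma)

def mahalanobis_ood_score(z, mus, Sigma_inv):
    """Min squared Mahalanobis distance to any class mean; higher = more OOD."""
    diffs = mus - z
    return float(min(d @ Sigma_inv @ d for d in diffs))
```

Fitting is one pass over the training features; scoring a new point is two small matrix products per class.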
Why it works: you're saying "how close is this test point in representation space to any trained class center?" If it's far from all, it's suspicious.
Evaluation: how do we measure OOD detectors?
- Area Under ROC (AUROC) between in-distribution and OOD scores
- False Positive Rate at 95% True Positive Rate (FPR@95TPR) — popular in literature
- Precision-Recall if OODs are rare
Important: evaluate on realistic OOD data. Toy OOD (e.g., random noise) is uninformative.
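Both headline metrics fit in a few lines. The sketch below uses synthetic Gaussian score distributions purely to exercise the math (the label convention assumed here is 1 = OOD, 0 = in-distribution):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fpr_at_95_tpr(scores_in, scores_ood):
    """In-distribution false positive rate at the threshold flagging 95% of OOD."""
    tau = np.quantile(scores_ood, 0.05)  # 95% of OOD scores exceed tau
    return (scores_in > tau).mean()

rng = np.random.default_rng(0)
s_in = rng.normal(0.0, 1.0, 1000)   # synthetic in-distribution scores
s_ood = rng.normal(4.0, 1.0, 1000)  # synthetic, well-separated OOD scores

labels = np.r_[np.zeros(1000), np.ones(1000)]  # 1 = OOD
auroc = roc_auc_score(labels, np.r_[s_in, s_ood])
```

AUROC is threshold-free; FPR@95TPR pins down behavior at one operating point, which is usually closer to what a production gate cares about.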
Checklist for building OOD capability (practical workflow)
- Baseline: test if naive softmax probability already fails — it usually does.
- Choose detection family based on data: isolation forest / LOF for tabular; Mahalanobis or autoencoder for images/text embeddings.
- If you already use ensembles, extract disagreement/variance as a detector input — stacking these signals can be powerful.
- Calibrate outputs (temperature scaling): calibration helps downstream decisions, but it is not a full OOD solution on its own.
- If possible, collect proxy OODs to train a supervised detector or to validate thresholds.
- Route flagged OODs for human review, fallback models, or explicit abstention.
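The thresholding and routing steps of the checklist can be sketched as follows (function names and the validation-quantile rule are illustrative choices, not a fixed recipe):

```python
import numpy as np

def choose_threshold(val_scores_in, target_flag_rate=0.05):
    """Pick tau so only ~target_flag_rate of in-distribution points get flagged."""
    return np.quantile(val_scores_in, 1.0 - target_flag_rate)

def route(score, tau):
    """Abstain and escalate when the OOD score exceeds the threshold."""
    return "human_review" if score > tau else "model_prediction"
```

Setting tau from in-distribution validation scores means you control the nuisance rate (how often you interrupt normal traffic) directly, then measure what fraction of true OOD gets caught.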
Common pitfalls and how to avoid them
- Assuming low softmax = OOD. No. Softmax is a liar when asked about novelty.
- Using density in raw input for high-dimensional data. Curse of dimensionality bites. Use learned features.
- Evaluating on simplistic OOD datasets. Test with the kinds of novelties your production system will face.
- Not connecting OOD detection to operations. A detector without a response strategy is just an alarm bell with no firefighter.
Closing rant / motivational mic drop
OOD detection is less glamorous than training a huge model but far more honest: it admits what you do not know. Combine representation-aware distances, calibrated uncertainties, and ensemble disagreement — then make sure your system has a plan for flagged inputs (human review, fallback rule, or safe abstention). Finally, remember: models that know their ignorance are models you can trust in the messy human world.
Key takeaways:
- OOD detection is essential in production and complements calibration and ensembling strategies.
- Use the right tool for your data: distance/density, reconstruction, or uncertainty-based methods.
- Always evaluate OOD methods on realistic OOD examples and integrate them into a decision flow.
If your model could say one honest sentence before causing trouble, make sure it does. Better: make it say several and then call for backup.