Handling Real-World Data Issues
Tackle noise, drift, imbalance, and other practical dataset challenges in production-like settings.
Rare Events and Positive-Unlabeled Data
Rare Events and Positive-Unlabeled Data — The Needle-in-a-Haystack Special
Imagine you’re a fraud analyst whose dataset is basically one tiny neon sign that says "FRAUD" and a giant fog of "maybe". Welcome to rare events and positive-unlabeled (PU) problems — where the negatives are shy, the positives are proud, and the unlabeled are just confused.
This sits naturally after our chat about tree-based models and ensembles (remember: random forests love balanced data but will happily hallucinate signal when fed garbage) and ties into drift detection and temporal leakage concerns. If your positives are rare and drifting over time, and your labeling process leaks future info, you’ll get optimistic models that explode in production. Let’s fix that before your AUC deceives you into ruin.
What are Rare Events and PU data, and why should you care?
- Rare events: target class frequency is tiny (think 0.1% — 1 in a thousand). Examples: fraud, some medical conditions, catastrophe prediction.
- Positive-Unlabeled (PU) data: you have confidently labeled positives and a massive pool of unlabeled examples — unlabeled contains both positives and negatives, but the negatives aren't explicitly labeled.
Why this matters: classic supervised training assumes labeled positives and negatives. When negatives are unlabeled, naive approaches (treating unlabeled as negative) introduce bias and kill generalization. Ensembles can amplify that bias if you don't correct for it.
The core conceptual pitfalls (aka what will make your model lie)
- Label bias from selective labeling (SCAR vs. non-SCAR)
- SCAR = Selected Completely At Random — the assumption that positives are labeled randomly among positives. If false, you need to model selection.
- Class prior unknown — you usually don’t know what fraction of unlabeled are actually positive.
- Evaluation blindness — accuracy/AUC on heuristic negative labels is meaningless; metrics must reflect uncertainty.
- Temporal/selection leakage — if labeling rules change over time, your model picks up the change, not the signal.
Quick tip: If positives are labeled more often when they’re obvious, your model learns to detect obviousness, not the underlying rare condition.
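For reference, the SCAR assumption above has a compact form (notation, following Elkan & Noto: s = 1 means "example is labeled", y is the true class, c is the constant label frequency):

```latex
% SCAR: among true positives, labeling is a coin flip with a
% constant probability c, independent of the features x.
p(s = 1 \mid x,\; y = 1) = c
\quad\Longrightarrow\quad
\pi = p(y = 1) = \frac{p(s = 1)}{c}.
```

The implication on the right is why SCAR matters so much: it turns the unknown class prior into two quantities you can actually estimate from data.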
Practical toolbox: methods you can use (ranked by how principled vs quick-and-dirty they are)
1) Heuristics and data-level fixes (fast, common, risky)
- Treat unlabeled as negatives (easy but biased)
- Downsample unlabeled to get class balance (helps models but doesn’t fix label noise)
- Pseudo-labeling: iteratively label high-confidence unlabeled as negative; dangerous if initial model is biased
2) PU-specific learning (principled)
- Two-step methods (e.g., the "spy" technique from S-EM)
- Use a small subset of positives as "spies" mixed into unlabeled to estimate probability of being negative.
- Unbiased PU (uPU) risk estimators
- Estimate risk using positive and unlabeled sets and an estimate of the class prior. Works well but can give negative empirical risks.
- Non-negative PU (nnPU)
- A variant of uPU that clips negative risk to zero — more stable with flexible models (neural nets/boosting).
- EM-style approaches
- Treat true class as latent and optimize jointly for labels and classifier.
3) One-class / anomaly detection
- One-class SVM, isolation forest, autoencoders — treat positives as the only class and detect outliers. Works if positives have coherent structure.
4) Semi-supervised and active learning
- Active labeling: query the most informative unlabeled items for human labels.
- Label propagation: use graph-based methods to spread label info across similar examples.
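The uPU/nnPU estimators in item 2 can be stated in one line. Because the unlabeled marginal mixes positives and negatives, the unobservable negative risk can be rewritten in terms of positives and unlabeled data; nnPU (Kiryo et al.) additionally clips the estimated negative part at zero:

```latex
% nnPU empirical risk: positives x_i^p (n_p of them), unlabeled x_j^u
% (n_u of them), class prior \pi, loss \ell, classifier f.
\hat{R}_{\mathrm{nnPU}}(f)
  = \pi \, \hat{R}_p^{+}(f)
  + \max\!\bigl\{ 0,\; \hat{R}_u^{-}(f) - \pi \, \hat{R}_p^{-}(f) \bigr\},
\qquad
\hat{R}_p^{\pm}(f) = \tfrac{1}{n_p}\textstyle\sum_{i} \ell\bigl(f(x_i^{p}), \pm 1\bigr),
\quad
\hat{R}_u^{-}(f) = \tfrac{1}{n_u}\textstyle\sum_{j} \ell\bigl(f(x_j^{u}), -1\bigr).
```

Without the max-with-zero, this is exactly the unbiased uPU estimator; the clipping trades a little bias for stability with flexible models.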
Integrating with tree-based models and ensembles
Tree ensembles are flexible and can be adapted rather than replaced.
- Sample weighting & calibrated losses: estimate class prior π (fraction positives in population) and assign sample weights or adjust loss to reflect PU risk estimators.
- Bagging with biased sampling: train many classifiers, each against a different subsample of the unlabeled pool treated as negatives; averaging across the ensemble reduces the variance introduced by the mislabeled positives hidden among them.
- Gradient boosting with custom objective: implement uPU/nnPU loss as the boosting objective so each tree optimizes the PU-aware risk.
- Pseudo-label carefully: only pseudo-label with very high-confidence predictions and validate using a holdout of labeled positives to avoid confirmation bias.
Code sketch (pseudo):
# Pseudocode: nnPU-style training loop with a tree ensemble
pi = estimate_class_prior(positives, unlabeled)   # see next section
for t in 1..T:
    pos_risk = pi * mean_loss(positives, label=+1)
    neg_risk = mean_loss(unlabeled, label=-1) - pi * mean_loss(positives, label=-1)
    neg_risk = max(0, neg_risk)          # nnPU: clip the negative part to zero
    gradient_step_with_tree(objective = pos_risk + neg_risk)
(Full implementations need careful numeric handling; this pseudocode just shows where you hook into gradient boosting.)
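As a concrete, minimal sketch of the risk computation above: pure NumPy with a logistic loss, assuming the class prior pi is already estimated (the function names here are illustrative, not from any library):

```python
import numpy as np

def logistic_loss(scores, y):
    # l(z, y) = log(1 + exp(-y*z)), computed stably via logaddexp
    return np.logaddexp(0.0, -y * scores)

def nnpu_risk(scores_pos, scores_unl, pi):
    """Non-negative PU risk: pi * R_p^+ + max(0, R_u^- - pi * R_p^-)."""
    pos_risk = pi * logistic_loss(scores_pos, +1).mean()
    neg_risk = (logistic_loss(scores_unl, -1).mean()
                - pi * logistic_loss(scores_pos, -1).mean())
    return pos_risk + max(0.0, neg_risk)   # clip the negative part

# toy usage: positives score high, unlabeled scores hover near zero
rng = np.random.default_rng(0)
risk = nnpu_risk(rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 5000), pi=0.05)
```

In a booster, you would differentiate this objective with respect to the raw scores to get the per-example gradients and Hessians each round.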
Estimating the class prior (π)
Reliable π estimation is crucial. Methods:
- EM-based estimation — alternate between estimating labels and π
- Mixture proportion estimation — use density ratios or classifier-output calibration techniques (Elkan & Noto) to estimate π
- Anchor/spy methods — label a random sprinkling of positives as spies and measure recovery rate
Without a good π, your calibrated probabilities will be garbage.
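A minimal sketch of the Elkan & Noto route, under SCAR: train a classifier to predict "labeled vs. not", then the average of its score g(x) = p(s=1|x) over held-out labeled positives estimates the label frequency c, and π = p(s=1) / c. The numbers below are hypothetical inputs, not real outputs of any model:

```python
import numpy as np

def estimate_prior_elkan_noto(g_labeled_pos, frac_labeled):
    """Under SCAR: c = p(s=1|y=1) ~ mean score over held-out labeled
    positives, and pi = p(y=1) = p(s=1) / c."""
    c = np.mean(g_labeled_pos)      # estimated label frequency
    return frac_labeled / c

# hypothetical: labeled positives get scored ~0.39 on average,
# and 2% of all rows carry a positive label
pi_hat = estimate_prior_elkan_noto(np.array([0.35, 0.42, 0.41, 0.38]), 0.02)
```

Note the classifier's scores must be reasonably calibrated for this to work; an uncalibrated booster will bias c and hence π.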
Evaluation: how to know if your model is actually good?
- Use precision@k if you care about top-ranked items (common in fraud)
- Track recall on held-out positives (you must keep a verified positive holdout)
- Compute lower/upper bounds for performance using estimated π
- Prefer business metrics (expected losses, cost-sensitive evaluation) over generic accuracy
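Precision@k from the first bullet is a one-liner worth having on hand (a sketch; scores and labels here are made-up toy values):

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of verified positives among the k highest-scored items."""
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k largest scores
    return labels[top_k].mean()

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
labels = np.array([1,   0,   1,   0,   0  ])   # 1 = verified positive
p_at_3 = precision_at_k(scores, labels, 3)     # top-3 items: 2 of 3 positive
```

In PU settings, remember that labels here should come from your verified-positive holdout; unlabeled items scored into the top k are "unknown", not wrong, so precision@k computed this way is a lower bound.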
Table: quick comparison
| Method Type | Pros | Cons |
|---|---|---|
| Heuristic labeling | Fast, simple | Biased, can mislead ensembles |
| PU-specific (nnPU) | Principled, unbiased in theory | Needs π, numeric tricks |
| One-class | Good for coherent positives | Fails if positives are diverse |
| Active learning | Data-efficient | Requires human labeling budget |
Practical recipe (do this, not that)
- Keep a holdout of verified positives for evaluation.
- Try to understand the labeling process: is it SCAR? If not, model the selection bias explicitly.
- Estimate the class prior π before training.
- Start with a PU-specific method (nnPU or Elkan-Noto) or adapt your booster's loss.
- Use ensembles with subsampling/weighting, not naive negative labeling.
- Monitor precision@k, recall on held-out positives, and business impact.
- If possible, use active learning to get a small batch of labeled negatives — it often pays off.
Closing — TL;DR with a tiny pep talk
- Rare events + unlabeled negatives = setup for overconfident disasters unless you treat the unlabeled properly.
- Don’t pretend unlabeled = negative. Estimate class prior, use PU-aware loss, and evaluate with metrics that matter.
- Tree ensembles are players, not magicians: adapt sampling/weights or the loss, and use ensembles to stabilize noisy assumptions.
Final mental image: your model should be a cautious detective, not a gossiping neighbor. Be skeptical, quantify uncertainty, and ask for a little more labeled truth when the stakes are high.
Ready to take this into practice? Next step: a runnable notebook showing nnPU with XGBoost-style custom objective and a class-prior estimator — want that?