
Supervised Machine Learning: Regression and Classification
Chapters

1. Foundations of Supervised Learning
2. Data Wrangling and Feature Engineering
3. Exploratory Data Analysis for Predictive Modeling
4. Train/Validation/Test and Cross-Validation Strategies
5. Regression I: Linear Models
6. Regression II: Regularization and Advanced Techniques
7. Classification I: Logistic Regression and Probabilistic View
8. Classification II: Thresholding, Calibration, and Metrics
9. Distance- and Kernel-Based Methods
10. Tree-Based Models and Ensembles
11. Handling Real-World Data Issues
  • Noisy Labels and Annotation Quality
  • Out-of-Distribution Detection
  • Data Leakage from Temporal Effects
  • Drift Detection and Adaptation
  • Rare Events and Positive-Unlabeled Data
  • High Cardinality Categorical Features
  • Skewed Targets in Regression
  • Missing Not at Random Mechanisms
  • Data Augmentation for Tabular Data
  • Weak Supervision and Distant Labels
  • Semi-Supervised Add-ons to Supervised
  • Privacy-Preserving Feature Engineering
  • Federated Learning Basics for Supervised
  • Small Data and High-D Variants
  • Shortcut Learning and Spurious Correlation
12. Dimensionality Reduction and Feature Selection
13. Model Tuning, Pipelines, and Experiment Tracking
14. Model Interpretability and Responsible AI
15. Deployment, Monitoring, and Capstone Project


Handling Real-World Data Issues


Tackle noise, drift, imbalance, and other practical dataset challenges in production-like settings.

PU Problems but Make It Snappy


Rare Events and Positive-Unlabeled Data — The Needle-in-a-Haystack Special

Imagine you’re a fraud analyst whose dataset is basically one tiny neon sign that says "FRAUD" and a giant fog of "maybe". Welcome to rare events and positive-unlabeled (PU) problems — where the negatives are shy, the positives are proud, and the unlabeled are just confused.

This sits naturally after our chat about tree-based models and ensembles (remember: random forests love balanced data but will happily hallucinate signal when fed garbage) and ties into drift detection and temporal leakage concerns. If your positives are rare and drifting over time, and your labeling process leaks future info, you’ll get optimistic models that explode in production. Let’s fix that before your AUC deceives you into ruin.


What are Rare Events and PU data, and why should you care?

  • Rare events: the target class frequency is tiny (think 0.1%, i.e., 1 in 1,000). Examples: fraud, some medical conditions, catastrophe prediction.
  • Positive-Unlabeled (PU) data: you have confidently labeled positives and a massive pool of unlabeled examples — unlabeled contains both positives and negatives, but the negatives aren't explicitly labeled.

Why this matters: classic supervised training assumes labeled positives and negatives. When negatives are unlabeled, naive approaches (treating unlabeled as negative) introduce bias and kill generalization. Ensembles can amplify that bias if you don't correct for it.


The core conceptual pitfalls (aka what will make your model lie)

  1. Label bias from selective labeling (SCAR vs. non-SCAR)
    • SCAR = Selected Completely At Random — the assumption that the labeled positives are a uniform random sample of all positives. If that fails, you have to model the selection mechanism explicitly.
  2. Class prior unknown — you usually don’t know what fraction of unlabeled are actually positive.
  3. Evaluation blindness — accuracy/AUC on heuristic negative labels is meaningless; metrics must reflect uncertainty.
  4. Temporal/selection leakage — if labeling rules change over time, your model picks up the change, not the signal.

Quick tip: If positives are labeled more often when they’re obvious, your model learns to detect obviousness, not the underlying rare condition.


Practical toolbox: methods you can use (ranked by how principled vs quick-and-dirty they are)

1) Heuristics and data-level fixes (fast, common, risky)

  • Treat unlabeled as negatives (easy but biased)
  • Downsample unlabeled to get class balance (helps models but doesn’t fix label noise)
  • Pseudo-labeling: iteratively label high-confidence unlabeled as negative; dangerous if initial model is biased

2) PU-specific learning (principled)

  • Two-step methods (e.g., Spy-EM)
    • Step 1: mix a small subset of labeled positives ("spies") into the unlabeled pool and train; unlabeled examples that score below most spies become reliable negatives. Step 2: train an ordinary classifier on positives vs. reliable negatives.
  • Elkan–Noto correction
    • Under SCAR, train a classifier to separate labeled from unlabeled, estimate the label frequency c = P(labeled | positive), and rescale predictions: P(y=1 | x) = P(labeled | x) / c.
  • Unbiased PU (uPU) risk estimators
    • Estimate risk using positive and unlabeled sets and an estimate of the class prior. Works well but can give negative empirical risks.
  • Non-negative PU (nnPU)
    • A variant of uPU that clips negative risk to zero — more stable with flexible models (neural nets/boosting).
  • EM-style approaches
    • Treat true class as latent and optimize jointly for labels and classifier.
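To make the Elkan–Noto correction concrete, here is a minimal plain-Python sketch. It assumes you have already trained a "nontraditional" classifier g(x) ≈ P(labeled | x); the scores below are invented for illustration, and the helper names are mine, not a library API.

```python
def estimate_c(scores_on_holdout_positives):
    """Elkan-Noto 'e1' estimator: c = P(labeled | y=1), approximated by the
    mean score g(x) over a held-out set of verified positives."""
    return sum(scores_on_holdout_positives) / len(scores_on_holdout_positives)

def correct_probability(g_x, c):
    """Under SCAR, P(y=1 | x) = P(labeled | x) / c, clipped into [0, 1]."""
    return min(1.0, g_x / c)

# Invented numbers: the labeled-vs-unlabeled model scores verified positives
# around 0.2, i.e., only about 20% of true positives ever get labeled.
scores_holdout_pos = [0.18, 0.22, 0.20, 0.19, 0.21]
c = estimate_c(scores_holdout_pos)   # 0.2
print(correct_probability(0.10, c))  # 0.5, twice the raw score
```

The correction only shifts probabilities; the ranking of examples is unchanged, which is why it pairs well with precision@k-style evaluation.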

3) One-class / anomaly detection

  • One-class SVM, isolation forest, autoencoders — treat positives as the only class and detect outliers. Works if positives have coherent structure.
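As a toy stand-in for the one-class idea (not a real one-class SVM or isolation forest), the sketch below models the labeled positives with nothing but their centroid and scores new points by distance to it. All data and names are invented.

```python
def centroid(points):
    """Mean of the labeled positives, dimension by dimension."""
    n, dims = len(points), len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dims)]

def distance_score(x, center):
    """Euclidean distance to the positive centroid: small = looks like the
    known positives, large = probably not one of them."""
    return sum((a - b) ** 2 for a, b in zip(x, center)) ** 0.5

positives = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]  # a coherent positive cluster
center = centroid(positives)                      # approximately [1.0, 1.0]
print(distance_score([1.1, 0.9], center) < distance_score([5.0, 5.0], center))  # True
```

This works only because the toy positives form one tight cluster; as the bullet above warns, the approach fails when positives are diverse.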

4) Semi-supervised and active learning

  • Active labeling: query the most informative unlabeled items for human labels.
  • Label propagation: use graph-based methods to spread label info across similar examples.

Integrating with tree-based models and ensembles

Tree ensembles are flexible and can be adapted rather than replaced.

  • Sample weighting & calibrated losses: estimate the class prior π (the fraction of positives in the whole population) and assign sample weights or adjust the loss to reflect PU risk estimators.
  • Bagging with biased sampling: create many classifiers trained against different unlabeled subsamples; ensemble averages reduce variance of incorrect negatives.
  • Gradient boosting with custom objective: implement uPU/nnPU loss as the boosting objective so each tree optimizes the PU-aware risk.
  • Pseudo-label carefully: only pseudo-label with very high-confidence predictions and validate using a holdout of labeled positives to avoid confirmation bias.

Code sketch (pseudo):

# Pseudocode: nnPU-style boosting loop (numeric details omitted)
pi = estimate_class_prior(positives, unlabeled)
for round in 1..T:
    risk_pos = pi * mean_loss(positives, label=+1)                                   # pi * R_p^+
    risk_neg = mean_loss(unlabeled, label=-1) - pi * mean_loss(positives, label=-1)  # R_u^- - pi * R_p^-
    pu_risk = risk_pos + max(0, risk_neg)                                            # nnPU: clip the negative part at zero
    fit_next_tree(gradient_of=pu_risk)                                               # boosting step on the PU-aware objective

(Full implementations need careful numeric handling — but this shows where you hook into gradient boosting.)
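For contrast with the pseudocode, here is a small runnable version of just the risk computation, following the non-negative PU formulation with a sigmoid surrogate loss. π is assumed known here, and wiring the gradient into a booster is left out.

```python
import math

def sigmoid_loss(margin):
    """Smooth surrogate loss l(z) = sigmoid(-z), common in the PU literature."""
    return 1.0 / (1.0 + math.exp(margin))

def nnpu_risk(scores_pos, scores_unl, pi):
    """Non-negative PU risk: pi * R_p^+ + max(0, R_u^- - pi * R_p^-).
    scores_* are raw classifier margins (larger = more positive)."""
    r_p_pos = sum(sigmoid_loss(+s) for s in scores_pos) / len(scores_pos)  # R_p^+
    r_p_neg = sum(sigmoid_loss(-s) for s in scores_pos) / len(scores_pos)  # R_p^-
    r_u_neg = sum(sigmoid_loss(-s) for s in scores_unl) / len(scores_unl)  # R_u^-
    return pi * r_p_pos + max(0.0, r_u_neg - pi * r_p_neg)

risk = nnpu_risk([2.0, 3.0], [-1.0, 0.0, 1.0], pi=0.02)
print(risk >= 0.0)  # True: the clip guarantees a non-negative risk estimate
```

Without the `max(0.0, ...)` clip this is plain uPU, which is exactly the variant that can go negative with flexible models.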


Estimating the class prior (π)

Reliable π estimation is crucial. Methods:

  • EM-based estimation — alternate between estimating labels and π
  • Mixture proportion estimation — use density ratios or ROC-based techniques (Elkan & Noto) to estimate π
  • Anchor/spy methods — mix a random subset of labeled positives ("spies") into the unlabeled pool and measure how often the trained model recovers them

Without a good π, your calibrated probabilities will be garbage.
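One concrete route under SCAR: P(labeled) = c · π, so π = P(labeled) / c once you have the label frequency c (e.g., from spies or the Elkan–Noto estimator). A toy sketch with invented counts:

```python
def estimate_pi(n_labeled, n_total, c):
    """Under SCAR: P(labeled) = c * pi, hence pi = P(labeled) / c.
    n_labeled / n_total estimates P(labeled); c = P(labeled | positive)."""
    return (n_labeled / n_total) / c

# Invented numbers: 50 labeled positives among 10,000 rows, estimated c = 0.25
pi_hat = estimate_pi(50, 10_000, 0.25)
print(pi_hat)  # 0.02, i.e., roughly 2% of the population is truly positive
```

Notice the leverage: a modest error in c propagates directly into π, which is why the section insists on estimating it carefully.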


Evaluation: how to know if your model is actually good?

  • Use precision@k if you care about top-ranked items (common in fraud)
  • Track recall on held-out positives (you must keep a verified positive holdout)
  • Compute lower/upper bounds for performance using estimated π
  • Prefer business metrics (expected losses, cost-sensitive evaluation) over generic accuracy
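The two ranking metrics above are short enough to implement directly; a minimal sketch with invented scores and labels:

```python
def precision_at_k(scores, labels, k):
    """Fraction of verified positives among the k highest-scoring items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    return sum(label for _, label in ranked[:k]) / k

def recall_on_holdout(scores_holdout_pos, threshold):
    """Share of held-out verified positives scored above the threshold."""
    hits = sum(1 for s in scores_holdout_pos if s >= threshold)
    return hits / len(scores_holdout_pos)

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1,   0,   1,   0,   0]  # 1 = verified positive
print(precision_at_k(scores, labels, 2))        # 0.5: one of the top 2 is verified
print(recall_on_holdout([0.9, 0.7, 0.4], 0.5))  # 2 of 3 holdout positives recovered
```

Both metrics only need verified positives and a ranking, which is exactly what a PU setting can provide without any trusted negatives.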

Table: quick comparison

Method type          Pros                             Cons
Heuristic labeling   Fast, simple                     Biased, can mislead ensembles
PU-specific (nnPU)   Principled, unbiased in theory   Needs π, numeric tricks
One-class            Good for coherent positives      Fails if positives are diverse
Active learning      Data-efficient                   Requires human labeling budget

Practical recipe (do this, not that)

  1. Keep a holdout of verified positives for evaluation.
  2. Try to understand the labeling process — is it SCAR? If not, model the selection bias explicitly.
  3. Estimate the class prior π before training.
  4. Start with a PU-specific method (nnPU or Elkan-Noto) or adapt your booster's loss.
  5. Use ensembles with subsampling/weighting, not naive negative labeling.
  6. Monitor precision@k, recall on held-out positives, and business impact.
  7. If possible, use active learning to get a small batch of labeled negatives — it often pays off.

Closing — TL;DR with a tiny pep talk

  • Rare events + unlabeled negatives = setup for overconfident disasters unless you treat the unlabeled properly.
  • Don’t pretend unlabeled = negative. Estimate class prior, use PU-aware loss, and evaluate with metrics that matter.
  • Tree ensembles are players, not magicians: adapt sampling/weights or the loss, and use ensembles to stabilize noisy assumptions.

Final mental image: your model should be a cautious detective, not a gossiping neighbor. Be skeptical, quantify uncertainty, and ask for a little more labeled truth when the stakes are high.

Ready to take this into practice? Next step: a runnable notebook showing nnPU with XGBoost-style custom objective and a class-prior estimator — want that?
