Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Handling Missing Values in Trees — The No-Drama Guide
"Missing data: nature's way of whispering, 'you didn't check everything.' Trees respond: 'hold my split.'"
You just learned about impurity and splitting criteria (how trees pick the best questions) and pruning/regularization (how trees avoid turning into overfitted bonsai nightmares). Now imagine some of your features decide to play hide-and-seek: values go missing. How should tree-based models and ensembles handle that? Spoiler: trees are way more flexible than linear models and kernel methods, and boosting libraries have cool built-in hacks — but there are trade-offs.
Why this matters (and why it's different from kNN/SVM)
- kNN and SVM are distance/kernel-based beasts: if a feature is missing, the distance or kernel computation gets awkward fast. You typically impute or compute distances on available features.
- Trees, in contrast, make decisions feature-by-feature (one split at a time). That gives several native and clever ways to handle missingness without immediately resorting to global imputation.
Question: Would you rather patch the whole road (global imputation) or give drivers signs at each junction (split-level handling)? Trees let you do both.
The main strategies (quick map)
- Treat missing as a separate category (a.k.a. missingness-as-information)
- Surrogate splits (CART's classic approach)
- Learned default directions (XGBoost/LightGBM style)
- Imputation (mean/mode, model-based, multiple)
- Proximity-based imputation (Random Forest / missForest)
- Missing-indicator features
We'll unpack each, compare pros/cons, and show when to use what.
1) Treat missing as its own value — "Missingness is a signal"
- For categorical features: add a level "MISSING" and split on it.
- For numerical features: you can discretize then treat NA as a category, or explicitly create a rule like "x is NA".
Why it works: if missingness correlates with the target (e.g., no lab test ordered because doctor thought patient was low-risk), then missingness itself is predictive.
Pros: Simple, captures informative missingness. No imputation needed.
Cons: If missingness is random (carries no signal), the MISSING level just adds noise; it also increases feature cardinality.
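A minimal sketch of the categorical case in plain Python (the sentinel name "MISSING" is an arbitrary choice):

```python
import math

def with_missing_level(values, sentinel="MISSING"):
    """Replace None/NaN with an explicit sentinel level so a tree can
    treat 'value was missing' as just another category to split on."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [sentinel if is_missing(v) else v for v in values]

print(with_missing_level(["A", None, "B", float("nan")]))
# ['A', 'MISSING', 'B', 'MISSING']
```

After this transform, a categorical split can route the MISSING level to whichever child reduces impurity most.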
2) Surrogate splits — CART's elegant backup plan
How it works (short pseudocode):
1. Find the best split S on feature A, ignoring rows where A is missing.
2. For rows with A present, record how S assigns them: left or right.
3. Find another split S2, on some feature B, that most closely matches S's left/right assignment.
4. Use S2 as a surrogate to route rows where A is missing.
5. Keep an ordered list of surrogates as fallbacks.
Why it's neat: the tree uses other features to emulate the missing split. It's local, split-specific, and respects the original splitting logic (impurity reduction basis).
Pros: No imputation needed, consistent with the tree's split logic, and robust when correlated features exist.
Cons: Computationally more expensive, not implemented in every library (scikit-learn, for example, has no surrogate splits), and performance depends on whether good surrogates exist.
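To make step 3 of the pseudocode concrete, here is a toy sketch (not CART's actual implementation) of scoring candidate surrogates by how often they agree with the primary split:

```python
def surrogate_agreement(primary_left, candidate_left):
    """Fraction of rows (with the primary feature observed) that the
    candidate split sends the same way as the primary split."""
    matches = sum(p == c for p, c in zip(primary_left, candidate_left))
    return matches / len(primary_left)

def best_surrogate(primary_left, candidates):
    """Pick the candidate feature whose split best mimics the primary split."""
    return max(candidates, key=lambda name: surrogate_agreement(primary_left, candidates[name]))

# Primary split on A sends rows L, L, R, R; which backup split tracks it best?
primary = [True, True, False, False]
candidates = {
    "B > 3": [True, False, False, False],   # agrees on 3 of 4 rows
    "C == 'x'": [True, True, False, True],  # agrees on 3 of 4 rows
    "D > 0": [True, True, False, False],    # agrees on all 4 rows
}
print(best_surrogate(primary, candidates))  # D > 0
```

At prediction time, a row missing A is routed by "D > 0"; if D is also missing, the next surrogate in the ranked list takes over.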
3) Learned default directions (XGBoost / LightGBM)
When a split tests feature A and the value is missing, gradient boosting libraries send the observation to whichever child (left or right) yields the better training objective, effectively learning a per-split default direction.
Pros: Fast, built-in, works remarkably well in practice for tabular data.
Cons: Implicit; you should know it's happening. It can hide missingness semantics (not explicit like surrogate splits).
Quick note: both XGBoost and LightGBM handle NaN this way, learning a per-split default direction that minimizes training loss; the terminology differs, but the mechanism is the same.
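The idea can be sketched in a few lines for a regression split, using squared error as the objective (real boosters use gradient/hessian statistics, so treat this as an illustration only):

```python
def learn_default_direction(y_left, y_right, y_missing):
    """Route the missing-valued rows left, then right; keep the direction
    with the lower total squared error around each child's mean."""
    def sse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)
    loss_if_left = sse(y_left + y_missing) + sse(y_right)
    loss_if_right = sse(y_left) + sse(y_right + y_missing)
    return "left" if loss_if_left <= loss_if_right else "right"

# The missing rows' targets resemble the left child's, so left wins:
print(learn_default_direction([1.0, 1.2], [5.0, 5.1], [1.1]))  # left
```

The chosen direction is stored per split, so different nodes can route missing values differently.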
4) Imputation — mean/mode, model-based, multiple
- Simple imputation (mean/mode) is cheap but injects bias and underestimates uncertainty.
- Model-based imputation (regression, kNN impute) uses other features to fill in realistic values.
- Multiple imputation captures uncertainty by producing several imputed datasets and combining results.
When to impute for trees? When your chosen tree implementation can't handle NA (older scikit-learn trees, for example), or when downstream steps require complete matrices.
Pros: Works across models, integrates with cross-validation pipelines.
Cons: Risk of leakage if you compute imputation statistics on the entire dataset; the imputer must be fit inside the training folds.
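A minimal illustration of doing this correctly, in plain Python (in practice you'd put an imputer inside a pipeline so cross-validation refits it per fold):

```python
def fold_safe_mean_impute(train, test):
    """Compute the fill value from the training fold ONLY, then apply it to
    both folds; using the full dataset's mean would leak test information."""
    observed = [v for v in train if v == v]       # v != v is True only for NaN
    fill = sum(observed) / len(observed)
    impute = lambda col: [fill if v != v else v for v in col]
    return impute(train), impute(test)

nan = float("nan")
train_col, test_col = fold_safe_mean_impute([1.0, 3.0, nan], [nan, 4.0])
print(train_col, test_col)  # [1.0, 3.0, 2.0] [2.0, 4.0]
```

Note the test fold's NaN is filled with 2.0, the mean of the training fold's observed values, not the pooled mean.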
5) Proximity-based imputation (Random Forest / missForest)
Random Forest defines a proximity between observations (how often they land in the same leaf). You can impute missing values by weighted averages of similar observations.
Pros: Often produces realistic imputations for mixed data.
Cons: Computationally costly; you must fit a forest (or iterate) before training the final model.
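A toy sketch of the mechanic, assuming you already have each row's leaf index per tree (this illustrates the proximity-weighted average, not the full missForest algorithm):

```python
def proximity_impute(leaf_ids, values, target):
    """Fill row `target` with a proximity-weighted average of observed rows,
    where proximity(i, j) = fraction of trees in which i and j share a leaf."""
    n_trees = len(leaf_ids)            # leaf_ids[t][i] = leaf of row i in tree t
    num = den = 0.0
    for j, v in enumerate(values):
        if j == target or v != v:      # skip self and rows that are also missing
            continue
        prox = sum(leaf_ids[t][target] == leaf_ids[t][j] for t in range(n_trees)) / n_trees
        num += prox * v
        den += prox
    return num / den if den else float("nan")

# Row 0 is missing; it shares a leaf with row 1 in both trees, never with row 2:
leaf_ids = [[0, 0, 1], [2, 2, 3]]
print(proximity_impute(leaf_ids, [float("nan"), 5.0, 9.0], target=0))  # 5.0
```

In practice this is iterated: impute roughly, refit the forest, recompute proximities, and repeat until the imputations stabilize.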
6) Missing-indicator features
Add a binary feature is_x_missing for each feature x. Use alongside imputation.
Why: if missingness is informative, the indicator makes that explicit and the tree can split on it.
Caveat: increases dimensionality; combine wisely with regularization/pruning to avoid overfitting (remember pruning/regularization discussion!).
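In scikit-learn this is what `SimpleImputer(add_indicator=True)` does; the underlying transform is simple enough to sketch by hand:

```python
def impute_with_indicator(column, fill):
    """Return (imputed_column, is_missing_flags): the flag column makes
    'this value was absent' an explicit, splittable feature."""
    flags = [1 if v != v else 0 for v in column]          # v != v catches NaN
    imputed = [fill if v != v else v for v in column]
    return imputed, flags

col, flags = impute_with_indicator([2.0, float("nan"), 4.0], fill=3.0)
print(col, flags)  # [2.0, 3.0, 4.0] [0, 1, 0]
```

The tree can now split on the flag column directly when missingness itself predicts the target.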
Practical recipe — flowchart to choose a strategy
- Does your tree/ensemble library support missing values natively (LightGBM/XGBoost/CatBoost)?
  - Yes: try native handling (learned default or special NA handling) first.
  - No: continue below.
- Is missingness likely informative? (E.g., tests not ordered → clinical meaning)
  - Yes: add missing indicators; consider treating NA as a category.
- Is the computation budget tight and you need quick and dirty? Use mean/mode with indicators.
- Need robustness and you're using Random Forests? Consider proximity or missForest.
- Want principled uncertainty? Use multiple imputation, then fit the ensemble on each imputed dataset and pool.
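The decision flow above, codified as a small helper (the argument names and returned labels are ad hoc, purely for illustration):

```python
def choose_missing_strategy(native_support, informative, budget_tight,
                            using_random_forest, need_uncertainty):
    """Walk the flowchart above and return a suggested strategy label."""
    if native_support:
        return "native handling (learned defaults / NA path)"
    if need_uncertainty:
        return "multiple imputation + pooled ensembles"
    if informative:
        return "missing indicators (or NA-as-category)"
    if using_random_forest and not budget_tight:
        return "proximity / missForest imputation"
    return "mean/mode imputation + indicators"

print(choose_missing_strategy(False, True, False, False, False))
# missing indicators (or NA-as-category)
```

The branches are ordered the same way as the flowchart: native support short-circuits everything, and cheap imputation with indicators is the fallback.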
Quick comparison table
| Strategy | Preserves locality? | Captures informative NA? | Cost | Library-ready |
|---|---|---|---|---|
| Missing-as-value | Yes | Yes (explicit) | Low | Yes (categorical) |
| Surrogate splits | Yes | Yes | Medium-High | Some implementations |
| Learned default (XGBoost/LGB) | Yes | Implicitly yes | Low | Yes |
| Mean/mode impute + indicator | No | Yes (indicator) | Low | Yes |
| Proximity/missForest | Yes | Yes | High | Specialized |
| Multiple imputation | No (per dataset) | Yes (uncertainty) | High | Needs pipeline |
Example: clinical lab dataset
Imagine a model predicting hospital readmission. Sodium test missing for many outpatients. If missingness correlates with healthy patients (doctor didn't order test), then missing = low risk. Treat missing as a signal — either with an indicator or missing-as-category — and the tree will happily split on it. If missing is random due to measurement error, use imputation.
Final words (yes, a mic drop)
Trees aren't helpless against missing data — they're nimble. Use the tree's structural advantages: split-level logic (surrogates), learned fallback routes (boosters), and explicit missing indicators when missingness has meaning. But don't get sloppy: naive imputation outside cross-validation leaks information and inflates apparent performance. Also remember our previous friends: impurity decisions and pruning matter here — adding missing indicators or surrogate logic increases model capacity and can overfit unless you regularize.
Key takeaways:
- Try native handling first (XGBoost/LightGBM).
- Treat missingness as information if it could be informative — use indicators or missing-as-category.
- Use surrogate splits when available for a principled fallback.
- Impute inside training folds to avoid leakage.
Trees will not magically fix garbage data, but they give you more ways to be clever than a distance function ever will. Use that power responsibly.
Version note: This builds on your earlier lessons about splitting criteria and pruning and connects forward to ensemble implementations (boosting defaults, forest proximities) and contrasts with kernel/distance-based approaches like kNN/SVM.