© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification
Chapters

  1. Foundations of Supervised Learning
  2. Data Wrangling and Feature Engineering
  3. Exploratory Data Analysis for Predictive Modeling
  4. Train/Validation/Test and Cross-Validation Strategies
  5. Regression I: Linear Models
  6. Regression II: Regularization and Advanced Techniques
  7. Classification I: Logistic Regression and Probabilistic View
  8. Classification II: Thresholding, Calibration, and Metrics
  9. Distance- and Kernel-Based Methods
  10. Tree-Based Models and Ensembles
    • Decision Trees for Regression
    • Decision Trees for Classification
    • Impurity and Splitting Criteria
    • Pruning and Regularization of Trees
    • Handling Missing Values in Trees
    • Random Forests Essentials
    • Extremely Randomized Trees
    • Gradient Boosting Fundamentals
    • Learning Rate, Depth, and Estimators
    • XGBoost, LightGBM, and CatBoost
    • Feature Importance and Permutation
    • Partial Dependence and ICE with Trees
    • Handling Imbalanced Data with Ensembles
    • Calibration of Ensemble Predictions
    • Stacking and Blending Strategies
  11. Handling Real-World Data Issues
  12. Dimensionality Reduction and Feature Selection
  13. Model Tuning, Pipelines, and Experiment Tracking
  14. Model Interpretability and Responsible AI
  15. Deployment, Monitoring, and Capstone Project


Tree-Based Models and Ensembles


Learn interpretable trees and powerful ensembles like random forests and gradient boosting.


Handling Missing Values in Trees



Handling Missing Values in Trees — The No-Drama Guide

"Missing data: nature's way of whispering, 'you didn't check everything.' Trees respond: 'hold my split.'"

You just learned about impurity and splitting criteria (how trees pick the best questions) and pruning/regularization (how trees avoid turning into overfitted bonsai nightmares). Now imagine some of your features decide to play hide-and-seek: values go missing. How should tree-based models and ensembles handle that? Spoiler: trees are way more flexible than linear models and kernel methods, and boosting libraries have cool built-in hacks — but there are trade-offs.


Why this matters (and why it's different from kNN/SVM)

  • kNN and SVM are distance/kernel-based beasts: if a feature is missing, the distance or kernel computation gets awkward fast. You typically impute or compute distances on available features.
  • Trees, in contrast, make decisions feature-by-feature (one split at a time). That gives several native and clever ways to handle missingness without immediately resorting to global imputation.

Question: Would you rather patch the whole road (global imputation) or give drivers signs at each junction (split-level handling)? Trees let you do both.


The main strategies (quick map)

  1. Treat missing as a separate category (a.k.a. missingness-as-information)
  2. Surrogate splits (CART's classic approach)
  3. Learned default directions (XGBoost/LightGBM style)
  4. Imputation (mean/mode, model-based, multiple)
  5. Proximity-based imputation (Random Forest / missForest)
  6. Missing-indicator features

We'll unpack each, compare pros/cons, and show when to use what.


1) Treat missing as its own value — "Missingness is a signal"

  • For categorical features: add a level "MISSING" and split on it.
  • For numerical features: you can discretize then treat NA as a category, or explicitly create a rule like "x is NA".

Why it works: if missingness correlates with the target (e.g., no lab test ordered because doctor thought patient was low-risk), then missingness itself is predictive.

Pros: Simple, captures informative missingness, no imputation needed.
Cons: When missingness is purely random, the extra level/branch is noise the tree can overfit to; it also increases feature cardinality.
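
For the categorical case, "missing-as-value" is a one-liner. Here's a minimal pandas sketch on toy data (the column name `referral_source` is made up for illustration):

```python
import pandas as pd

# Toy data: "referral_source" has genuinely missing entries.
df = pd.DataFrame({"referral_source": ["web", None, "clinic", None, "web"]})

# Promote missingness to its own category so a tree can split on it directly.
df["referral_source"] = df["referral_source"].fillna("MISSING")

print(df["referral_source"].tolist())
# ['web', 'MISSING', 'clinic', 'MISSING', 'web']
```

After encoding, a tree can learn a rule like "referral_source == MISSING → left", which is exactly the missingness-as-signal behavior described above.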


2) Surrogate splits — CART's elegant backup plan

How it works (short pseudocode):

1. Find the best split S on feature A, ignoring rows where A is missing.
2. For rows with A present, record how S assigns them: left or right.
3. Find another split S2 on some feature B that most closely matches S's left/right assignment.
4. Use S2 as a surrogate to route rows where A is missing.
5. Keep an ordered list of surrogates as fallbacks.

Why it's neat: the tree uses other features to emulate the missing split. It's local, split-specific, and respects the original splitting logic (impurity reduction basis).

Pros: Minimal imputation illusion, consistent with split logic, can be robust.
Cons: Computationally more expensive, not implemented in every library (scikit-learn's trees, for example, do not implement surrogate splits; R's rpart does), and performance depends on whether good surrogates exist.
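
The core of steps 3–4 is just "find the split on B that best agrees with the primary split's left/right assignment." A minimal NumPy sketch of that search, on synthetic data where B is a noisy copy of A (all names and thresholds here are illustrative, not any library's API):

```python
import numpy as np

def best_surrogate(b, goes_left, thresholds):
    """Find the threshold on feature b whose left/right assignment best
    agrees with the primary split's assignment (goes_left). Tries both
    orientations: 'b <= t goes left' and the flipped version."""
    best_t, best_agree, best_flip = None, -1.0, False
    for t in thresholds:
        pred_left = b <= t
        agree = np.mean(pred_left == goes_left)
        for flip, score in ((False, agree), (True, 1 - agree)):
            if score > best_agree:
                best_t, best_agree, best_flip = t, score, flip
    return best_t, best_agree, best_flip

# Toy setup: the primary split was "A <= 3"; B correlates with A.
rng = np.random.default_rng(0)
a = rng.uniform(0, 10, 200)
b = a + rng.normal(0, 1.0, 200)          # noisy copy of A
goes_left = a <= 3                        # primary split's assignment
cands = np.quantile(b, np.linspace(0.05, 0.95, 19))
t, agree, flip = best_surrogate(b, goes_left, cands)
print(f"surrogate: B <= {t:.2f} (flipped={flip}), agreement={agree:.2f}")
```

Rows where A is missing would then be routed by this surrogate rule instead of the primary split; a real implementation keeps several such surrogates ranked by agreement.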


3) Learned default directions (XGBoost / LightGBM)

Gradient boosting libraries take a pragmatic route: when a split is on feature A and the value is missing, they send the observation to whichever child (left or right) yielded the better objective during training — effectively learning a default direction for each split.

Pros: Fast, built-in, works remarkably well in practice for tabular data.
Cons: Implicit; you should know it's happening. It can hide missingness semantics (not explicit like surrogate splits).

Quick note: LightGBM treats NaN by finding the best split direction; XGBoost treats missing as a separate path learned to minimize loss.


4) Imputation — mean/mode, model-based, multiple

  • Simple imputation (mean/mode) is cheap but injects bias and underestimates uncertainty.
  • Model-based imputation (regression, kNN impute) uses other features to fill in realistic values.
  • Multiple imputation captures uncertainty by producing several imputed datasets and combining results.

When to impute for trees? When your chosen tree implementation can't handle NA (older scikit-learn trees, for example — native NaN support for decision trees arrived only in recent releases), or when an ensemble pipeline requires a complete matrix.

Pros: Works across models, integrates with cross-validation pipelines.
Cons: Risk of leakage if you impute using entire dataset; you must do it inside training folds.
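
The leakage caveat has a simple mechanical fix: put the imputer inside the pipeline so cross-validation refits it on each training fold. A sketch with mean imputation and a decision tree (the dataset here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan   # poke 10% holes at random

# The imputer lives inside the pipeline, so cross_val_score refits it
# on each training fold -- no statistics leak in from validation folds.
pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     DecisionTreeClassifier(random_state=0))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f}")
```

Calling `SimpleImputer().fit_transform(X)` on the full matrix *before* cross-validation would compute fold-external means and inflate the scores — the pipeline version is the honest one.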


5) Proximity-based imputation (Random Forest / missForest)

Random Forest defines a proximity between observations (how often they land in the same leaf). You can impute missing values by weighted averages of similar observations.

Pros: Often produces realistic imputations for mixed data, using the forest's own notion of similarity.
Cons: Computationally expensive — you fit the forest (often repeatedly) just to impute.
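
One round of proximity-based imputation can be sketched directly with scikit-learn's `apply()` (which returns per-tree leaf indices); everything else here — the data, the single missing cell, the one-pass refinement — is an illustrative simplification of the iterative scheme:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)
X_miss = X.copy()
X_miss[5, 0] = np.nan                      # one missing cell to impute

# Step 1: rough-impute with the column mean, fit a forest.
# min_samples_leaf > 1 keeps leaves shared between rows, so proximities
# are not trivially zero.
col_mean = np.nanmean(X_miss[:, 0])
X_filled = X_miss.copy()
X_filled[5, 0] = col_mean
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                           random_state=0).fit(X_filled, y)

# Step 2: proximity(i, j) = fraction of trees where rows i and j share a
# leaf; re-impute as the proximity-weighted mean of observed values.
leaves = rf.apply(X_filled)                # shape (n_samples, n_trees)
prox = (leaves[5] == leaves).mean(axis=1)  # proximity of row 5 to all rows
prox[5] = 0.0                              # exclude the row itself
observed = np.delete(np.arange(100), 5)
new_val = np.average(X_miss[observed, 0], weights=prox[observed])
print(f"mean impute: {col_mean:.3f}, proximity impute: {new_val:.3f}")
```

missForest-style algorithms iterate this fit-then-reimpute loop until the imputed values stabilize, which is where the cost comes from.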


6) Missing-indicator features

Add a binary feature is_x_missing for each feature x. Use alongside imputation.

Why: if missingness is informative, the indicator makes that explicit and the tree can split on it.

Caveat: increases dimensionality; combine wisely with regularization/pruning to avoid overfitting (remember pruning/regularization discussion!).
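
In scikit-learn you don't even need to build the indicators by hand: `SimpleImputer(add_indicator=True)` appends one binary is-missing column per feature that had any NaN. A tiny sketch on a hand-written matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 8.0],
              [3.0, np.nan]])

# add_indicator=True appends binary is-missing columns after the
# imputed features: [x0_imputed, x1_imputed, x0_missing, x1_missing]
imp = SimpleImputer(strategy="mean", add_indicator=True)
Xt = imp.fit_transform(X)
print(Xt)
```

The tree can now split on the indicator columns when missingness is predictive, while the imputed columns keep the matrix complete.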


Practical recipe — flowchart to choose a strategy

  1. Does your tree/ensemble library support missing values natively (LightGBM/XGBoost/CatBoost)?
    • Yes: try native handling (learned default or special NA handling) first.
    • No: go to step 2.
  2. Is missingness likely informative? (E.g., tests not ordered → clinical meaning)
    • Yes: add missing indicators; consider treating NA as a category.
  3. Is computation budget tight and you need quick & dirty? Use mean/mode with indicators.
  4. Need robustness and you're using Random Forests? Consider proximity or missForest.
  5. Want principled uncertainty? Use multiple imputation, then fit ensemble on each imputed dataset and pool.

Quick comparison table

| Strategy | Preserves locality? | Captures informative NA? | Cost | Library-ready |
| --- | --- | --- | --- | --- |
| Missing-as-value | Yes | Yes (explicit) | Low | Yes (categorical) |
| Surrogate splits | Yes | Yes | Medium-High | Some implementations |
| Learned default (XGBoost/LGB) | Yes | Implicitly yes | Low | Yes |
| Mean/mode impute + indicator | No | Yes (indicator) | Low | Yes |
| Proximity/missForest | Yes | Yes | High | Specialized |
| Multiple imputation | No (per dataset) | Yes (uncertainty) | High | Needs pipeline |

Example: clinical lab dataset

Imagine a model predicting hospital readmission. Sodium test missing for many outpatients. If missingness correlates with healthy patients (doctor didn't order test), then missing = low risk. Treat missing as a signal — either with an indicator or missing-as-category — and the tree will happily split on it. If missing is random due to measurement error, use imputation.


Final words (yes, a mic drop)

Trees aren't helpless against missing data — they're nimble. Use the tree's structural advantages: split-level logic (surrogates), learned fallback routes (boosters), and explicit missing indicators when missingness has meaning. But don't get sloppy: naive imputation outside cross-validation leaks information and inflates apparent performance. Also remember our previous friends: impurity decisions and pruning matter here — adding missing indicators or surrogate logic increases model capacity and can overfit unless you regularize.

Key takeaways:

  • Try native handling first (XGBoost/LightGBM).
  • Treat missingness as information if it could be informative — use indicators or missing-as-category.
  • Use surrogate splits when available for a principled fallback.
  • Impute inside training folds to avoid leakage.

Trees will not magically fix garbage data, but they give you more ways to be clever than a distance function ever will. Use that power responsibly.


Version note: This builds on your earlier lessons about splitting criteria and pruning and connects forward to ensemble implementations (boosting defaults, forest proximities) and contrasts with kernel/distance-based approaches like kNN/SVM.
