Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Impurity and Splitting Criteria — The Soul of a Decision Tree (and why it sometimes overreacts)
"A tree doesn't learn by staring at data — it learns by repeatedly asking, 'Is this split better than chaos?'"
You're already comfortable with Decision Trees for Regression and Classification (we covered structure, leaves, and predictions). You also remember kNN and kernels from Distance- and Kernel-Based Methods — those methods carve space with local neighborhoods or smooth decision boundaries. Trees take a different vibe: they greedily partition feature space into axis-aligned regions. The question at each node is: what is the best question to ask next? That question is decided by an impurity measure and its reduction.
Quick elevator pitch
- Impurity: a scalar that quantifies how "mixed" the labels are in a node. Low impurity = purer node (ideally one label).
- Splitting criterion: choose the feature and threshold that maximize the impurity decrease (i.e., parent impurity minus weighted child impurities).
Why this matters: the impurity you optimize determines how the tree grows, how it generalizes, and how it interacts with ensembles like Random Forests or Boosting.
The Usual Suspects: Classification
1) Gini impurity
Formula:
Gini(p) = 1 - sum_{k} p_k^2

Interpretation: the probability of mislabelling a sample if you pick its label at random according to the node's class proportions.
Fast, smooth, and commonly used (scikit-learn default). Works great in practice.
2) Entropy (Information Gain)
Formula:
Entropy(p) = - sum_{k} p_k log2(p_k)

Interpretation: the average information (in bits) required to encode the class label. A split that reduces entropy yields information gain.
Often yields similar splits to Gini. Slightly more sensitive to changes near p=0 or 1.
3) Misclassification error (a blunt instrument)
Formula:
Error(p) = 1 - max_k p_k

Interpretation: the fraction of samples that would be misclassified if you labeled the node with its majority class.
Not great for splitting because it's less sensitive to changes in class proportions — usually used for pruning/inspection rather than split choice.
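As a sanity check, all three formulas are easy to compute directly from a vector of class proportions (a minimal sketch using NumPy; the function names here are just for illustration):

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits; 0 * log2(0) is treated as 0 by dropping zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def misclassification(p):
    """1 minus the majority-class proportion."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

# A 50/50 binary node is maximally impure under all three measures:
print(gini([0.5, 0.5]))               # 0.5
print(entropy([0.5, 0.5]))            # 1.0
print(misclassification([0.5, 0.5]))  # 0.5
```

A pure node (proportions like `[1.0, 0.0]`) scores 0 under all three, which is the behavior the split search exploits.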
How splits are evaluated (classification)
For a candidate split s that produces left child L and right child R:
ImpurityDecrease(s) = I(parent) - (|L|/|parent|) I(L) - (|R|/|parent|) I(R)
Choose s that maximizes this. Note the weighted child impurities — larger child matters more.
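Spelled out in code, the weighted-decrease formula looks like this (a hedged sketch; `proportions` and `impurity_decrease` are illustrative helper names, not a library API):

```python
import numpy as np

def gini(p):
    """Gini impurity from a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def proportions(y):
    """Class proportions of a label array."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def impurity_decrease(y_parent, y_left, y_right, impurity=gini):
    """Parent impurity minus size-weighted child impurities."""
    n = len(y_parent)
    return (impurity(proportions(y_parent))
            - len(y_left) / n * impurity(proportions(y_left))
            - len(y_right) / n * impurity(proportions(y_right)))

# A perfect split of a 50/50 parent recovers the full parent impurity (0.5):
parent = np.array([0, 0, 1, 1])
print(impurity_decrease(parent, parent[:2], parent[2:]))  # 0.5
```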
Regression impurity: variance and friends
For regression trees, impurity commonly measures spread of y-values inside a node.
Variance / MSE (most common):
Var = (1/n) sum_i (y_i - mean_y)^2

MAE-based splits: use absolute deviations around the median. Less common because optimizing absolute loss during split search is trickier and slower.
Split quality: reduction in variance (parent variance minus weighted child variances). That tells the tree how well a split concentrates similar y-values.
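The variance-reduction rule mirrors the classification case, with `np.var` standing in for the impurity function (again an illustrative sketch, not a library API):

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Parent variance minus size-weighted child variances."""
    n = len(y_parent)
    return (np.var(y_parent)
            - len(y_left) / n * np.var(y_left)
            - len(y_right) / n * np.var(y_right))

y = np.array([1.0, 1.0, 5.0, 5.0])
# Separating the low values from the high ones removes all within-node spread:
print(variance_reduction(y, y[:2], y[2:]))  # 4.0
```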
Practical notes: how do we search splits?
- For continuous features: sort unique values and consider thresholds at midpoints between consecutive distinct values. Complexity: O(n log n) per feature if sorted; faster with histograms and approximate quantiles.
- For categorical features: either test one-vs-rest or, for small cardinality, test all subset splits (2^{k-1}-1 possibilities) — explosive if many categories. Often we encode or use heuristics.
- Many libraries (LightGBM, XGBoost) use histogram/quantile binning to make this fast and memory friendly.
Pseudocode (very high-level):

```
for feature in features:
    sort unique feature values
    for threshold in midpoints between consecutive values:
        partition data into L, R
        compute impurity decrease
choose the (feature, threshold) pair with the maximum decrease
```
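Assuming Gini impurity and continuous features, that search can be written as runnable Python (`best_split` is an illustrative helper, not a library function; real implementations sort once per feature and update class counts incrementally instead of re-partitioning at every threshold):

```python
import numpy as np

def gini_from_labels(y):
    """Gini impurity computed directly from a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive greedy search for the (feature, threshold) with max Gini decrease."""
    n, d = X.shape
    best = (None, None, -np.inf)  # (feature index, threshold, gain)
    parent_impurity = gini_from_labels(y)
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= t
            yl, yr = y[mask], y[~mask]
            gain = (parent_impurity
                    - len(yl) / n * gini_from_labels(yl)
                    - len(yr) / n * gini_from_labels(yr))
            if gain > best[2]:
                best = (j, t, gain)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
j, t, gain = best_split(X, y)
print(j, t, gain)  # 0 2.5 0.5 — the midpoint that separates the classes perfectly
```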
Connection to kNN/SVM (remember those neighbors and kernels?)
- kNN and kernel methods make local decisions: kNN averages labels of neighbors; kernel methods weigh points by smooth kernels. Trees are adaptive partitioners — they carve the space into blocks where predictions are constant (or average). Think step functions vs smooth kernels.
- Splitting criteria ask: "Where is the signal strong enough to justify carving out a new rectangular neighborhood?" kNN instead asks: "Given a point, who are its neighbors?" Tree boundaries are piecewise and axis-aligned; kernel methods and SVMs fit a single global function but can produce curved boundaries.
- Ensembles (Random Forests) restore some locality by averaging many adaptive partitions — kind of like smoothing many different piecewise-constant estimates, producing a function that behaves more like a localized smoother.
Which impurity should you pick? (Practical cheat sheet)
| Task | Typical impurity | Why |
|---|---|---|
| Classification (general) | Gini | Fast and effective in practice |
| Classification (probabilities/interpretability) | Entropy | Aligns with information theory; slightly different sensitivity |
| Classification (pruning only) | Misclassification error | Useful for final node label selection, not split search |
| Regression | Variance / MSE | Directly minimizes squared prediction error; default in many libs |
| Robust regression | MAE-based | For heavy tails and outliers, but harder to optimize for split search |
Quick rule of thumb: if you don’t have a reason to change it, use Gini for classification and variance for regression.
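In scikit-learn the impurity is just a constructor argument, so switching criteria from this cheat sheet is a one-line change (criterion names as in recent scikit-learn versions, where `squared_error` replaced the older `mse` alias):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf_gini = DecisionTreeClassifier(criterion="gini")          # the default
clf_ent = DecisionTreeClassifier(criterion="entropy")        # information gain
reg_mse = DecisionTreeRegressor(criterion="squared_error")   # variance reduction
reg_mae = DecisionTreeRegressor(criterion="absolute_error")  # robust, but slower

# Tiny toy data: one feature, perfectly separable labels.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf_gini.fit(X, y)
print(clf_gini.predict([[2.5]]))  # [1]
```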
Subtleties, gotchas, and power plays
- Greedy nature: a split that looks optimal locally may be suboptimal globally. Trees are greedy — hence ensembles (bagging, boosting) exist to fix that.
- Imbalanced classes: impurity measures can be dominated by the majority class; consider class weights, stratified splits, or different metrics (AUC-based objectives in boosting).
- Overfitting: maximizing impurity reduction can lead to tiny pure leaves with noisy decisions. Use depth limits, min_samples_leaf, or pruning.
- Feature scale and axis-aligned splits: trees don't care about scaling but are limited to axis-aligned partitions. If your signal is diagonal in feature space (like XOR rotated), trees need many splits.
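To see the overfitting point in action, compare an unconstrained tree to a regularized one on noisy data (a sketch using scikit-learn; the parameter values and random seed are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
# Flip ~20% of labels to simulate noise the tree should NOT memorize.
noise = rng.random(200) < 0.2
y[noise] = 1 - y[noise]

deep = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                 random_state=0).fit(X, y)
# The unconstrained tree grows far deeper chasing noise; the regularized
# one stops at the depth cap.
print(deep.get_depth(), shallow.get_depth())
```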
Ensembles and impurity — why tiny differences matter
- Random Forests: build many deep trees (often unpruned) and average. If you use Gini vs Entropy, trees differ slightly — but averaging washes out many differences. More important: randomness (bootstrap + feature subsampling) reduces correlation.
- Boosting (e.g., XGBoost, LightGBM): fits trees to residuals or gradients derived from a loss function (e.g., squared error for regression, logistic deviance for classification). Here, impurity reduction is measured relative to the loss gradient; the split criterion essentially maximizes the gain in objective (not always classic Gini/Entropy).
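For the boosting case, the second-order split "gain" from the XGBoost paper can be sketched directly. Here `G` and `H` are the sums of first and second derivatives of the loss over each child's samples, `lam` is the L2 leaf-weight penalty, and the paper's extra gamma complexity term is omitted for brevity:

```python
def boosting_gain(G_left, H_left, G_right, H_right, lam=1.0):
    """XGBoost-style split gain: improvement in the regularized
    second-order objective from splitting one node into two leaves."""
    def score(G, H):
        # Objective reduction achievable by a single leaf with these stats.
        return G * G / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right))

# Children with opposite-sign gradient sums give a large gain, because
# splitting lets each leaf take a different correction direction:
print(boosting_gain(G_left=-4.0, H_left=4.0, G_right=4.0, H_right=4.0))  # 3.2
```

Notice that no Gini or entropy appears: the "impurity" here is defined entirely by the loss function being boosted.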
Closing: TL;DR + Actionable Takeaways
- Impurity = how mixed a node is. Splitting = pick the split that maximizes impurity decrease.
- For classification: Gini and Entropy are the stars; misclassification error is blunt and rarely used for split search.
- For regression: variance (MSE) reduction is the standard impurity.
- Trees = axis-aligned adaptive partitioners. Compare to kNN/kernel methods which are local or smooth. Ensembles blend many trees to reduce variance or bias.
Final power tip: If your tree wildly overfits (memorizes a training set like a judge memorizing trivia), soften it with min_samples_leaf, max_depth, or switch to an ensemble — and remember: impurity choice often nudges behavior but tuning complexity regularization and feature handling moves mountains.
"Pick a sensible impurity, regularize hard, and when in doubt—ensemble it."