Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Pruning and Regularization of Trees
Prune It Like It's Hot — Regularizing Trees with Sass
"A decision tree that remembers the entire training set is a tree that never learned to generalize — it's just really good at gossip."
Hook — quick reality check
Remember how we split on impurity (Gini, entropy) to greedily carve up feature space? And how decision trees for classification give us those crisp, axis-aligned regions that can sometimes look like Lego castles — perfect on the training set, tragic at test time? Good. You already know the problem: trees are powerful but prone to overfitting. Now we'll learn how to tame them with pruning and regularization — the horticulture and the discipline of the tree world.
This sits naturally after our coverage of splitting criteria and recent detours into kNN and SVM: kNN and kernels gave you smooth or local decision boundaries; a single decision tree gives you piecewise-constant partitions. Pruning + ensembles bridge the gap: they smooth, regularize, and keep your model from memorizing weird artifacts in the data.
What problem are we solving? (Not rhetorical — it's dramatic)
- Trees grow until every leaf is pure (or until a cutoff), which reduces training error but inflates variance.
- We want models that: generalize well, remain interpretable when possible, and don't overreact to noise.
In short: control complexity. Two families of tools do this:
- Pre-pruning (early stopping) — stop growth before the tree gets silly.
- Post-pruning (cost-complexity / reduced error) — grow first, then surgically remove branches.
Both are forms of regularization for decision trees.
Measuring complexity and the bias-variance tradeoff
Key complexity measures:
- Depth of the tree
- Number of leaves (terminal nodes)
- Number of splits
Too simple → high bias (underfit). Too complex → high variance (overfit). Think of alpha in pruning as the "responsibility tax" for complexity. Pay more tax (higher alpha) → simpler tree.
Mathematically, in cost-complexity pruning we minimize:
R_alpha(T) = R(T) + alpha * |T_leaves|
Where R(T) is the subtree error (e.g., misclassification or impurity-based cost), |T_leaves| is the number of leaves, and alpha >= 0 balances fit vs complexity.
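To make the tradeoff concrete, here is a tiny sketch of how the objective behaves. The (error, leaf-count) pairs are invented for illustration; the point is only that raising alpha shifts the minimizer toward smaller subtrees.

```python
# Toy illustration of the cost-complexity objective R_alpha(T) = R(T) + alpha * |T_leaves|.
# The (training error, leaf count) pairs below are made up for illustration.
def r_alpha(error, n_leaves, alpha):
    """Cost-complexity score: training error plus a per-leaf penalty."""
    return error + alpha * n_leaves

# Hypothetical subtrees from a pruning sequence: (training error, number of leaves).
subtrees = [(0.02, 100), (0.05, 30), (0.10, 10), (0.25, 2)]

for alpha in (0.0, 0.001, 0.01):
    best = min(subtrees, key=lambda t: r_alpha(t[0], t[1], alpha))
    print(f"alpha={alpha}: best subtree has {best[1]} leaves")
# As alpha grows, the "responsibility tax" per leaf makes smaller subtrees win.
```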
Pre-pruning (the "I have self-control" approach)
Shut the tree down early. Typical knobs:
- max_depth: maximum depth of the tree.
- min_samples_split / min_samples_leaf: minimum samples to split or to allow a leaf.
- min_impurity_decrease: require a minimum reduction in impurity to split.
- max_leaf_nodes: force a cap on leaves.
Pros:
- Fast, memory-efficient.
- Simple hyperparameters to search.
Cons:
- Greedy decisions may stop promising branches prematurely.
- Usually yields a less optimal subtree than post-pruning a fully grown tree.
Practical heuristics: set min_samples_leaf relative to dataset size (e.g., 1–5% of the samples for many tasks), and try max_depth around 5–15 depending on problem complexity.
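The knobs above map directly onto sklearn's `DecisionTreeClassifier`. A minimal sketch on synthetic data (the specific hyperparameter values are illustrative, not prescriptions):

```python
# Pre-pruning sketch: cap depth, require a minimum leaf size, and demand
# a minimum impurity decrease before any split is made.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Pre-pruned tree: early stopping via three of the knobs listed above.
pruned = DecisionTreeClassifier(
    max_depth=8,
    min_samples_leaf=10,          # roughly 1% of the training set here
    min_impurity_decrease=1e-3,
    random_state=0,
).fit(X_tr, y_tr)

print("full tree:  ", full.get_n_leaves(), "leaves, depth", full.get_depth())
print("pruned tree:", pruned.get_n_leaves(), "leaves, depth", pruned.get_depth())
print("full test acc:  ", round(full.score(X_te, y_te), 3))
print("pruned test acc:", round(pruned.score(X_te, y_te), 3))
```

Expect the pruned tree to be far smaller; whether test accuracy improves depends on how noisy the data is.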
Post-pruning (the "let it grow, then tidy up" approach)
Grow a full tree (or near-full), then remove subtrees whose removal improves or negligibly hurts validation performance.
Main algorithms:
- Cost-Complexity Pruning (CCP / weakest link pruning) — compute effective alphas and produce a sequence of subtrees. Use cross-validation to pick alpha.
- Reduced Error Pruning — replace a subtree with a leaf if doing so doesn't reduce validation accuracy.
Pseudocode (CCP, high level):
1. Grow full tree T0.
2. For each internal node, compute the alpha at which collapsing this node reduces R_alpha the least (the "weakest link").
3. Iteratively prune the weakest link to get sequence T0, T1, T2, ..., Tk.
4. Use cross-validation to choose the best Ti (i.e., choose alpha).
In sklearn, this is exposed via the ccp_alpha parameter and the cost_complexity_pruning_path method: you can compute the pruning path and then select alpha using cross-validation.
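The four pseudocode steps above translate almost line-for-line into sklearn. A hedged sketch on synthetic data:

```python
# Cost-complexity pruning: compute the weakest-link alpha sequence, then
# pick the best alpha by cross-validation (steps 1-4 of the pseudocode).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

# Steps 1-3: sklearn grows the full tree and computes the effective alphas.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree down to the root

# Step 4: choose alpha by cross-validation.
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]
best_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("chosen alpha:", best_alpha, "| leaves:", best_tree.get_n_leaves())
```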
Concrete example (mini scenario)
Imagine a tree with 100 leaves that perfectly fits training noise. Cost-complexity pruning evaluates collapsing nodes and finds that removing 20 leaves increases training error slightly but reduces R_alpha for moderate alpha. Cross-validation shows the model with 30 leaves has best test performance. Boom — simpler, faster, less dramatic at parties.
How pruning connects to kNN and SVM intuition
- kNN: local decisions based on neighborhoods. Trees partition space and then act uniformly inside partitions. Very small leaves = very local behavior (like small k), large leaves = more global.
- SVM / kernels: produce smooth boundaries. Ensembles (e.g., Random Forest) or many pruned trees averaged together can approximate smoother boundaries, reducing the axis-aligned jaggedness of single trees.
So: pruning reduces locality (variance) and, when combined with ensembling, gives you smoother, more robust decision boundaries similar in spirit to kernel smoothness but built from many piecewise-constant models.
Regularization beyond pruning — practical knobs across models
DecisionTreeClassifier / Regressor (sklearn): max_depth, min_samples_leaf, min_samples_split, min_impurity_decrease, max_leaf_nodes, ccp_alpha.
Ensembles:
- Random Forest: reduces variance — regularize via n_estimators, max_depth, max_features, min_samples_leaf.
- Boosting (XGBoost / LightGBM): reduces bias but can overfit; regularize via learning_rate (shrinkage), max_depth, n_estimators, reg_alpha (L1), reg_lambda (L2), subsample, colsample_bytree, min_child_weight.
Tip: boosting + deep trees = easy overfit. So prefer shallow trees (depth 3–8) and small learning rates (0.01–0.1) with early stopping.
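Those boosting knobs can be sketched with sklearn's GradientBoostingClassifier (XGBoost and LightGBM expose analogous parameters under the names listed above); the specific values here are illustrative defaults, not tuned settings:

```python
# Regularized boosting sketch: shrinkage, shallow trees, row subsampling,
# and early stopping on an internal validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,        # upper bound; early stopping picks the real count
    learning_rate=0.05,      # shrinkage
    max_depth=3,             # shallow trees, as recommended above
    subsample=0.8,           # stochastic gradient boosting
    n_iter_no_change=10,     # stop when validation score stalls for 10 rounds
    validation_fraction=0.1,
    random_state=0,
).fit(X_tr, y_tr)

print("boosting rounds actually used:", gb.n_estimators_)
print("test accuracy:", round(gb.score(X_te, y_te), 3))
```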
Quick comparison table
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Pre-pruning | When you need speed & memory | Simple, fast | May prematurely stop good splits |
| CCP / Post-pruning | When interpretability & optimal subtree matter | Often better final tree, principled | Needs full tree + CV, more compute |
| Ensemble regularization | When accuracy matters most | Great generalization (RF/Boost) | Less interpretable, many hyperparams |
Practical recipe — what I do in the lab (and you should try)
- Start simple: build a tree with conservative pre-pruning (e.g., max_depth=8; min_samples_leaf around 10% of samples for small datasets, 1–5% for mid-sized ones).
- If interpretability matters: grow full tree + CCP path, use cross-validation to choose ccp_alpha.
- If performance is king: try Random Forest (regularize with max_features, max_depth) or Gradient Boosting (shrinkage, shallow trees, early stopping).
- Use learning curves and validation curves to detect high variance (gap between train/test).
- When in doubt, ensemble — but save a pruned single tree for explanation.
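The "detect high variance" step of the recipe can be sketched with sklearn's validation_curve: sweep a complexity knob and watch the train/validation gap open up.

```python
# Validation-curve sketch: the train/validation gap widens as max_depth grows,
# which is the high-variance signal the recipe above tells you to look for.
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
depths = [2, 4, 8, 16, None]  # None = unlimited depth (fully grown tree)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f} val={va:.3f} gap={tr - va:.3f}")
```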
Common pitfalls & how to avoid them
- Relying solely on training accuracy: use cross-validation.
- Pruning without a validation set: you might prune the wrong branches.
- Overcomplicating the model with many tiny leaves: prefer ensembles for that complexity.
- Ignoring class imbalance: pruning decisions can be skewed — use class weights or balanced splits.
Closing — the elegant truth
Pruning and regularization convert a flamboyant memorizer into a reasonable, generalizing citizen. Pre-pruning is the therapist setting boundaries early; post-pruning is the exorcism after the tree devoured the dataset. Ensembles are your supportive community group therapy — they reduce variance and make performance stable.
Key takeaways:
- Control tree complexity via depth, leaves, and cost-complexity alpha.
- Use CCP for principled post-pruning; pre-pruning for efficiency.
- Combine pruning with ensemble strategies for best-of-both-worlds performance.
Final thought: a pruned tree is not just smaller — it tells a cleaner story. When your model can explain itself simply, you've probably learned the right lesson from the data.