Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Decision Trees for Regression
Decision Trees for Regression — The Tree That Predicts Your Rent (and Judges Your Life Choices)
"If kNN is the friendly neighbor who averages everyone’s opinion, and SVM is the strict bouncer carving a crisp boundary, decision trees are the extroverted realtor who divides the city into neighborhoods until prices look sensible."
You're coming from a world of distance- and kernel-based methods (kNN, SVM). Those approaches leaned on neighborhoods and smooth kernels to handle nonlinearity. Now we pivot to a different kind of local thinking: partition the feature space into chunks where the response behaves similarly, and then predict with a summary statistic (usually the mean). Welcome to regression trees.
Quick orientation (no rerun of old material)
You already know how locality (kNN) and margin/feature mappings (SVM) give nonlinear power. Trees use space partitioning instead: they split features into axis-aligned regions and fit a constant (or simple) model in each region. This makes them extremely interpretable, fast, and flexible — but also dramatic and sometimes a tad overconfident.
What is a regression tree, in plain and mildly theatrical English?
- Definition (short): A regression tree recursively splits the feature space into disjoint regions and predicts the average target value in each final region (leaf).
- Intuition: Imagine repeatedly slicing a pizza (feature space) along one topping at a time (features) until every slice tastes roughly the same (response variance low). The pizza chef? The CART algorithm.
The algorithm (CART for regression) — step-by-step
- Start with all training data in one node.
- For every candidate split (choose a feature and a cut value), compute how much the split reduces variance of the target.
- Pick the split that gives the largest variance reduction.
- Recurse on each child node until stopping criteria (max depth, min samples, or no improvement).
- The prediction at a leaf = mean(y) of training examples in that leaf.
The math behind the glamour: variance reduction
If node t has n_t observations and variance Var(t), and a split produces left child L and right child R, the impurity decrease (also called reduction in MSE) is:
Δ = Var(t) - (n_L/n_t) * Var(L) - (n_R/n_t) * Var(R)
We pick the split with the largest Δ. Simpler than some kernels, but surprisingly effective.
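To make the formula concrete, here is a tiny sketch that computes Δ for one candidate split of a toy node (the target values are made up for illustration):

```python
import numpy as np

def variance_reduction(y, y_left, y_right):
    """Impurity decrease: Var(t) - (n_L/n_t)*Var(L) - (n_R/n_t)*Var(R)."""
    n = len(y)
    return (np.var(y)
            - (len(y_left) / n) * np.var(y_left)
            - (len(y_right) / n) * np.var(y_right))

# Toy node: four targets, and a split that cleanly separates the two clusters.
y = np.array([10.0, 12.0, 30.0, 32.0])
delta = variance_reduction(y, y[:2], y[2:])
print(delta)  # 100.0: Var(t) = 101, each child has Var = 1, so 101 - 0.5*1 - 0.5*1
```

A split that mixed the clusters (say, `y[:3]` vs `y[3:]`) would score a much smaller Δ, which is exactly why CART would never pick it.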
Pseudocode (because we like order)
function build_tree(data, depth=0):
    if stopping_condition(data, depth):
        return leaf(mean(targets(data)))
    best_split = argmax over candidate splits of variance_reduction(data, split)
    left, right = split(data, best_split)
    node = internal_node(best_split)
    node.left = build_tree(left, depth + 1)
    node.right = build_tree(right, depth + 1)
    return node
Stopping conditions: max depth, min samples per leaf, or no split improves variance significantly.
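The pseudocode above can be turned into a minimal working implementation. This is a teaching sketch, not a production tree: it handles a single feature, uses a dict per node, and applies `max_depth` and `min_samples_leaf` as the stopping conditions.

```python
import numpy as np

def build_tree(X, y, depth=0, max_depth=3, min_samples_leaf=2):
    """Minimal CART regression tree on one feature; leaves predict mean(y)."""
    def best_split(X, y):
        best = None
        for thr in np.unique(X)[:-1]:  # candidate cut values
            mask = X <= thr
            if mask.sum() < min_samples_leaf or (~mask).sum() < min_samples_leaf:
                continue
            # Weighted child impurity (lower is better; equivalent to maximizing delta)
            score = np.var(y[mask]) * mask.sum() + np.var(y[~mask]) * (~mask).sum()
            if best is None or score < best[1]:
                best = (thr, score)
        return best

    split = best_split(X, y)
    if depth >= max_depth or split is None:
        return {"leaf": True, "value": float(np.mean(y))}
    thr, _ = split
    mask = X <= thr
    return {"leaf": False, "threshold": float(thr),
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples_leaf),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples_leaf)}

def predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]

# Two obvious clusters: the tree should split between x=3 and x=10.
X = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
tree = build_tree(X, y)
print(predict(tree, 2.5), predict(tree, 10.5))  # leaf means: 5.5 and 20.0
```

Extending this to many features just means looping the candidate-split search over columns; everything else stays the same.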
Why use regression trees? (Pros & snarky metaphors)
- Interpretability: You can follow the path: "If bedrooms >= 3 AND distance_to_subway < 1km THEN price ≈ $X". Like a decision checklist from your overbearing aunt.
- Handles mixed data types: Numeric and categorical features live together happily — no need to scale.
- Robust to outliers (to some degree): Leaves average targets, so single weird points can get isolated instead of poisoning a global model.
- Fast inference and little preprocessing.
And the drawbacks (the tree’s kryptonite)
- High variance: Small data changes can yield very different trees — unstable like a soap opera character.
- Axis-aligned splits only: Trees partition along single features at a time; they don’t create diagonal decision boundaries unless you combine many splits.
- Not smooth: Predictions jump from leaf to leaf — unlike kernel methods that produce smooth functions.
Comparing to kNN and SVR — quick table
| Property | kNN | SVR / Kernel methods | Regression Trees |
|---|---|---|---|
| Locality | Neighborhood averaging | Global via kernel transform | Local via space partitioning |
| Smoothness | Smooth (depends on k) | Smooth (depends on kernel) | Piecewise-constant (not smooth) |
| Interpretability | Low | Low-medium | High |
| Handles mixed features | Yes (but need distance choice) | Usually needs numeric & scaled | Yes, naturally |
| Robustness to noise | Sensitive (k small) | Controlled by C, epsilon | Can overfit unless pruned |
Ask yourself: do you want a smooth predictor or a readable rulebook? That determines much.
Practical knobs: controlling complexity and avoiding overfitting
Pre-pruning (early stopping): max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes.
Post-pruning (cost-complexity pruning): Grow a big tree, then prune using a complexity cost:
Cost(T) = RSS(T) + α * |leaves(T)|
where α is the complexity parameter (higher α = more pruning). sklearn exposes this as ccp_alpha.
Cross-validation: Choose pruning parameter (e.g., ccp_alpha) with CV to trade bias vs variance.
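Here is a sketch of that workflow, assuming scikit-learn is available: grow a full tree, get its candidate α values from the pruning path, and pick the one with the best cross-validated score. The synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=120)

# Candidate alphas come from the cost-complexity pruning path of a full tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(f"best ccp_alpha: {best_alpha:.5f}")
```

α = 0 keeps the full (overfit) tree; very large α prunes everything down to a single leaf, so the CV-optimal value usually sits somewhere in between.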
Feature importance & interpretability tools
- Impurity-based importance: Sum of the impurity decreases (variance reduction, for regression; Gini, for classification) over all splits that use a feature. Convenient but biased toward features with many possible split points.
- Permutation importance: Shuffle a feature and measure how much performance drops — a model-agnostic check.
- Partial dependence plots (PDP): Show average model prediction while varying a feature — helps understand marginal effects.
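A quick sketch contrasting the first two tools, assuming scikit-learn. Here only the first of three features actually drives the target, and both importance measures should agree on that:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
print("impurity-based:", tree.feature_importances_)

# Shuffle each feature and measure the score drop on the same data.
perm = permutation_importance(tree, X, y, n_repeats=10, random_state=0)
print("permutation:   ", perm.importances_mean)
```

When the two rankings disagree, suspect the impurity-based one: high-cardinality or noisy features can inflate it even when permuting them barely hurts performance.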
Handling real-life annoyances
- Missing values: CART can use surrogate splits (alternate features that mimic the primary split) or treat missing as a category.
- Categorical variables: Trees handle them naturally; many implementations do binary splits for categories.
- Heteroscedasticity & non-constant variance: Trees are flexible enough to isolate regions of different variance, but they don't model variance explicitly unless you augment them.
Worked example snapshot — predicting house prices
Imagine data: {num_bedrooms, sqft, distance_to_center}. The tree might first split on sqft > 1,200. In the left region (small houses), it might split on distance_to_center; in the right region (large houses), it might split on bedrooms. The leaf predictions are means of prices for examples reaching those leaves. No smoothing — just clear neighborhood-level rules.
Why might this beat kNN here? Because the tree gives simple rules that segment price drivers, rather than averaging across potentially irrelevant neighbors.
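A toy version of this example, assuming scikit-learn; the feature values and prices are invented, and `export_text` prints the fitted tree as the kind of readable rulebook described above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical listings: [bedrooms, sqft, distance_to_center (km)]
X = np.array([[2,  800, 5.0], [3, 1000, 1.0], [2,  900, 0.5],
              [4, 1500, 3.0], [3, 2000, 2.0], [5, 2500, 1.5]], dtype=float)
y = np.array([200_000, 350_000, 330_000, 450_000, 500_000, 650_000], dtype=float)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=1).fit(X, y)
rules = export_text(tree, feature_names=["bedrooms", "sqft", "distance_to_center"])
print(rules)  # if/else rules; each leaf shows the mean price of its region
```

Which feature gets the root split depends on the data; the point is that the output is a checklist a human can follow, not a distance computation over neighbors.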
When to choose a regression tree (short checklist)
- You want model interpretability/rules.
- You have mixed feature types and minimal preprocessing time.
- You need a fast, low-friction baseline.
- You’re okay with piecewise predictions or you’ll wrap the tree in an ensemble (coming up next).
Next step (teaser)
Single trees are great, but they’re unstable. Ensembles (Random Forests, Gradient Boosting) combine many trees to reduce variance and improve accuracy — the next thrilling act in our course.
Key takeaways (memorize these like tiny dramatic revelations)
- Regression trees partition feature space and predict leaf means. They're simple, interpretable, and can capture nonlinearity with axis-aligned splits.
- Splits are chosen by variance reduction (ΔMSE). Prediction = mean(y) in leaf.
- Main weaknesses: high variance and non-smooth predictions — but these are exactly why ensembles exist.
Parting thought: Trees give you readable, human-friendly rules. If you want clinical, smooth curves, reach for kernels. If you want a rulebook that fits your messy dataset like a glove (sometimes a patchwork glove), start with a tree and then ensemble it if it gets dramatic.