Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Pruning and Regularization of Trees
Prune It Like It's Hot — Regularizing Trees with Sass
"A decision tree that remembers the entire training set is a tree that never learned to generalize — it's just really good at gossip."
Hook — quick reality check
Remember how we split on impurity (Gini, entropy) to greedily carve up feature space? And how decision trees for classification give us those crisp, axis-aligned regions that can sometimes look like Lego castles — perfect on the training set, tragic at test time? Good. You already know the problem: trees are powerful but prone to overfitting. Now we'll learn how to tame them with pruning and regularization — the horticulture and the discipline of the tree world.
This sits naturally after our coverage of splitting criteria and recent detours into kNN and SVM: kNN and kernels gave you smooth or local decision boundaries; a single decision tree gives you piecewise-constant partitions. Pruning + ensembles bridge the gap: they smooth, regularize, and keep your model from memorizing weird artifacts in the data.
What problem are we solving? (Not rhetorical — it's dramatic)
- Trees grow until every leaf is pure (or until a cutoff), which reduces training error but inflates variance.
- We want models that: generalize well, remain interpretable when possible, and don't overreact to noise.
In short: control complexity. Two families of tools do this:
- Pre-pruning (early stopping) — stop growth before the tree gets silly.
- Post-pruning (cost-complexity / reduced error) — grow first, then surgically remove branches.
Both are forms of regularization for decision trees.
Measuring complexity and the bias-variance tradeoff
Key complexity measures:
- Depth of the tree
- Number of leaves (terminal nodes)
- Number of splits
Too simple → high bias (underfit). Too complex → high variance (overfit). Think of alpha in pruning as the "responsibility tax" for complexity. Pay more tax (higher alpha) → simpler tree.
Mathematically, in cost-complexity pruning we minimize:
R_alpha(T) = R(T) + alpha * |T_leaves|
Where R(T) is the subtree error (e.g., misclassification or impurity-based cost), |T_leaves| is the number of leaves, and alpha >= 0 balances fit vs complexity.
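To make the tradeoff concrete, here is a tiny sketch of how the objective behaves. The (error, leaf-count) pairs are invented for illustration; the point is only that raising alpha shifts the minimizer toward smaller subtrees.

```python
# Toy illustration of the cost-complexity objective R_alpha(T) = R(T) + alpha * |T_leaves|.
# The (training error, leaf count) pairs below are made up for illustration.
def r_alpha(error, n_leaves, alpha):
    """Cost-complexity score: training error plus a per-leaf penalty."""
    return error + alpha * n_leaves

# Hypothetical subtrees from a pruning sequence: (training error, number of leaves).
subtrees = [(0.02, 100), (0.05, 30), (0.10, 10), (0.25, 2)]

for alpha in (0.0, 0.001, 0.01):
    best = min(subtrees, key=lambda t: r_alpha(t[0], t[1], alpha))
    print(f"alpha={alpha}: best subtree has {best[1]} leaves")
# As alpha grows, the "responsibility tax" per leaf makes smaller subtrees win.
```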
Pre-pruning (the "I have self-control" approach)
Shut the tree down early. Typical knobs:
- max_depth: maximum depth of the tree.
- min_samples_split / min_samples_leaf: minimum samples to split or to allow a leaf.
- min_impurity_decrease: require a minimum reduction in impurity to split.
- max_leaf_nodes: force a cap on leaves.
Pros:
- Fast, memory-efficient.
- Simple hyperparameters to search.
Cons:
- Greedy decisions may stop promising branches prematurely.
- Usually yields a less optimal subtree than post-pruning a fully grown tree.
Practical heuristics: set min_samples_leaf relative to dataset size (e.g., 1–5% of the samples for many tasks), and try max_depth around 5–15 depending on problem complexity.
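The knobs above map directly onto sklearn's `DecisionTreeClassifier`. A minimal sketch on synthetic data (the specific hyperparameter values are illustrative, not prescriptions):

```python
# Pre-pruning sketch: cap depth, require a minimum leaf size, and demand
# a minimum impurity decrease before any split is made.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Pre-pruned tree: early stopping via three of the knobs listed above.
pruned = DecisionTreeClassifier(
    max_depth=8,
    min_samples_leaf=10,          # roughly 1% of the training set here
    min_impurity_decrease=1e-3,
    random_state=0,
).fit(X_tr, y_tr)

print("full tree:  ", full.get_n_leaves(), "leaves, depth", full.get_depth())
print("pruned tree:", pruned.get_n_leaves(), "leaves, depth", pruned.get_depth())
print("full test acc:  ", round(full.score(X_te, y_te), 3))
print("pruned test acc:", round(pruned.score(X_te, y_te), 3))
```

Expect the pruned tree to be far smaller; whether test accuracy improves depends on how noisy the data is.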
Post-pruning (the "let it grow, then tidy up" approach)
Grow a full tree (or near-full), then remove subtrees whose removal improves or negligibly hurts validation performance.
Main algorithms:
- Cost-Complexity Pruning (CCP / weakest link pruning) — compute effective alphas and produce a sequence of subtrees. Use cross-validation to pick alpha.
- Reduced Error Pruning — replace a subtree with a leaf if doing so doesn't reduce validation accuracy.
Pseudocode (CCP, high level):
1. Grow full tree T0.
2. For each internal node, compute the alpha at which collapsing this node reduces R_alpha the least (the "weakest link").
3. Iteratively prune the weakest link to get sequence T0, T1, T2, ..., Tk.
4. Use cross-validation to choose the best Ti (i.e., choose alpha).
In sklearn, this is exposed via the ccp_alpha parameter and the cost_complexity_pruning_path method: you can compute the pruning path and then select alpha using cross-validation.
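The four pseudocode steps above translate almost line-for-line into sklearn. A hedged sketch on synthetic data:

```python
# Cost-complexity pruning: compute the weakest-link alpha sequence, then
# pick the best alpha by cross-validation (steps 1-4 of the pseudocode).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

# Steps 1-3: sklearn grows the full tree and computes the effective alphas.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree down to the root

# Step 4: choose alpha by cross-validation.
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]
best_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("chosen alpha:", best_alpha, "| leaves:", best_tree.get_n_leaves())
```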
Concrete example (mini scenario)
Imagine a tree with 100 leaves that perfectly fits training noise. Cost-complexity pruning evaluates collapsing nodes and finds that removing 20 leaves increases training error slightly but reduces R_alpha for moderate alpha. Cross-validation shows the model with 30 leaves has best test performance. Boom — simpler, faster, less dramatic at parties.
How pruning connects to kNN and SVM intuition
- kNN: local decisions based on neighborhoods. Trees partition space and then act uniformly inside partitions. Very small leaves = very local behavior (like small k), large leaves = more global.
- SVM / kernels: produce smooth boundaries. Ensembles (e.g., Random Forest) or many pruned trees averaged together can approximate smoother boundaries, reducing the axis-aligned jaggedness of single trees.
So: pruning reduces locality (variance) and, when combined with ensembling, gives you smoother, more robust decision boundaries similar in spirit to kernel smoothness but built from many piecewise-constant models.
Regularization beyond pruning — practical knobs across models
DecisionTreeClassifier / Regressor (sklearn): max_depth, min_samples_leaf, min_samples_split, min_impurity_decrease, max_leaf_nodes, ccp_alpha.
Ensembles:
- Random Forest: reduces variance — regularize via n_estimators, max_depth, max_features, min_samples_leaf.
- Boosting (XGBoost / LightGBM): reduces bias but can overfit; regularize via learning_rate (shrinkage), max_depth, n_estimators, reg_alpha (L1), reg_lambda (L2), subsample, colsample_bytree, min_child_weight.
Tip: boosting + deep trees = easy overfit. So prefer shallow trees (depth 3–8) and small learning rates (0.01–0.1) with early stopping.
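Those boosting knobs can be sketched with sklearn's GradientBoostingClassifier (XGBoost and LightGBM expose analogous parameters under the names listed above); the specific values here are illustrative defaults, not tuned settings:

```python
# Regularized boosting sketch: shrinkage, shallow trees, row subsampling,
# and early stopping on an internal validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,        # upper bound; early stopping picks the real count
    learning_rate=0.05,      # shrinkage
    max_depth=3,             # shallow trees, as recommended above
    subsample=0.8,           # stochastic gradient boosting
    n_iter_no_change=10,     # stop when validation score stalls for 10 rounds
    validation_fraction=0.1,
    random_state=0,
).fit(X_tr, y_tr)

print("boosting rounds actually used:", gb.n_estimators_)
print("test accuracy:", round(gb.score(X_te, y_te), 3))
```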
Quick comparison table
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Pre-pruning | When you need speed & memory | Simple, fast | May prematurely stop good splits |
| CCP / Post-pruning | When interpretability & optimal subtree matter | Often better final tree, principled | Needs full tree + CV, more compute |
| Ensemble regularization | When accuracy matters most | Great generalization (RF/Boost) | Less interpretable, many hyperparams |
Practical recipe — what I do in the lab (and you should try)
- Start simple: build a tree with conservative pre-pruning (e.g., max_depth=8; min_samples_leaf around 10% of samples for small datasets, 1–5% for mid-sized ones).
- If interpretability matters: grow full tree + CCP path, use cross-validation to choose ccp_alpha.
- If performance is king: try Random Forest (regularize with max_features, max_depth) or Gradient Boosting (shrinkage, shallow trees, early stopping).
- Use learning curves and validation curves to detect high variance (gap between train/test).
- When in doubt, ensemble — but save a pruned single tree for explanation.
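The "detect high variance" step of the recipe can be sketched with sklearn's validation_curve: sweep a complexity knob and watch the train/validation gap open up.

```python
# Validation-curve sketch: the train/validation gap widens as max_depth grows,
# which is the high-variance signal the recipe above tells you to look for.
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
depths = [2, 4, 8, 16, None]  # None = unlimited depth (fully grown tree)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f} val={va:.3f} gap={tr - va:.3f}")
```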
Common pitfalls & how to avoid them
- Relying solely on training accuracy: use cross-validation.
- Pruning without a validation set: you might prune the wrong branches.
- Overcomplicating the model with many tiny leaves: prefer ensembles for that complexity.
- Ignoring class imbalance: pruning decisions can be skewed — use class weights or balanced splits.
Closing — the elegant truth
Pruning and regularization convert a flamboyant memorizer into a reasonable, generalizing citizen. Pre-pruning is the therapist setting boundaries early; post-pruning is the exorcism after the tree devoured the dataset. Ensembles are your supportive community group therapy — they reduce variance and make performance stable.
Key takeaways:
- Control tree complexity via depth, leaves, and cost-complexity alpha.
- Use CCP for principled post-pruning; pre-pruning for efficiency.
- Combine pruning with ensemble strategies for best-of-both-worlds performance.
Final thought: a pruned tree is not just smaller — it tells a cleaner story. When your model can explain itself simply, you've probably learned the right lesson from the data.