Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Decision Trees for Regression
Decision Trees for Regression — The Tree That Predicts Your Rent (and Judges Your Life Choices)
"If kNN is the friendly neighbor who averages everyone’s opinion, and SVM is the strict bouncer carving a crisp boundary, decision trees are the extroverted realtor who divides the city into neighborhoods until prices look sensible."
You're coming from a world of distance- and kernel-based methods (kNN, SVM). Those approaches leaned on neighborhoods and smooth kernels to handle nonlinearity. Now we pivot to a different kind of local thinking: partition the feature space into chunks where the response behaves similarly, and then predict with a summary statistic (usually the mean). Welcome to regression trees.
Quick orientation (no rerun of old material)
You already know how locality (kNN) and margin/feature mappings (SVM) give nonlinear power. Trees use space partitioning instead: they split features into axis-aligned regions and fit a constant (or simple) model in each region. This makes them extremely interpretable, fast, and flexible — but also dramatic and sometimes a tad overconfident.
What is a regression tree, in plain and mildly theatrical English?
- Definition (short): A regression tree recursively splits the feature space into disjoint regions and predicts the average target value in each final region (leaf).
- Intuition: Imagine repeatedly slicing a pizza (feature space) along one topping at a time (features) until every slice tastes roughly the same (response variance low). The pizza chef? The CART algorithm.
The algorithm (CART for regression) — step-by-step
- Start with all training data in one node.
- For every candidate split (choose a feature and a cut value), compute how much the split reduces variance of the target.
- Pick the split that gives the largest variance reduction.
- Recurse on each child node until stopping criteria (max depth, min samples, or no improvement).
- The prediction at a leaf = mean(y) of training examples in that leaf.
The math behind the glamour: variance reduction
If node t has n_t observations and variance Var(t), and a split produces left child L and right child R, the impurity decrease (also called reduction in MSE) is:
Δ = Var(t) - (n_L/n_t) * Var(L) - (n_R/n_t) * Var(R)
We pick the split with the largest Δ. Simpler than some kernels, but surprisingly effective.
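To make the formula concrete, here is a tiny sketch that computes Δ for one candidate split of a toy node (the target values are made up for illustration):

```python
import numpy as np

def variance_reduction(y, y_left, y_right):
    """Impurity decrease: Var(t) - (n_L/n_t)*Var(L) - (n_R/n_t)*Var(R)."""
    n = len(y)
    return (np.var(y)
            - (len(y_left) / n) * np.var(y_left)
            - (len(y_right) / n) * np.var(y_right))

# Toy node: four targets, and a split that cleanly separates the two clusters.
y = np.array([10.0, 12.0, 30.0, 32.0])
delta = variance_reduction(y, y[:2], y[2:])
print(delta)  # 100.0: Var(t) = 101, each child has Var = 1, so 101 - 0.5*1 - 0.5*1
```

A split that mixed the clusters (say, `y[:3]` vs `y[3:]`) would score a much smaller Δ, which is exactly why CART would never pick it.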
Pseudocode (because we like order)
function build_tree(data, depth=0):
    if stopping_condition(data, depth):
        return leaf(mean(targets(data)))
    best_split = argmax over candidate splits of variance_reduction(data, split)
    left, right = split(data, best_split)
    node = internal_node(best_split)
    node.left = build_tree(left, depth + 1)
    node.right = build_tree(right, depth + 1)
    return node
Stopping conditions: max depth, min samples per leaf, or no split improves variance significantly.
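The pseudocode above can be turned into a minimal working implementation. This is a teaching sketch, not a production tree: it handles a single feature, uses a dict per node, and applies `max_depth` and `min_samples_leaf` as the stopping conditions.

```python
import numpy as np

def build_tree(X, y, depth=0, max_depth=3, min_samples_leaf=2):
    """Minimal CART regression tree on one feature; leaves predict mean(y)."""
    def best_split(X, y):
        best = None
        for thr in np.unique(X)[:-1]:  # candidate cut values
            mask = X <= thr
            if mask.sum() < min_samples_leaf or (~mask).sum() < min_samples_leaf:
                continue
            # Weighted child impurity (lower is better; equivalent to maximizing delta)
            score = np.var(y[mask]) * mask.sum() + np.var(y[~mask]) * (~mask).sum()
            if best is None or score < best[1]:
                best = (thr, score)
        return best

    split = best_split(X, y)
    if depth >= max_depth or split is None:
        return {"leaf": True, "value": float(np.mean(y))}
    thr, _ = split
    mask = X <= thr
    return {"leaf": False, "threshold": float(thr),
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples_leaf),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples_leaf)}

def predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]

# Two obvious clusters: the tree should split between x=3 and x=10.
X = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
tree = build_tree(X, y)
print(predict(tree, 2.5), predict(tree, 10.5))  # leaf means: 5.5 and 20.0
```

Extending this to many features just means looping the candidate-split search over columns; everything else stays the same.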
Why use regression trees? (Pros & snarky metaphors)
- Interpretability: You can follow the path: "If bedrooms >= 3 AND distance_to_subway < 1km THEN price ≈ $X". Like a decision checklist from your overbearing aunt.
- Handles mixed data types: Numeric and categorical features live together happily — no need to scale.
- Robust to outliers (to some degree): Leaves average targets, so single weird points can get isolated instead of poisoning a global model.
- Fast inference and little preprocessing.
And the drawbacks (the tree’s kryptonite)
- High variance: Small data changes can yield very different trees — unstable like a soap opera character.
- Axis-aligned splits only: Trees partition along single features at a time; they don’t create diagonal decision boundaries unless you combine many splits.
- Not smooth: Predictions jump from leaf to leaf — unlike kernel methods that produce smooth functions.
Comparing to kNN and SVR — quick table
| Property | kNN | SVR / Kernel methods | Regression Trees |
|---|---|---|---|
| Locality | Neighborhood averaging | Global via kernel transform | Local via space partitioning |
| Smoothness | Smooth (depends on k) | Smooth (depends on kernel) | Piecewise-constant (not smooth) |
| Interpretability | Low | Low-medium | High |
| Handles mixed features | Yes (but need distance choice) | Usually needs numeric & scaled | Yes, naturally |
| Robustness to noise | Sensitive (k small) | Controlled by C, epsilon | Can overfit unless pruned |
Ask yourself: do you want a smooth predictor or a readable rulebook? That determines much.
Practical knobs: controlling complexity and avoiding overfitting
Pre-pruning (early stopping): max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes.
Post-pruning (cost-complexity pruning): Grow a big tree, then prune using a complexity cost:
Cost(T) = RSS(T) + α * |leaves(T)|
where α is the complexity parameter (higher α = more pruning). sklearn exposes this as ccp_alpha.
Cross-validation: Choose pruning parameter (e.g., ccp_alpha) with CV to trade bias vs variance.
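Here is a sketch of that workflow, assuming scikit-learn is available: grow a full tree, get its candidate α values from the pruning path, and pick the one with the best cross-validated score. The synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=120)

# Candidate alphas come from the cost-complexity pruning path of a full tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(f"best ccp_alpha: {best_alpha:.5f}")
```

α = 0 keeps the full (overfit) tree; very large α prunes everything down to a single leaf, so the CV-optimal value usually sits somewhere in between.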
Feature importance & interpretability tools
- Impurity-based importance: Sum of the impurity decreases (variance reduction, for regression; Gini, for classification) over all splits that use a feature. Convenient but biased toward features with many possible split points.
- Permutation importance: Shuffle a feature and measure how much performance drops — a model-agnostic check.
- Partial dependence plots (PDP): Show average model prediction while varying a feature — helps understand marginal effects.
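A quick sketch contrasting the first two tools, assuming scikit-learn. Here only the first of three features actually drives the target, and both importance measures should agree on that:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
print("impurity-based:", tree.feature_importances_)

# Shuffle each feature and measure the score drop on the same data.
perm = permutation_importance(tree, X, y, n_repeats=10, random_state=0)
print("permutation:   ", perm.importances_mean)
```

When the two rankings disagree, suspect the impurity-based one: high-cardinality or noisy features can inflate it even when permuting them barely hurts performance.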
Handling real-life annoyances
- Missing values: CART can use surrogate splits (alternate features that mimic the primary split) or treat missing as a category.
- Categorical variables: Trees handle them naturally; many implementations do binary splits for categories.
- Heteroscedasticity & non-constant variance: Trees are flexible enough to isolate regions of different variance, but they don't model variance explicitly unless you augment them.
Worked example snapshot — predicting house prices
Imagine data: {num_bedrooms, sqft, distance_to_center}. The tree might first split on sqft > 1,200. In the left region (small houses), it might split on distance_to_center; in the right region (large houses), it might split on bedrooms. The leaf predictions are means of prices for examples reaching those leaves. No smoothing — just clear neighborhood-level rules.
Why might this beat kNN here? Because the tree gives simple rules that segment price drivers, rather than averaging across potentially irrelevant neighbors.
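A toy version of this example, assuming scikit-learn; the feature values and prices are invented, and `export_text` prints the fitted tree as the kind of readable rulebook described above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical listings: [bedrooms, sqft, distance_to_center (km)]
X = np.array([[2,  800, 5.0], [3, 1000, 1.0], [2,  900, 0.5],
              [4, 1500, 3.0], [3, 2000, 2.0], [5, 2500, 1.5]], dtype=float)
y = np.array([200_000, 350_000, 330_000, 450_000, 500_000, 650_000], dtype=float)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=1).fit(X, y)
rules = export_text(tree, feature_names=["bedrooms", "sqft", "distance_to_center"])
print(rules)  # if/else rules; each leaf shows the mean price of its region
```

Which feature gets the root split depends on the data; the point is that the output is a checklist a human can follow, not a distance computation over neighbors.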
When to choose a regression tree (short checklist)
- You want model interpretability/rules.
- You have mixed feature types and minimal preprocessing time.
- You need a fast, low-friction baseline.
- You’re okay with piecewise predictions or you’ll wrap the tree in an ensemble (coming up next).
Next step (teaser)
Single trees are great, but they’re unstable. Ensembles (Random Forests, Gradient Boosting) combine many trees to reduce variance and improve accuracy — the next thrilling act in our course.
Key takeaways (memorize these like tiny dramatic revelations)
- Regression trees partition feature space and predict leaf means. They're simple, interpretable, and can capture nonlinearity with axis-aligned splits.
- Splits are chosen by variance reduction (ΔMSE). Prediction = mean(y) in leaf.
- Main weaknesses: high variance and non-smooth predictions — but these are exactly why ensembles exist.
Parting thought: Trees give you readable, human-friendly rules. If you want clinical, smooth curves, reach for kernels. If you want a rulebook that fits your messy dataset like a glove (sometimes a patchwork glove), start with a tree and then ensemble it if it gets dramatic.