Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Impurity and Splitting Criteria — The Soul of a Decision Tree (and why it sometimes overreacts)
"A tree doesn't learn by staring at data — it learns by repeatedly asking, 'Is this split better than chaos?'"
You're already comfortable with Decision Trees for Regression and Classification (we covered structure, leaves, and predictions). You also remember kNN and kernels from Distance- and Kernel-Based Methods — those methods carve space with local neighborhoods or smooth decision boundaries. Trees take a different vibe: they greedily partition feature space into axis-aligned regions. The question at each node is: what is the best question to ask next? That question is decided by an impurity measure and its reduction.
Quick elevator pitch
- Impurity: a scalar that quantifies how "mixed" the labels are in a node. Low impurity = purer node (ideally one label).
- Splitting criterion: choose the feature and threshold that maximize the impurity decrease (i.e., parent impurity minus weighted child impurities).
Why this matters: the impurity you optimize determines how the tree grows, how it generalizes, and how it interacts with ensembles like Random Forests or Boosting.
The Usual Suspects: Classification
1) Gini impurity
Formula:
Gini(p) = 1 - sum_{k} p_k^2

Interpretation: the probability of mislabelling a sample if you pick its label at random according to the node's class proportions.
Fast, smooth, and commonly used (scikit-learn default). Works great in practice.
2) Entropy (Information Gain)
Formula:
Entropy(p) = - sum_{k} p_k log2(p_k)

Interpretation: the average information (in bits) required to encode the class label. A split that reduces entropy yields information gain.
Often yields similar splits to Gini. Slightly more sensitive to changes near p=0 or 1.
3) Misclassification error (a blunt instrument)
Formula:
Error(p) = 1 - max_k p_k

Interpretation: the fraction of samples that would be misclassified if you labeled the node with its majority class.
Not great for splitting because it's less sensitive to changes in class proportions — usually used for pruning/inspection rather than split choice.
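As a sanity check, all three formulas are easy to compute directly from a vector of class proportions (a minimal sketch using NumPy; the function names here are just for illustration):

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits; 0 * log2(0) is treated as 0 by dropping zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def misclassification(p):
    """1 minus the majority-class proportion."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

# A 50/50 binary node is maximally impure under all three measures:
print(gini([0.5, 0.5]))               # 0.5
print(entropy([0.5, 0.5]))            # 1.0
print(misclassification([0.5, 0.5]))  # 0.5
```

A pure node (proportions like `[1.0, 0.0]`) scores 0 under all three, which is the behavior the split search exploits.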
How splits are evaluated (classification)
For a candidate split s that produces left child L and right child R:
ImpurityDecrease(s) = I(parent) - (|L|/|parent|) I(L) - (|R|/|parent|) I(R)
Choose s that maximizes this. Note the weighted child impurities — larger child matters more.
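Spelled out in code, the weighted-decrease formula looks like this (a hedged sketch; `proportions` and `impurity_decrease` are illustrative helper names, not a library API):

```python
import numpy as np

def gini(p):
    """Gini impurity from a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def proportions(y):
    """Class proportions of a label array."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def impurity_decrease(y_parent, y_left, y_right, impurity=gini):
    """Parent impurity minus size-weighted child impurities."""
    n = len(y_parent)
    return (impurity(proportions(y_parent))
            - len(y_left) / n * impurity(proportions(y_left))
            - len(y_right) / n * impurity(proportions(y_right)))

# A perfect split of a 50/50 parent recovers the full parent impurity (0.5):
parent = np.array([0, 0, 1, 1])
print(impurity_decrease(parent, parent[:2], parent[2:]))  # 0.5
```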
Regression impurity: variance and friends
For regression trees, impurity commonly measures spread of y-values inside a node.
Variance / MSE (most common):
Var = (1/n) sum_i (y_i - mean_y)^2

MAE-based splits: use absolute deviations around the median. Less common because optimizing absolute loss during split search is trickier and slower.
Split quality: reduction in variance (parent variance minus weighted child variances). That tells the tree how well a split concentrates similar y-values.
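The variance-reduction rule mirrors the classification case, with `np.var` standing in for the impurity function (again an illustrative sketch, not a library API):

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Parent variance minus size-weighted child variances."""
    n = len(y_parent)
    return (np.var(y_parent)
            - len(y_left) / n * np.var(y_left)
            - len(y_right) / n * np.var(y_right))

y = np.array([1.0, 1.0, 5.0, 5.0])
# Separating the low values from the high ones removes all within-node spread:
print(variance_reduction(y, y[:2], y[2:]))  # 4.0
```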
Practical notes: how do we search splits?
- For continuous features: sort unique values and consider thresholds at midpoints between consecutive distinct values. Complexity: O(n log n) per feature if sorted; faster with histograms and approximate quantiles.
- For categorical features: either test one-vs-rest or, for small cardinality, test all subset splits (2^{k-1}-1 possibilities) — explosive if many categories. Often we encode or use heuristics.
- Many libraries (LightGBM, XGBoost) use histogram/quantile binning to make this fast and memory friendly.
Pseudocode (very high-level):

```
for feature in features:
    sort unique feature values
    for threshold in midpoints between consecutive values:
        partition data into L, R
        compute impurity decrease
choose the (feature, threshold) pair with the maximum decrease
```
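Assuming Gini impurity and continuous features, that search can be written as runnable Python (`best_split` is an illustrative helper, not a library function; real implementations sort once per feature and update class counts incrementally instead of re-partitioning at every threshold):

```python
import numpy as np

def gini_from_labels(y):
    """Gini impurity computed directly from a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive greedy search for the (feature, threshold) with max Gini decrease."""
    n, d = X.shape
    best = (None, None, -np.inf)  # (feature index, threshold, gain)
    parent_impurity = gini_from_labels(y)
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= t
            yl, yr = y[mask], y[~mask]
            gain = (parent_impurity
                    - len(yl) / n * gini_from_labels(yl)
                    - len(yr) / n * gini_from_labels(yr))
            if gain > best[2]:
                best = (j, t, gain)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
j, t, gain = best_split(X, y)
print(j, t, gain)  # 0 2.5 0.5 — the midpoint that separates the classes perfectly
```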
Connection to kNN/SVM (remember those neighbors and kernels?)
- kNN and kernel methods make local decisions: kNN averages labels of neighbors; kernel methods weigh points by smooth kernels. Trees are adaptive partitioners — they carve the space into blocks where predictions are constant (or average). Think step functions vs smooth kernels.
- Splitting criteria ask: "Where is the signal strong enough to justify carving out a new rectangular neighborhood?" kNN instead asks: "Given a point, who are its neighbors?" Tree boundaries are piecewise and axis-aligned; kernel methods and SVMs fit a single global function but can produce curved boundaries.
- Ensembles (Random Forests) restore some locality by averaging many adaptive partitions — kind of like smoothing many different piecewise-constant estimates, producing a function that behaves more like a localized smoother.
Which impurity should you pick? (Practical cheat sheet)
| Task | Typical impurity | Why |
|---|---|---|
| Classification (general) | Gini | Fast and effective in practice |
| Classification (probabilities/interpretability) | Entropy | Aligns with information theory; slightly different sensitivity |
| Classification (pruning only) | Misclassification error | Useful for final node label selection, not split search |
| Regression | Variance / MSE | Directly minimizes squared prediction error; default in many libs |
| Robust regression | MAE-based | For heavy tails and outliers, but harder to optimize for split search |
Quick rule of thumb: if you don’t have a reason to change it, use Gini for classification and variance for regression.
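In scikit-learn the impurity is just a constructor argument, so switching criteria from this cheat sheet is a one-line change (criterion names as in recent scikit-learn versions, where `squared_error` replaced the older `mse` alias):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf_gini = DecisionTreeClassifier(criterion="gini")          # the default
clf_ent = DecisionTreeClassifier(criterion="entropy")        # information gain
reg_mse = DecisionTreeRegressor(criterion="squared_error")   # variance reduction
reg_mae = DecisionTreeRegressor(criterion="absolute_error")  # robust, but slower

# Tiny toy data: one feature, perfectly separable labels.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf_gini.fit(X, y)
print(clf_gini.predict([[2.5]]))  # [1]
```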
Subtleties, gotchas, and power plays
- Greedy nature: a split that looks optimal locally may be suboptimal globally. Trees are greedy — hence ensembles (bagging, boosting) exist to fix that.
- Imbalanced classes: impurity measures can be dominated by the majority class; consider class weights, stratified splits, or different metrics (AUC-based objectives in boosting).
- Overfitting: maximizing impurity reduction can lead to tiny pure leaves with noisy decisions. Use depth limits, min_samples_leaf, or pruning.
- Feature scale and axis-aligned splits: trees don't care about scaling but are limited to axis-aligned partitions. If your signal is diagonal in feature space (like XOR rotated), trees need many splits.
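To see the overfitting point in action, compare an unconstrained tree to a regularized one on noisy data (a sketch using scikit-learn; the parameter values and random seed are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
# Flip ~20% of labels to simulate noise the tree should NOT memorize.
noise = rng.random(200) < 0.2
y[noise] = 1 - y[noise]

deep = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                 random_state=0).fit(X, y)
# The unconstrained tree grows far deeper chasing noise; the regularized
# one stops at the depth cap.
print(deep.get_depth(), shallow.get_depth())
```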
Ensembles and impurity — why tiny differences matter
- Random Forests: build many deep trees (often unpruned) and average. If you use Gini vs Entropy, trees differ slightly — but averaging washes out many differences. More important: randomness (bootstrap + feature subsampling) reduces correlation.
- Boosting (e.g., XGBoost, LightGBM): fits trees to residuals or gradients derived from a loss function (e.g., squared error for regression, logistic deviance for classification). Here, impurity reduction is measured relative to the loss gradient; the split criterion essentially maximizes the gain in objective (not always classic Gini/Entropy).
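For the boosting case, the second-order split "gain" from the XGBoost paper can be sketched directly. Here `G` and `H` are the sums of first and second derivatives of the loss over each child's samples, `lam` is the L2 leaf-weight penalty, and the paper's extra gamma complexity term is omitted for brevity:

```python
def boosting_gain(G_left, H_left, G_right, H_right, lam=1.0):
    """XGBoost-style split gain: improvement in the regularized
    second-order objective from splitting one node into two leaves."""
    def score(G, H):
        # Objective reduction achievable by a single leaf with these stats.
        return G * G / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right))

# Children with opposite-sign gradient sums give a large gain, because
# splitting lets each leaf take a different correction direction:
print(boosting_gain(G_left=-4.0, H_left=4.0, G_right=4.0, H_right=4.0))  # 3.2
```

Notice that no Gini or entropy appears: the "impurity" here is defined entirely by the loss function being boosted.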
Closing: TL;DR + Actionable Takeaways
- Impurity = how mixed a node is. Splitting = pick the split that maximizes impurity decrease.
- For classification: Gini and Entropy are the stars; misclassification error is blunt and rarely used for split search.
- For regression: variance (MSE) reduction is the standard impurity.
- Trees = axis-aligned adaptive partitioners. Compare to kNN/kernel methods which are local or smooth. Ensembles blend many trees to reduce variance or bias.
Final power tip: If your tree wildly overfits (memorizes a training set like a judge memorizing trivia), soften it with min_samples_leaf, max_depth, or switch to an ensemble — and remember: impurity choice often nudges behavior but tuning complexity regularization and feature handling moves mountains.
"Pick a sensible impurity, regularize hard, and when in doubt—ensemble it."