
Supervised Machine Learning: Regression and Classification
Impurity and Splitting Criteria — The Soul of a Decision Tree (and why it sometimes overreacts)

"A tree doesn't learn by staring at data — it learns by repeatedly asking, 'Is this split better than chaos?'"

You're already comfortable with Decision Trees for Regression and Classification (we covered structure, leaves, and predictions). You also remember kNN and kernels from Distance- and Kernel-Based Methods — those methods carve space with local neighborhoods or smooth decision boundaries. Trees take a different vibe: they greedily partition feature space into axis-aligned regions. The question at each node is: what is the best question to ask next? That question is decided by an impurity measure and its reduction.


Quick elevator pitch

  • Impurity: a scalar that quantifies how "mixed" the labels are in a node. Low impurity = purer node (ideally one label).
  • Splitting criterion: choose the feature and threshold that maximize the impurity decrease (i.e., parent impurity minus weighted child impurities).

Why this matters: the impurity you optimize determines how the tree grows, how it generalizes, and how it interacts with ensembles like Random Forests or Boosting.


The Usual Suspects: Classification

1) Gini impurity

  • Formula:
    Gini(p) = 1 - sum_{k} p_k^2

  • Interpretation: probability of mislabelling if you randomly pick a label according to class proportions.

  • Fast, smooth, and commonly used (scikit-learn default). Works great in practice.

2) Entropy (Information Gain)

  • Formula:
    Entropy(p) = - sum_{k} p_k log2(p_k)

  • Interpretation: average information (bits) required to encode the class label. Splitting that reduces entropy is information gain.

  • Often yields similar splits to Gini. Slightly more sensitive to changes near p=0 or 1.

3) Misclassification error (a blunt instrument)

  • Formula:
    Error(p) = 1 - max_k p_k

  • Interpretation: fraction of samples that would be misclassified if you labeled the node with the majority class.

  • Not great for splitting because it's less sensitive to changes in class proportions — usually used for pruning/inspection rather than split choice.
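
The three measures above can be sketched as plain Python, computing each from a node's class proportions (function names are ours, not from any particular library):

```python
from collections import Counter
from math import log2

def class_probs(labels):
    """Class proportions p_k for the samples in a node."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    """Gini(p) = 1 - sum_k p_k^2"""
    return 1.0 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    """Entropy(p) = -sum_k p_k log2(p_k), skipping empty classes."""
    return -sum(p * log2(p) for p in class_probs(labels) if p > 0)

def misclassification(labels):
    """Error(p) = 1 - max_k p_k"""
    return 1.0 - max(class_probs(labels))
```

For a node with labels `["a","a","a","b"]`, Gini is 1 − (0.75² + 0.25²) = 0.375 and misclassification error is 0.25; a pure node scores 0 under all three.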

How splits are evaluated (classification)

For a candidate split s that produces left child L and right child R:

ImpurityDecrease(s) = I(parent) - (|L|/|parent|) I(L) - (|R|/|parent|) I(R)

Choose s that maximizes this. Note the weighted child impurities — larger child matters more.
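
A minimal sketch of that formula, using Gini as the impurity (self-contained; names are illustrative):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right, impurity=gini):
    """I(parent) minus the size-weighted impurities of the two children."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))
```

A perfectly separating split of `["a","a","b","b"]` into `["a","a"]` and `["b","b"]` scores the maximum possible decrease here, 0.5 (the parent's Gini), since both children are pure.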


Regression impurity: variance and friends

For regression trees, impurity commonly measures spread of y-values inside a node.

  • Variance / MSE (most common):
    Var = (1/n) sum (y_i - mean_y)^2

  • MAE-based splits: use absolute deviations (median-based). Less common because optimizing absolute loss for split search is trickier.

Split quality: reduction in variance (parent variance minus weighted child variances). That tells the tree how well a split concentrates similar y-values.
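
The regression analogue is the same weighted-decrease formula with variance in place of Gini (a sketch, not any library's internals):

```python
def variance(ys):
    """Mean squared deviation of a node's y-values from their mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent, left, right):
    """Parent variance minus size-weighted child variances."""
    n = len(parent)
    return (variance(parent)
            - len(left) / n * variance(left)
            - len(right) / n * variance(right))
```

Splitting `[1, 1, 5, 5]` (variance 4.0) into `[1, 1]` and `[5, 5]` (both variance 0) yields a reduction of 4.0: the split concentrates similar y-values perfectly.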


Practical notes: how do we search splits?

  • For continuous features: sort the unique values and consider thresholds at midpoints between consecutive distinct values. Complexity: O(n log n) per feature for the sort, plus a linear scan over candidates; histogram and approximate-quantile methods are faster still.
  • For categorical features: either test one-vs-rest or, for small cardinality, test all subset splits (2^{k-1}-1 possibilities) — explosive if many categories. Often we encode or use heuristics.
  • Many libraries (LightGBM, XGBoost) use histogram/quantile binning to make this fast and memory friendly.

Pseudocode (very high-level):

for feature in features:
  sort unique feature values
  for threshold in midpoints:
    partition data into L, R
    compute impurity decrease
choose feature, threshold with max decrease
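
The pseudocode above, made runnable as a minimal exhaustive search with Gini impurity (naive O(n) partitioning per threshold rather than the incremental-count trick real libraries use):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """X: list of feature vectors, y: labels.
    Returns (feature_index, threshold, impurity_decrease)."""
    n, n_features = len(X), len(X[0])
    parent_imp = gini(y)
    best = (None, None, 0.0)
    for j in range(n_features):
        values = sorted(set(x[j] for x in X))
        # candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for x, yi in zip(X, y) if x[j] <= t]
            right = [yi for x, yi in zip(X, y) if x[j] > t]
            dec = (parent_imp
                   - len(left) / n * gini(left)
                   - len(right) / n * gini(right))
            if dec > best[2]:
                best = (j, t, dec)
    return best
```

On `X = [[1], [2], [8], [9]]` with `y = ["a", "a", "b", "b"]`, this picks the midpoint threshold 5.0 with the maximum decrease of 0.5.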

Connection to kNN/SVM (remember those neighbors and kernels?)

  • kNN and kernel methods make local decisions: kNN averages labels of neighbors; kernel methods weigh points by smooth kernels. Trees are adaptive partitioners — they carve the space into blocks where predictions are constant (or average). Think step functions vs smooth kernels.
  • Splitting criteria are a way to ask: "Where is the signal strong enough to create a new rectangular neighborhood?" While kNN asks: "Given a point, who are its neighbors?" Trees are global and axis-aligned; kernels & SVMs are global but can create curved boundaries.
  • Ensembles (Random Forests) restore some locality by averaging many adaptive partitions — kind of like smoothing many different piecewise-constant estimates, producing a function that behaves more like a localized smoother.

Which impurity should you pick? (Practical cheat sheet)

  • Classification (general): Gini — fast and effective in practice
  • Classification (probabilities/interpretability): Entropy — aligns with information theory; slightly different sensitivity
  • Classification (pruning only): Misclassification error — useful for final node label selection, not split search
  • Regression: Variance / MSE — directly minimizes squared prediction error; default in many libraries
  • Robust regression: MAE-based — for heavy tails and outliers, but harder to optimize during split search

Quick rule of thumb: if you don’t have a reason to change it, use Gini for classification and variance for regression.
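
In scikit-learn the rule of thumb maps directly to the `criterion` parameter (a sketch assuming scikit-learn is installed; the XOR toy data is ours):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: "gini" is the default; "entropy" is the main alternative.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)

# Regression: "squared_error" (variance reduction) is the default;
# "absolute_error" gives the MAE-based, median-oriented splits.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR pattern: needs depth >= 2 with axis-aligned splits
clf.fit(X, y)
```

Note that no single axis-aligned split helps on XOR, which is exactly the greedy-search gotcha discussed below; the tree still solves it by stacking two levels of splits.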


Subtleties, gotchas, and power plays

  • Greedy nature: a split that looks optimal locally may be suboptimal globally. Trees are greedy — hence ensembles (bagging, boosting) exist to fix that.
  • Imbalanced classes: impurity measures can be dominated by the majority class; consider class weights, stratified splits, or different metrics (AUC-based objectives in boosting).
  • Overfitting: maximizing impurity reduction can lead to tiny pure leaves with noisy decisions. Use depth limits, min_samples_leaf, or pruning.
  • Feature scale and axis-aligned splits: trees don't care about scaling but are limited to axis-aligned partitions. If your signal is diagonal in feature space (like XOR rotated), trees need many splits.

Ensembles and impurity — why tiny differences matter

  • Random Forests: build many deep trees (often unpruned) and average. If you use Gini vs Entropy, trees differ slightly — but averaging washes out many differences. More important: randomness (bootstrap + feature subsampling) reduces correlation.
  • Boosting (e.g., XGBoost, LightGBM): fits trees to residuals or gradients derived from a loss function (e.g., squared error for regression, logistic deviance for classification). Here, impurity reduction is measured relative to the loss gradient; the split criterion essentially maximizes the gain in objective (not always classic Gini/Entropy).
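
For intuition, the second-order split gain from the XGBoost paper can be sketched as follows, where g and h are sums of loss gradients and Hessians over each child (parameter names are ours):

```python
def boosting_split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost-style gain: 0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
    - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma.
    lam: L2 regularization on leaf weights; gamma: complexity cost per leaf."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma
```

This plays the role Gini or variance plays in a plain tree, but the "impurity" is now derived from the boosting objective, so the same split machinery serves any twice-differentiable loss.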

Closing: TL;DR + Actionable Takeaways

  • Impurity = how mixed a node is. Splitting = pick the split that maximizes impurity decrease.
  • For classification: Gini and Entropy are the stars; misclassification error is blunt and rarely used for split search.
  • For regression: variance (MSE) reduction is the standard impurity.
  • Trees = axis-aligned adaptive partitioners. Compare to kNN/kernel methods which are local or smooth. Ensembles blend many trees to reduce variance or bias.

Final power tip: If your tree wildly overfits (memorizes a training set like a judge memorizing trivia), soften it with min_samples_leaf, max_depth, or switch to an ensemble. And remember: the impurity choice nudges behavior, but complexity regularization and feature handling move mountains.

"Pick a sensible impurity, regularize hard, and when in doubt—ensemble it."
