Tree-Based Models and Ensembles
Learn interpretable trees and powerful ensembles like random forests and gradient boosting.
Decision Trees for Classification — The Chaotic, Charming Flowchart of Machine Learning
"If linear models are polite people who hold doors for you, decision trees are your friend's uncle: blunt, messy, and somehow always right at the backyard BBQ."
You just learned about regression trees earlier (nice work), and you've played with neighborhood and kernel ideas using k-NN and SVMs (remember how kernels warp space like taffy?). Now let’s pivot: classification trees cut the feature space into tidy-ish regions and hand out class labels like party favors. This builds naturally on your prior knowledge: where k-NN uses local neighborhoods and SVMs build global margins, decision trees partition space with axis-aligned rules that are easy to read and easy to overcook.
What is a classification tree, really?
- Definition (informal): A decision tree is a flowchart-like structure that repeatedly splits the feature space based on feature thresholds to produce leaf nodes that vote for class labels.
- Why use it? Interpretability, nonlinearity without kernels, handles mixed feature types, no need to scale features.
Quick anatomy
- Root node: All data starts here.
- Internal nodes: Questions like "Is feature X <= 7?".
- Leaves: Final class prediction (often with probabilities from class proportions).
Imagine a bouncer at a club who checks one attribute at a time: shoes, ID, mood, then lets you in (class = "allowed") or out (class = "nope"). That sequential questioning is a decision tree.
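The bouncer's sequential questioning really is a tree: each `if` is an internal node, each `return` is a leaf. A tiny sketch (the feature names and thresholds here are invented for illustration):

```python
def bouncer_tree(has_id: bool, shoe_score: int, mood_score: int) -> str:
    """Each `if` is an internal node; each return is a leaf."""
    if not has_id:          # root node: check ID first
        return "nope"
    if shoe_score <= 3:     # internal node: shoes too scruffy?
        return "nope"
    if mood_score <= 2:     # internal node: bad attitude?
        return "nope"
    return "allowed"        # leaf: passed every check

print(bouncer_tree(True, 7, 5))  # prints "allowed"
```

Learning a tree from data amounts to discovering these questions and thresholds automatically, which is what the splitting criteria below are for.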
How the tree decides where to split (entropy, Gini, information gain)
We choose splits that make children "purer" — i.e., nodes where most examples are of a single class.
- Gini impurity: 1 - sum(p_i^2). Fast to compute (no logarithm); the default in CART and scikit-learn.
- Entropy (used for information gain): -sum(p_i log2 p_i). Slightly more sensitive to changes in class probabilities; in practice the two criteria usually pick similar splits.
> "Pick a split that maximizes reduction in impurity — like choosing the argument that best settles an awkward family debate."
Quick comparison
| Criterion | Range | Notes |
|---|---|---|
| Gini impurity | 0 (pure) to 1 - 1/K | Fast to compute (no log); tends to isolate the most frequent class |
| Entropy | 0 (pure) to log2(K) | Tighter theoretical link to information theory |
Information gain = impurity(parent) - weighted_impurity(children), where each child's impurity is weighted by its share of the parent's samples.
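The formulas above fit in a few lines of Python. A minimal sketch (helper names are ours, not a library API):

```python
import math

def gini(p):
    """Gini impurity: 1 - sum(p_i^2) over class proportions p."""
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy in bits: -sum(p_i * log2(p_i)); 0*log(0) treated as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def information_gain(parent, children, sizes, impurity=gini):
    """impurity(parent) minus the size-weighted impurity of the children."""
    n = sum(sizes)
    weighted = sum(s / n * impurity(c) for c, s in zip(children, sizes))
    return impurity(parent) - weighted

# A 50/50 parent split into two pure children: maximal gain.
print(gini([0.5, 0.5]))                                      # prints 0.5
print(information_gain([0.5, 0.5], [[1.0], [1.0]], [5, 5]))  # prints 0.5
```

Note both impurities hit 0 on a pure node, which is why a pure node is a natural stopping condition.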
Pseudocode: Growing a tree (CART-style)

```
function grow_tree(data, depth=0):
    if stopping_condition(data, depth):
        return leaf(p_class = class_proportions(data))
    best_split = argmax over (feature, threshold) of information_gain(data, feature, threshold)
    left, right = partition(data, best_split)
    node = internal_node(best_split)
    node.left = grow_tree(left, depth + 1)
    node.right = grow_tree(right, depth + 1)
    return node
```
Stopping conditions include: max_depth, min_samples_split, min_samples_leaf, or pure node.
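The heart of the pseudocode is the argmax over thresholds. Here's a runnable sketch of that split search for a single numeric feature: try every midpoint between consecutive sorted values and keep the threshold with the highest Gini gain (toy code in our own names, not the CART paper's exact algorithm):

```python
def gini_counts(labels):
    """Gini impurity computed directly from a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(xs, ys):
    """Return (threshold, gain) of the best binary split on feature xs."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    parent = gini_counts(ys)
    best = (None, -1.0)  # (threshold, gain)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        t = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * gini_counts(left)
                    + len(right) * gini_counts(right)) / len(ys)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Classes separate cleanly between x=3 and x=4: the search finds midpoint 3.5.
print(best_threshold([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"]))
# prints (3.5, 0.5)
```

A real implementation does this for every feature, keeps the overall best, and recurses until a stopping condition fires.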
How classification trees differ from regression trees (quick tie-back)
- Regression trees minimize squared error (or MAE) and predict a numeric value in leaves (we covered this in "Decision Trees for Regression").
- Classification trees minimize impurity and predict class labels or class probabilities by leaf proportions.
So: same scaffolding, different loss function. The underlying algorithmic flow is your friend from the regression chapter — just swap the loss.
Probabilities from leaves and calibration
A leaf typically reports class probabilities as the fraction of training samples of each class in that leaf (e.g., 8/10 → 0.8). Compare this with SVMs, which have no native probabilities at all and need Platt scaling or isotonic regression bolted on:
- Trees give empirical probabilities (sample-based).
- They can be poorly calibrated, especially if leaves are small or when boosting ensembles are involved.
- You can calibrate outputs post hoc (Platt scaling, isotonic) — just like you would for SVM.
Question: If a leaf contains only one training sample, should we trust its probability? (Hint: nope — regularize or prune.)
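Leaf probabilities are just class proportions among the rows that landed in the leaf, and a one-sample leaf reports 100% confidence — exactly the over-trust the question warns about. Laplace smoothing is one simple hedge (the smoothing constant `alpha` is our illustrative choice, not something from the text):

```python
def leaf_proba(leaf_labels, classes, alpha=0.0):
    """Empirical class probabilities, optionally Laplace-smoothed by alpha."""
    n = len(leaf_labels)
    k = len(classes)
    return {c: (leaf_labels.count(c) + alpha) / (n + alpha * k)
            for c in classes}

# The 8/10 leaf from the text: raw empirical probabilities.
print(leaf_proba(["spam"] * 8 + ["ham"] * 2, ["spam", "ham"]))
# prints {'spam': 0.8, 'ham': 0.2}

# A one-sample leaf: smoothing pulls its "certainty" back toward 50/50.
print(leaf_proba(["spam"], ["spam", "ham"], alpha=1.0))
# roughly {'spam': 0.667, 'ham': 0.333}
```

Pruning and `min_samples_leaf` attack the same problem from the other direction, by preventing tiny leaves in the first place.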
Imbalanced classes & class weights (contrast with SVM)
You saw class weights in SVMs — same idea applies here.
- Set class_weight to penalize misclassifying minority classes so impurity measures account for importance.
- Or use sample_weight to give minority-class observations more influence (a softer alternative to actually upsampling them).
- Another trick: change splitting criterion to use weighted counts.
Quick analogy: SVM class weights move the margin; trees change what splits look important.
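Concretely, class weights multiply each class's count before computing proportions, so a minority class can dominate a node's impurity. A small sketch of a weighted Gini (our own helper, illustrating what `class_weight` does under the hood):

```python
def weighted_gini(labels, weights):
    """Gini impurity where each class's count is scaled by its weight."""
    total = sum(weights[c] for c in labels)
    impurity = 1.0
    for c in set(labels):
        p = labels.count(c) * weights[c] / total
        impurity -= p ** 2
    return impurity

node = ["majority"] * 9 + ["minority"]
# Unweighted: the node looks nearly pure, so nothing urges a split.
print(weighted_gini(node, {"majority": 1, "minority": 1}))  # about 0.18
# Weight the minority 9x: the node now looks maximally impure.
print(weighted_gini(node, {"majority": 1, "minority": 9}))  # 0.5
```

With the reweighting, splits that peel off minority samples suddenly produce large impurity reductions — which is precisely "changing what splits look important."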
Overfitting, pruning, and regularization
Trees are greedy and will happily memorize noise if you let them:
- Shallow trees = high bias, low variance.
- Deep trees = low bias, high variance.
Regularization knobs:
- max_depth
- min_samples_split
- min_samples_leaf
- max_features (limit per split)
- cost complexity pruning (post-pruning) — prune subtrees that don't reduce a penalized impurity
Pruning is your friend — it keeps the party from turning into a chaotic mosh pit.
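In scikit-learn (assuming it's installed), cost-complexity pruning is exposed through `ccp_alpha`: `cost_complexity_pruning_path` lists the candidate alphas, and refitting with a larger `ccp_alpha` removes subtrees whose impurity reduction doesn't justify their complexity. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A synthetic dataset; the fully grown tree will happily memorize it.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)  # candidate alpha values
print(f"{len(path.ccp_alphas)} candidate alphas")

# Refit with a nonzero alpha: weak subtrees get pruned away.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print(full.tree_.node_count, ">=", pruned.tree_.node_count)
```

In practice you'd pick `ccp_alpha` by cross-validation over the values in `path.ccp_alphas` rather than hard-coding one as we did here.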
Strengths and weaknesses (short and spicy)
- Strengths:
- Interpretable: you can read off the exact decision path that produced each prediction.
- Handles heterogeneous features well; many implementations also cope with missing values.
- No need for feature scaling.
- Weaknesses:
- Axis-aligned splits can require many nodes to capture oblique boundaries.
- High variance (unstable) — tiny data changes => big tree changes.
- Tendency to overfit without constraints.
This instability is why we often move to ensembles (random forests and boosting), which we'll cover next — they borrow the single-tree intuition and fix its mood swings.
Visual toy example (mental image)
Imagine classifying fruit by two features: softness and sweetness. A tree might first split on softness (soft vs firm), then within "soft" split on sweetness, producing leaves like: soft+sweet = banana, soft+not sweet = pear, firm+... = apple. That’s axis-aligned logic: one test at a time.
Try this quick thought experiment: how many axis-aligned splits would you need to approximate a circular decision boundary? (Answer: many — trees approximate with staircase steps; kernels warp space more elegantly.)
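You can watch the staircase form (assuming scikit-learn and NumPy are available): a depth-1 tree gets a single axis-aligned cut through a circular boundary, while a deeper tree stacks many cuts into steps that hug the circle:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Label points by whether they fall inside the unit circle.
rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 1.5, size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

shallow = DecisionTreeClassifier(max_depth=1).fit(X, y)  # one cut
deep = DecisionTreeClassifier(max_depth=8).fit(X, y)     # a staircase

acc_shallow = shallow.score(X, y)
acc_deep = deep.score(X, y)
print(f"depth 1: {acc_shallow:.2f}  depth 8: {acc_deep:.2f}")
```

The deep tree approximates the circle well, but only by spending many nodes — the kind of boundary a kernel method captures with far less machinery.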
Quick checklist when using classification trees
- Do you care about interpretability? If yes, trees are great.
- Is your dataset noisy? Consider pruning or using ensembles.
- Are classes imbalanced? Use class_weight or sample reweighting.
- Need calibrated probabilities? Consider post-hoc calibration.
- Want better performance? Try random forests or boosting.
Closing mic drop
Decision trees are the readable, greedy, slightly reckless sibling in the family of classifiers. They bridge the gap between local methods (k-NN), global margin models (SVM), and ensemble-based stability. Use them when you want explanations and quick, nonlinear decision-making — but remember: if your tree grows a personality, prune it before it starts giving out wrong life advice.
Key takeaways:
- Trees split to reduce impurity (Gini/Entropy).
- Leaves give class proportions as probabilities — may need calibration.
- Control depth and samples per leaf to avoid overfitting.
- Use class/sample weights for imbalanced data (like you did with SVM).
Next up: we'll glue many trees together (ensembles) to keep the interpretability but minimize the drama. Ready for a forest? 🌲🌲🌲