Model Tuning, Pipelines, and Experiment Tracking
Automate workflows, search hyperparameters, and track experiments reproducibly.
Successive Halving and Hyperband — Tournament Tuning for the Impatient Data Scientist
Imagine a gladiator arena where bad hyperparameter configurations get mercilessly eliminated after a few rounds, while the promising ones get more training time, more data, and maybe a pep talk. That arena is Successive Halving. Hyperband is the whole stadium.
You already know the classics: Grid Search is exhaustive but slow, Random Search is surprisingly effective, and Bayesian Optimization tries to be clever about where to search next. Successive Halving (SH) and Hyperband join the party by asking a delightfully practical question: what if we could give lots of candidates a little attention first, kill most of them quickly, and only invest heavily in the ones that actually look promising?
This lesson builds on our prior chat about Grid/Random Search and Bayesian Optimization, and it slots naturally after Dimensionality Reduction and Feature Selection: once you reduce the feature clutter and focus on signal, you still need to tune models efficiently. SH/Hyperband are perfect when training budgets are limited and you have many configurations to try.
TL;DR in one savage sentence
Successive Halving is a resource-aware early-stopping tournament for hyperparameter configurations. Hyperband runs many such tournaments with different aggressiveness settings to balance breadth vs depth. Use them when you can evaluate cheap low-fidelity approximations (few epochs, small subset of data, fewer features).
The core idea (aka the elegant cruelty)
- Start with N hyperparameter configurations. Give each a small budget r (e.g., 1 epoch, 10% of data).
- Evaluate and keep only the top 1/eta (commonly eta=3 or 4). Increase the budget for the survivors and repeat.
- After roughly log_eta(N) rounds, only a handful of well-trained candidates remain, and they have received the lion's share of the compute.
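The elimination loop above can be sketched in a few lines of Python. This is a toy illustration: `evaluate` stands in for "train this configuration with this budget and return a validation score", and the fake scorer simply peaks at a learning rate of 0.1.

```python
import math

def successive_halving(configs, evaluate, budget=1, eta=3):
    """Eliminate all but the top 1/eta each round, growing the budget for survivors."""
    while len(configs) > 1:
        scores = {c: evaluate(c, budget) for c in configs}
        keep = max(1, len(configs) // eta)            # survivors this round
        configs = sorted(configs, key=scores.get, reverse=True)[:keep]
        budget *= eta                                 # survivors earn eta x more resource
    return configs[0]

# Toy run: candidates are learning rates; a real `evaluate` would train for
# `budget` epochs and return validation accuracy.
candidates = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 100.0]
best = successive_halving(candidates, lambda lr, b: -abs(math.log10(lr / 0.1)))
```

With eta=3, the nine candidates shrink to three after one round and to a single winner after two.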
Successive Halving = speed + focus. Hyperband = repeated SH with different initial N and r to hedge bets.
Why not just train everything fully? Because most configurations are garbage early. Spending 90% of your time on losers is inefficient. SH admits this and reallocates resources dynamically.
Where does the budget come from? (aka the multi-fidelity trick)
The magic requires a cheap approximation of final performance. Typical fidelity choices:
- Number of training epochs (most common for deep learning)
- Fraction of training data (train on 10%, 30%, 100%)
- Number of features (use feature selection or PCA to make smaller inputs)
- Model size (train a smaller network first)
This dovetails nicely with Dimensionality Reduction and Feature Selection: you can treat number of features as the fidelity. Start tuning on 20 features, then on 100, then on full set.
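A geometric ladder of fractions (of data or of features) makes a natural fidelity schedule, since each rung costs about eta times the previous one. A minimal sketch, with illustrative defaults:

```python
def fidelity_schedule(min_frac=0.1, eta=3, max_frac=1.0):
    """Geometric ladder of fidelities: each rung costs ~eta x the previous one."""
    schedule, frac = [], min_frac
    while frac < max_frac:
        schedule.append(round(frac, 4))
        frac *= eta
    schedule.append(max_frac)   # always finish at full fidelity
    return schedule

fidelity_schedule()  # -> [0.1, 0.3, 0.9, 1.0]
```

Interpret the fractions as "share of training data" or "share of features kept", whichever fidelity you chose.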
Hyperband: bracketed tournaments
Hyperband runs multiple Successive Halvings but with different starting points. Intuition: one SH run with a very large N and tiny r explores widely but shallowly; one run with small N and large r explores deeply. Hyperband runs both and everything in between.
Key parameters: total max resource R per configuration, and the reduction factor eta.
Pseudocode (simplified):
```python
import math

def hyperband(R, eta=3):
    """Run one Successive Halving bracket per exploration depth."""
    s_max = math.floor(math.log(R, eta))
    for s in reversed(range(s_max + 1)):                # different brackets
        n = math.ceil((s_max + 1) * eta**s / (s + 1))   # configs to sample
        r = R / eta**s                                  # initial budget
        run_successive_halving(n, r, eta)
```

And `run_successive_halving(n, r, eta)` basically repeats: evaluate the n configs with budget r, keep the top floor(n/eta), multiply r by eta, and stop once the budget reaches R.
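To make the brackets concrete, here is a small helper that just enumerates the (n, r) pair each bracket starts with. With the textbook values R=81, eta=3 it reproduces the classic bracket table from the Hyperband paper:

```python
import math

def hyperband_brackets(R=81, eta=3):
    """List (n_configs, initial_budget) for every bracket, most exploratory first."""
    s_max = 0
    while eta ** (s_max + 1) <= R:                   # s_max = floor(log_eta(R))
        s_max += 1
    return [
        (math.ceil((s_max + 1) * eta**s / (s + 1)),  # configs sampled in bracket s
         R / eta**s)                                 # starting budget per config
        for s in reversed(range(s_max + 1))
    ]

hyperband_brackets()  # -> [(81, 1.0), (34, 3.0), (15, 9.0), (8, 27.0), (5, 81.0)]
```

Notice the tradeoff: the first bracket tries 81 configs for 1 unit of budget each, while the last trains only 5 configs at the full budget of 81.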
How this compares to what you learned before
| Strategy | Strengths | When to use it | Relation to Bayesian / Grid / Random |
|---|---|---|---|
| Grid Search | Systematic, simple | Low-dim, budget ok | Very wasteful in high-dim |
| Random Search | Cheap, surprisingly strong | High-dim, cheap evals | Baseline to beat |
| Bayesian Opt | Sample-efficient | Expensive evals, few dims | Learns model of loss surface |
| Successive Halving | Fast, resource-aware | Many configs, cheap low-fidelity | Great with Random for proposals |
| Hyperband | Robust breadth-depth tradeoff | Huge search spaces, variable budgets | Can be combined with Bayesian (BOHB) |
Note: BOHB = Bayesian Optimization + Hyperband, marrying the best of both worlds.
Practical gotchas and tips
- Choose eta = 3 or 4 as starting points. Larger eta is more aggressive (fewer survivors per round).
- Define resource carefully. If using epochs, make sure final R is large enough to converge.
- Handle noise: if individual evaluations are noisy, average over multiple seeds per config, or use larger validation sets at higher budgets.
- Logging matters: record budget, validation metric, training time. This is essential for experiment tracking and reproducibility.
- Pipelines: put SH/Hyperband inside your pipeline's hyperparameter search step. If your pipeline has expensive preprocessing steps, consider caching or moving them outside the inner loop.
- Feature selection interplay: you can make the fidelity be number of features. SH will test many feature counts cheaply and focus on promising ones.
Sklearn note: scikit-learn provides HalvingGridSearchCV and HalvingRandomSearchCV, which implement successive-halving-style search within its estimator API. For Hyperband, look to libraries like Ray Tune (ASHA and Hyperband schedulers), Optuna (HyperbandPruner), or custom wrappers.
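A sketch of the scikit-learn route (the experimental import is required to enable the halving estimators; the dataset and parameter ranges here are illustrative):

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import randint

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

search = HalvingRandomSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_distributions={
        "max_depth": randint(2, 12),
        "min_samples_split": randint(2, 20),
    },
    n_candidates=9,          # configurations entering the tournament
    factor=3,                # eta: keep the top 1/3 each round
    resource="n_samples",    # fidelity = training-set size
    random_state=0,
).fit(X, y)
```

Here the fidelity is the number of training samples, so early rounds score all nine candidates on small subsets before the survivors see more data.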
When to prefer SH/Hyperband vs Bayesian
Ask yourself:
- Do I have a cheap low-fidelity approximation? Yes -> SH/Hyperband is great.
- Are evaluations extremely expensive and smooth, and I can only run a few? Bayesian might be better.
- Want to try thousands of configs quickly? Hyperband wins.
- Want sample efficiency and global modeling? Consider BOHB or ASHA + Bayesian suggestions.
Quick checklist before you run Hyperband
- Decide the fidelity (epochs, data fraction, features).
- Set R (max resource) to something that would yield convergence if you ran it fully.
- Pick eta (3 recommended) and compute s_max = floor(log_eta(R)).
- Use Random sampling for proposals across brackets, or plug in Bayesian proposals for BOHB.
- Log everything to your experiment tracker: budgets, metric trajectories, seeds, runtime.
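As a concrete minimum, each evaluation can be logged as one flat record. A hand-rolled sketch using an in-memory CSV; in practice you would send the same fields to whatever experiment tracker you use:

```python
import csv
import io

# One row per (config, budget) evaluation: enough to reconstruct
# every elimination decision afterwards.
FIELDS = ["config_id", "bracket", "budget", "seed", "val_metric", "runtime_s"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()

# Hypothetical evaluation: config 7 in bracket 2, scored at budget 9.
writer.writerow({"config_id": 7, "bracket": 2, "budget": 9,
                 "seed": 0, "val_metric": 0.83, "runtime_s": 12.4})
```

Recording the budget alongside the metric is the key detail: without it you cannot tell a config that lost at full fidelity from one that was culled in round one.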
Closing mic drop
Successive Halving and Hyperband are the pragmatic, no-nonsense answer to modern hyperparameter tuning when time and compute are finite. They elevate the strategy from "train everything forever" to "train smart: quickly dismiss the unpromising and invest in the likely winners". When combined with good feature selection and clean pipelines, they can take you from random guesswork to surgical tuning — and your model gets better without you getting bored and breaking the coffee machine.
Final challenge question: imagine your fidelity is proportion of features kept after a supervised feature selection step. How would you design the resource schedule so that early rounds focus on coarse signal, and later rounds refine subtle interactions? (Hint: start small, then geometrically increase fraction until you hit the full feature set.)
Happy bracket tuning. May your best hyperparameters survive the arena.