
Supervised Machine Learning: Regression and Classification
Chapters

1. Foundations of Supervised Learning

2. Data Wrangling and Feature Engineering

3. Exploratory Data Analysis for Predictive Modeling

4. Train/Validation/Test and Cross-Validation Strategies

5. Regression I: Linear Models

6. Regression II: Regularization and Advanced Techniques

7. Classification I: Logistic Regression and Probabilistic View

8. Classification II: Thresholding, Calibration, and Metrics

9. Distance- and Kernel-Based Methods

10. Tree-Based Models and Ensembles

11. Handling Real-World Data Issues

12. Dimensionality Reduction and Feature Selection

13. Model Tuning, Pipelines, and Experiment Tracking

  • Grid Search and Random Search
  • Bayesian Optimization Basics
  • Successive Halving and Hyperband
  • Early Stopping and Warm Starts
  • Hyperparameter Spaces and Priors
  • Pipeline Composition and Caching
  • ColumnTransformers for Heterogeneous Data
  • Custom Transformers and Estimators
  • Cross-Validated Pipelines
  • Refit Strategies and Model Persistence
  • Reproducible Experiment Tracking
  • Logging and Metadata Management
  • Parallel and Distributed Tuning
  • Budget-Aware Optimization
  • Reusing and Sharing Artifacts

14. Model Interpretability and Responsible AI

15. Deployment, Monitoring, and Capstone Project


Model Tuning, Pipelines, and Experiment Tracking


Automate workflows, search hyperparameters, and track experiments reproducibly.


Successive Halving and Hyperband — Tournament Tuning for the Impatient Data Scientist

Imagine a gladiator arena where bad hyperparameter configurations get mercilessly eliminated after a few rounds, while the promising ones get more training time, more data, and maybe a pep talk. That arena is Successive Halving. Hyperband is the whole stadium.

You already know the classics: Grid Search is exhaustive but slow, Random Search is surprisingly effective, and Bayesian Optimization tries to be clever about where to search next. Successive Halving (SH) and Hyperband join the party by asking a delightfully practical question: what if we could give lots of candidates a little attention first, kill most of them quickly, and only invest heavily in the ones that actually look promising?

This lesson builds on our prior chat about Grid/Random Search and Bayesian Optimization, and it slots naturally after Dimensionality Reduction and Feature Selection: once you reduce the feature clutter and focus on signal, you still need to tune models efficiently. SH/Hyperband are perfect when training budgets are limited and you have many configurations to try.


TL;DR in one savage sentence

Successive Halving is a resource-aware early-stopping tournament for hyperparameter configurations. Hyperband runs many such tournaments with different aggressiveness settings to balance breadth vs depth. Use them when you can evaluate cheap low-fidelity approximations (few epochs, small subset of data, fewer features).


The core idea (aka the elegant cruelty)

  • Start with N hyperparameter configurations. Give each a small budget r (e.g., 1 epoch, 10% of the data).
  • Evaluate them all and keep only the top 1/eta fraction (eta is commonly 3 or 4). Multiply the survivors' budget by eta and repeat.
  • After about log_eta(N) rounds, you are left with a few well-trained candidates that got the lion's share of the compute.

Successive Halving = speed + focus. Hyperband = repeated SH with different initial N and r to hedge bets.

Why not just train everything fully? Because most configurations are garbage early. Spending 90% of your time on losers is inefficient. SH admits this and reallocates resources dynamically.
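The elimination loop above is only a few lines of code. Here is a minimal sketch, where `score(config, budget)` is a hypothetical stand-in for "partially train this configuration with this budget and return a validation score":

```python
def successive_halving(configs, score, r0=1, eta=3, max_budget=81):
    """Repeatedly evaluate, keep the top 1/eta, and grow the budget."""
    budget = r0
    while len(configs) > 1 and budget <= max_budget:
        # Rank configurations by their low-fidelity score at the current budget.
        ranked = sorted(configs, key=lambda c: score(c, budget), reverse=True)
        # Keep only the top 1/eta fraction (always at least one survivor).
        configs = ranked[:max(1, len(ranked) // eta)]
        budget *= eta  # survivors get eta times more resources next round
    return configs[0]
```

With 81 starting configs, eta=3, and r0=1 epoch, this runs rounds of 81, 27, 9, and 3 configs at 1, 3, 9, and 27 epochs respectively — a total cost of roughly four full trainings, instead of 81.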


Where does the budget come from? (aka the multi-fidelity trick)

The magic requires a cheap approximation of final performance. Typical fidelity choices:

  • Number of training epochs (most common for deep learning)
  • Fraction of training data (train on 10%, 30%, 100%)
  • Number of features (use feature selection or PCA to make smaller inputs)
  • Model size (train a smaller network first)

This dovetails nicely with Dimensionality Reduction and Feature Selection: you can treat the number of features as the fidelity. Start tuning on 20 features, then 100, then the full set.
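Using training-data fraction (the second fidelity above) as an example, a low-fidelity evaluation is just "fit on the first chunk of the training set". A minimal scikit-learn sketch — the toy dataset and the single hyperparameter `C` are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def fidelity_eval(C, fraction):
    """Validation accuracy of LogisticRegression(C=C) trained on a data fraction."""
    n = max(50, int(fraction * len(X_tr)))  # never train on a uselessly tiny set
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr[:n], y_tr[:n])
    return model.score(X_val, y_val)

cheap = fidelity_eval(C=1.0, fraction=0.1)  # fast, rough signal
full = fidelity_eval(C=1.0, fraction=1.0)   # slow, trustworthy
```

SH simply calls the cheap version on everyone and reserves the expensive version for survivors.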


Hyperband: bracketed tournaments

Hyperband runs multiple Successive Halving tournaments with different starting points. Intuition: one SH run with a very large N and tiny r explores widely but shallowly; one with a small N and large r explores deeply. Hyperband runs both, and everything in between.

Key parameters: total max resource R per configuration, and the reduction factor eta.

Pseudocode (simplified):

for s in reversed(range(0, s_max+1)):   # different brackets
    n = ceil((s_max+1)/(s+1) * eta**s)  # configs to sample
    r = R * eta**(-s)                   # initial budget
    run_successive_halving(n, r)

And run_successive_halving basically repeats: evaluate n configs with budget r, keep the top floor(n/eta), set r *= eta, and repeat.
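Fleshing the pseudocode out into runnable Python — here `sample_config` draws from your search space and `score` is a hypothetical partial-training function, exactly as in the SH sketch:

```python
import math
import random

def hyperband(sample_config, score, R=81, eta=3):
    """Hyperband: one successive-halving tournament per bracket s."""
    s_max = int(math.log(R, eta) + 1e-9)  # epsilon guards against float log error
    best, best_val = None, float("-inf")
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta**s / (s + 1))  # configs in this bracket
        r = R / eta**s                                 # initial budget
        configs = [sample_config() for _ in range(n)]
        for i in range(s + 1):                         # inner successive halving
            budget = r * eta**i
            ranked = sorted(configs, key=lambda c: score(c, budget), reverse=True)
            configs = ranked[:max(1, len(ranked) // eta)]
        val = score(configs[0], R)                     # bracket winner at full budget
        if val > best_val:
            best, best_val = configs[0], val
    return best
```

That really is the whole algorithm; everything else (logging, parallelism, checkpoint reuse) is engineering around it.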


How this compares to what you learned before

| Strategy | Strengths | When to use it | Relation to Bayesian / Grid / Random |
| --- | --- | --- | --- |
| Grid Search | Systematic, simple | Low-dim, budget OK | Very wasteful in high dimensions |
| Random Search | Cheap, surprisingly strong | High-dim, cheap evals | Baseline to beat |
| Bayesian Opt | Sample-efficient | Expensive evals, few dims | Learns a model of the loss surface |
| Successive Halving | Fast, resource-aware | Many configs, cheap low fidelity | Great with Random for proposals |
| Hyperband | Robust breadth-depth tradeoff | Huge search spaces, variable budgets | Can be combined with Bayesian (BOHB) |

Note: BOHB = Bayesian Optimization + Hyperband, marrying the best of both worlds.


Practical gotchas and tips

  • Choose eta = 3 or 4 as starting points. Larger eta is more aggressive (fewer survivors per round).
  • Define resource carefully. If using epochs, make sure final R is large enough to converge.
  • Control noise: if evaluating each config is noisy, use multiple seeds per evaluation or larger validation sets at higher budgets.
  • Logging matters: record budget, validation metric, training time. This is essential for experiment tracking and reproducibility.
  • Pipelines: put SH/Hyperband inside your pipeline's hyperparameter search step. If your pipeline has expensive preprocessing steps, consider caching or moving them outside the inner loop.
  • Feature selection interplay: you can make the fidelity be number of features. SH will test many feature counts cheaply and focus on promising ones.
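On the pipeline point: scikit-learn pipelines accept a `memory` argument that caches fitted transformers, so an expensive preprocessing step is not refit for every candidate that shares the same upstream parameters. A minimal sketch (step names and components are illustrative):

```python
from tempfile import mkdtemp

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ],
    memory=mkdtemp(),  # cache directory: fitted transformers are reused across fits
)
```

Caching pays off most when the search only varies the final estimator's parameters (`clf__*`), leaving the cached preprocessing untouched.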

Sklearn note: sklearn has HalvingGridSearchCV and HalvingRandomSearchCV, which implement successive-halving-style search within scikit-learn's API. For full Hyperband, look to libraries like Ray Tune (ASHA and Hyperband schedulers), Optuna (Hyperband pruner), or custom wrappers.
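A minimal HalvingRandomSearchCV example — note the experimental-enable import, which scikit-learn still requires; the toy dataset and parameter grid are illustrative:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=600, random_state=0)

search = HalvingRandomSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions={
        "max_depth": [2, 4, 8, None],
        "min_samples_split": [2, 5, 10],
    },
    factor=3,              # eta: keep roughly the top 1/3 each round
    resource="n_samples",  # fidelity: rows of training data
    random_state=0,
).fit(X, y)

print(search.best_params_)
```

The `resource` can also be any integer parameter of the estimator (e.g. `n_estimators`), which is the epochs-style fidelity in tree-ensemble form.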


When to prefer SH/Hyperband vs Bayesian

Ask yourself:

  • Do I have a cheap low-fidelity approximation? Yes -> SH/Hyperband is great.
  • Are evaluations extremely expensive and smooth, and I can only run a few? Bayesian might be better.
  • Want to try thousands of configs quickly? Hyperband wins.
  • Want sample efficiency and global modeling? Consider BOHB or ASHA + Bayesian suggestions.

Quick checklist before you run Hyperband

  1. Decide the fidelity (epochs, data fraction, features).
  2. Set R (max resource) to something that would yield convergence if you ran it fully.
  3. Pick eta (3 recommended) and compute s_max = floor(log_eta(R)).
  4. Use Random sampling for proposals across brackets, or plug in Bayesian proposals for BOHB.
  5. Log everything to your experiment tracker: budgets, metric trajectories, seeds, runtime.
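Step 3's arithmetic in code — for the textbook setting R = 81, eta = 3, the five brackets range from "many configs, tiny budget" to "few configs, full budget":

```python
import math

def hyperband_brackets(R, eta):
    """Return (n configs, initial budget r) per bracket, most exploratory first."""
    s_max = int(math.log(R, eta) + 1e-9)  # epsilon guards against float log error
    return [
        (math.ceil((s_max + 1) * eta**s / (s + 1)), R // eta**s)
        for s in range(s_max, -1, -1)
    ]

print(hyperband_brackets(81, 3))
# → [(81, 1), (34, 3), (15, 9), (8, 27), (5, 81)]
```

Reading the output: the first bracket gives 81 configs 1 unit of budget each, while the last trains just 5 configs at the full budget of 81.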

Closing mic drop

Successive Halving and Hyperband are the pragmatic, no-nonsense answer to modern hyperparameter tuning when time and compute are finite. They elevate the strategy from "train everything forever" to "train smart: quickly dismiss the unpromising and invest in the likely winners". When combined with good feature selection and clean pipelines, they can take you from random guesswork to surgical tuning — and your model gets better without you getting bored and breaking the coffee machine.

Final challenge question: imagine your fidelity is proportion of features kept after a supervised feature selection step. How would you design the resource schedule so that early rounds focus on coarse signal, and later rounds refine subtle interactions? (Hint: start small, then geometrically increase fraction until you hit the full feature set.)
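One possible schedule for the challenge (an illustrative sketch, not the only answer): grow the kept-feature fraction geometrically until it reaches the full set, so early rungs see coarse signal and the final rung can pick up subtle interactions.

```python
def feature_schedule(n_features, start_fraction=0.05, eta=3):
    """Feature counts per rung: geometric growth, capped at the full set."""
    fractions = []
    f = start_fraction
    while f < 1.0:
        fractions.append(f)
        f *= eta
    fractions.append(1.0)  # final rung always sees every feature
    return [max(1, round(fr * n_features)) for fr in fractions]

print(feature_schedule(100))  # → [5, 15, 45, 100]
```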

Happy bracket tuning. May your best hyperparameters survive the arena.
