Supervised Machine Learning: Regression and Classification
Model Tuning, Pipelines, and Experiment Tracking


Automate workflows, search hyperparameters, and track experiments reproducibly.


Hyperparameter Spaces and Priors — The Chaotic Map You Learn to Love

"Pick your priors like you pick your coffee strength: too weak and nothing wakes up; too strong and you'll regret it halfway through the semester." — Your future hyperparameter-tuned self

You already know about Early Stopping, Warm Starts, Successive Halving, and Hyperband — those are our budget-savvy friends that help us stop training terrible models early and reuse work when sensible. You also just trimmed down your feature mess with dimensionality reduction and feature selection, reducing redundancy and improving signal. Great. Now we need to decide where the knobs live, what values they can take, and what beliefs (priors) we bring to the table when searching them. This is the art and science of hyperparameter spaces and priors.


Why hyperparameter spaces and priors matter

  • If you search a space that is too wide, you waste compute exploring absurd regions.
  • If you search a space that is too narrow, you miss the glory zone of performance.
  • If you pick the wrong distribution (prior), your optimizer (random, Bayesian, or bandit-style) may chase the wrong ghost.

Priors are not metaphysical; they are the probability distributions or sampling strategies we give the optimizer. Whether you're doing plain random search, Bayesian optimization, or launching a Hyperband run, the sampling policy is your prior belief about where good hyperparameters live.


Types of hyperparameters & canonical priors

  • Continuous (real) — e.g., learning rate, L2 penalty

    • Use log-uniform for scale parameters (learning rates, regularization strengths) because multiplicative changes matter more than additive ones.
    • Use normal or uniform if the parameter behaves linearly.
  • Integer — e.g., number of trees, depth, n_components (PCA)

    • Sample integer values directly over the range; when the range spans orders of magnitude (e.g., 10–5000 trees), sample a continuous log-space and round to the nearest integer.
  • Categorical — e.g., optimizer = {sgd, adam}, activation = {relu, tanh}

    • Use a discrete uniform distribution, or weighted choices if you have a prior preference.
  • Conditional — e.g., if model = XGBoost then tune max_depth, otherwise tune hidden_layers

    • Express conditions explicitly in your search space; many optimizers support nested/conditional spaces.
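
To make the four types concrete, here is a minimal sampling sketch with plain numpy; the parameter names and ranges are illustrative and not tied to any particular tuning library.

# Minimal sketch: sampling each hyperparameter type with numpy.
import numpy as np

rng = np.random.default_rng(42)

# Continuous, log-uniform: sample in log-space, then exponentiate.
lr = np.exp(rng.uniform(np.log(1e-6), np.log(1e-1)))

# Integer: sample uniformly over a discrete range (high bound is exclusive).
max_depth = rng.integers(2, 51)

# Categorical: sample the label itself, never an ordinal code.
optimizer = rng.choice(['sgd', 'adam'])

# Conditional: only draw model-specific parameters when they apply.
model = rng.choice(['rf', 'mlp'])
if model == 'rf':
    n_estimators = int(np.exp(rng.uniform(np.log(10), np.log(5000))))
elif model == 'mlp':
    dropout = rng.beta(2, 5)  # skewed toward small dropout values

Real tuning libraries wrap exactly these primitives; the point is that whatever distribution you sample from is your prior.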

Quick reference table — common hyperparams and recommended priors

| Hyperparameter | Typical domain | Recommended prior/transformation | Why (intuition) |
| --- | --- | --- | --- |
| Learning rate (lr) | (1e-6, 1) | Log-uniform | Multiplicative effects; 1e-3 vs 1e-4 matters more than 0.01 vs 0.02 |
| L2 (alpha) | (1e-8, 10) | Log-uniform | Regularization strength spans orders of magnitude |
| Number of trees (n_estimators) | [10, 5000] | Integer, linear or log-scale | Often more trees help but cost scales; log-sampling finds small/large quickly |
| Max depth | [2, 50] | Integer uniform | Depth is discrete and often small values matter most |
| Dropout rate | [0, 0.9] | Beta(2, 5) or clipped normal | Probability in [0, 1]; Beta allows skew towards small values |
| PCA components (n_components) | [1, min(n_features, 300)] | Integer, linear | Often linear search or informed by variance explained |

Priors in practice: examples & pitfalls

1) The log-uniform salvation

If you use Uniform(0.0001, 0.1) for the learning rate, roughly 90% of your samples land above 0.01, even though the tiny values are often the better ones. Use LogUniform(1e-6, 1e-1) instead. In code: sample u ~ Uniform(log(a), log(b)) and set x = exp(u).
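
A quick numerical check of that pitfall, assuming numpy and the ranges from the example above:

# Uniform vs log-uniform for a learning-rate range.
import numpy as np

rng = np.random.default_rng(0)
a, b = 1e-4, 1e-1

uniform_draws = rng.uniform(a, b, size=10_000)
log_uniform_draws = np.exp(rng.uniform(np.log(a), np.log(b), size=10_000))

# Half of the uniform draws land above ~0.05, while the log-uniform draws
# spread evenly across the decades 1e-4, 1e-3, 1e-2, 1e-1.
print(np.median(uniform_draws))       # ~0.05
print(np.median(log_uniform_draws))   # ~3e-3, the geometric middle of the range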

2) Beware of naive integer encoding

Don't encode categorical choices as integers (0, 1, 2) and feed them to samplers or surrogate models that assume an ordering. If you sample an integer code where a categorical value is expected, you accidentally imply an ordinal relationship (e.g., that 'adam' sits "between" 'sgd' and 'rmsprop') that does not exist.
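
A tiny sketch of the contrast, again assuming numpy; the optimizer names are just examples:

import numpy as np

rng = np.random.default_rng(1)

# Risky: an integer code suggests an ordering (sgd < adam < rmsprop) that isn't real,
# and surrogate models will happily interpolate between the codes.
optimizer_code = rng.integers(0, 3)

# Safer: sample the label itself and pass the string straight to the model constructor.
optimizer = rng.choice(['sgd', 'adam', 'rmsprop'])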

3) Conditional spaces and wasted compute

If a hyperparameter only matters for one model or one stage of a pipeline, make it conditional. Running expensive evaluations for irrelevant parameters is a crime against the compute budget.

4) Your prior should reflect scale knowledge

If feature scaling or dimensionality reduction changed the signal (e.g., you reduced to 10 PCA components), that affects sensible ranges for parameters like max_features, hidden_layer sizes, or regularization — shrink ranges accordingly.


How priors interact with tuning methods you already know

  • Random Search: No model of the objective, so the prior is literally the sampling distribution. If you use log-uniform, random search will actually explore multiplicative scales properly.

  • Bayesian Optimization (BO): BO builds a surrogate model. The prior (initial distribution for random trials + any prior mean for the surrogate) influences where BO starts exploring. Use a few random draws that reflect your beliefs before BO goes greedy.

  • Successive Halving / Hyperband: These need brackets and resource allocation. If your prior thinks small models are good (e.g., low n_estimators / small hidden sizes), Hyperband will often keep those early and scale up promising ones. If your prior always samples huge models, you blow budget fast.

  • Warm Starts: If your model supports warm-starting (e.g., incremental estimators or continuing boosting rounds), structure the search so that parameter changes that do not require retraining from scratch can be explored more cheaply. For example, grow n_estimators incrementally with warm-starting instead of refitting each candidate from zero (sketched below); but be careful: some hyperparameters (like max_depth) change the structure of every tree and cannot be trivially warm-started.
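
Here is a minimal warm-start sketch using scikit-learn's GradientBoostingClassifier on a synthetic dataset; the dataset and the n_estimators schedule are illustrative.

# warm_start=True keeps already-fitted trees; raising n_estimators and calling
# fit again only trains the additional boosting rounds.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, warm_start=True, random_state=0)
for n in (50, 100, 200, 400):
    model.set_params(n_estimators=n)
    model.fit(X_train, y_train)  # adds only the new trees after the first fit
    print(n, round(accuracy_score(y_val, model.predict(X_val)), 4))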


Practical recipe: design a sensible hyperparameter space (step-by-step)

  1. Start with domain knowledge: which ranges made sense during manual development? Use those as the center of your space.
  2. Transform scale parameters into log-space (lr, alpha). Use log-uniform sampling.
  3. Use informative priors if you have evidence (e.g., prior runs, literature). Otherwise use weak but sensible priors (e.g., Beta for probabilities favoring small dropout).
  4. Express conditionals explicitly (model-specific params). Keep the global space compact.
  5. Run a short random-search pilot (20–50 trials; see the sketch after this list). Inspect the results with your experiment tracker and update the priors.
  6. Move to BO or bandit methods with the updated priors and leverage warm-starts if valid.
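
A sketch of the pilot in step 5, using scikit-learn's RandomizedSearchCV with scipy distributions as the sampling prior; the model and ranges here are illustrative.

# The param distributions ARE the prior for random search.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

space = {
    'C': loguniform(1e-4, 1e2),        # log-uniform: regularization spans decades
    'solver': ['lbfgs', 'liblinear'],  # categorical: listed explicitly
}

pilot = RandomizedSearchCV(
    LogisticRegression(max_iter=2000),
    param_distributions=space,
    n_iter=30,          # a 20-50 trial pilot is enough to learn the landscape
    cv=5,
    random_state=0,     # record this seed with your experiment tracker
)
pilot.fit(X, y)
print(pilot.best_params_, round(pilot.best_score_, 4))

Note the seed: the same distributions plus the same random_state reproduce the same pilot, which is exactly what the tracking section below asks you to record.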

Example: a compact Optuna-style search space (sketch)

# Sketch using the Optuna API; train_and_evaluate is a placeholder for your own CV loop.
import optuna

def objective(trial):
    model = trial.suggest_categorical('model', ['rf', 'xgb', 'mlp'])
    if model == 'rf':
        n_estimators = trial.suggest_int('rf__n_estimators', 50, 2000, log=True)
        max_depth = trial.suggest_int('rf__max_depth', 3, 40)
        max_features = trial.suggest_float('rf__max_features', 0.1, 1.0)
    elif model == 'xgb':
        lr = trial.suggest_float('xgb__lr', 1e-5, 1e-1, log=True)
        max_depth = trial.suggest_int('xgb__max_depth', 3, 12)
    elif model == 'mlp':
        lr = trial.suggest_float('mlp__lr', 1e-6, 1e-2, log=True)
        hidden = trial.suggest_int('mlp__hidden_units', 16, 1024, log=True)
        # Optuna has no built-in Beta prior; a bounded float is the closest option.
        dropout = trial.suggest_float('mlp__dropout', 0.0, 0.9)
    return train_and_evaluate(trial.params)  # placeholder: your cross-validated score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

Note: if your search library lacks log-scaled sampling (Optuna supports log=True for both integers and floats, but not every library does), emulate it by sampling uniformly in log-space, exponentiating, and rounding for integers. A Beta prior usually has to be hand-rolled the same way on top of a plain [0, 1] float.


Experiment tracking & reproducibility: record your priors

If your experiment tracking stores only hyperparameter values but not the prior/space definition, you won't be able to reproduce the search behavior later. Log:

  • The entire search space definition (bounds, transforms)
  • Seed(s) for pseudo-random samplers
  • Sampling strategy (random, BO, Hyperband) and settings (brackets, eta)

This ties back into the earlier module on experiment tracking: priors are part of your experiment design and must be versioned.
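
As a minimal sketch, assuming you just want a JSON artifact stored next to your runs (the file name and dict layout are illustrative; most experiment trackers can store the same dict as run config or an artifact):

# Version the space itself, not just the sampled values.
import json

search_space = {
    'xgb__lr':        {'type': 'float', 'low': 1e-5, 'high': 1e-1, 'log': True},
    'xgb__max_depth': {'type': 'int',   'low': 3,    'high': 12},
    'model':          {'type': 'categorical', 'choices': ['rf', 'xgb', 'mlp']},
}
run_config = {
    'search_space': search_space,
    'sampler': {'name': 'random', 'seed': 42, 'n_trials': 30},
}

with open('run_config.json', 'w') as f:
    json.dump(run_config, f, indent=2)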


Final takeaways — TL;DR for the lazy (and brilliant)

  • Use log-scale for multiplicative parameters (learning rates, regularization). Treat probabilities with Beta, counts with integer ranges.
  • Make your hyperparameter space conditional and compact. No free-floating meaningless knobs.
  • Run a small pilot to learn a prior; then refine and escalate to BO/Hyperband. Record everything.
  • Remember: pruning strategies (Successive Halving, Hyperband) and warm starts can massively change which priors are efficient. Think about budget when designing priors.

Parting wisdom: designing hyperparameter spaces is half science, half game design. If your space is a chaotic monster, even the best optimizer will learn to be afraid of you. Be kind, be informed, and log everything — your future self will thank you.
