Model Tuning, Pipelines, and Experiment Tracking
Automate workflows, search hyperparameters, and track experiments reproducibly.
Hyperparameter Spaces and Priors — The Chaotic Map You Learn to Love
"Pick your priors like you pick your coffee strength: too weak and nothing wakes up; too strong and you'll regret it halfway through the semester." — Your future hyperparameter-tuned self
You already know about Early Stopping, Warm Starts, Successive Halving, and Hyperband — those are our budget-savvy friends that help us stop training terrible models early and reuse work when sensible. You also just trimmed down your feature mess with dimensionality reduction and feature selection, reducing redundancy and improving signal. Great. Now we need to decide where the knobs live, what values they can take, and what beliefs (priors) we bring to the table when searching them. This is the art and science of hyperparameter spaces and priors.
Why hyperparameter spaces and priors matter
- If you search a space that is too wide, you waste compute exploring absurd regions.
- If you search a space that is too narrow, you miss the glory zone of performance.
- If you pick the wrong distribution (prior), your optimizer (random, Bayesian, or bandit-style) may chase the wrong ghost.
Priors are not metaphysical; they are the probability distributions or sampling strategies we give the optimizer. Whether you're doing plain random search, Bayesian optimization, or launching a Hyperband run, the sampling policy is your prior belief about where good hyperparameters live.
Types of hyperparameters & canonical priors
Continuous (real) — e.g., learning rate, L2 penalty
- Use log-uniform for scale parameters (learning rates, regularization strengths) because multiplicative changes matter more than additive ones.
- Use normal or uniform if the parameter behaves linearly.
Integer — e.g., number of trees, depth, n_components (PCA)
- Sample integer values directly; when the range spans orders of magnitude (e.g., n_estimators from 10 to 5000), sample on a log scale and round to the nearest integer.
Categorical — e.g., optimizer = {sgd, adam}, activation = {relu, tanh}
- Use discrete uniform or informed weights if you have prior preference.
Conditional — e.g., if model = XGBoost then tune max_depth, otherwise tune hidden_layers
- Express conditions explicitly in your search space; many optimizers support nested/conditional spaces.
Quick reference table — common hyperparams and recommended priors
| Hyperparameter | Typical domain | Recommended prior/transformation | Why (intuition) |
|---|---|---|---|
| Learning rate (lr) | (1e-6, 1) | Log-uniform | Multiplicative effects; 1e-3 vs 1e-4 matters more than 0.01 vs 0.02 |
| L2 (alpha) | (1e-8, 10) | Log-uniform | Regularization strength spans orders of magnitude |
| Number of trees (n_estimators) | [10, 5000] | Integer linear or log-scale | Often more trees help but cost scales; log-sampling finds small/large quickly |
| Max depth | [2, 50] | Integer uniform | Depth is discrete and often small values matter most |
| Dropout rate | [0, 0.9] | Beta(2,5) or clipped normal | Probability in [0,1]; Beta allows skew towards small values |
| PCA components (n_components) | [1, min(n_features, 300)] | Integer linear | Often linear search or informed by variance explained |
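As a quick sanity check on the dropout row, here is a sketch of drawing from the Beta(2, 5) prior with NumPy (an illustrative choice; any sampler with a Beta distribution works):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(2, 5) puts most of its mass on small values, matching the intuition
# that modest dropout rates are usually the better starting point.
samples = rng.beta(2.0, 5.0, size=10_000)

# Clip to the table's [0, 0.9] domain (Beta already lives in [0, 1],
# so this only trims the rare draw above 0.9).
samples = np.clip(samples, 0.0, 0.9)

print(samples.mean())           # ≈ 0.286, the Beta(2, 5) mean 2 / (2 + 5)
print((samples < 0.3).mean())   # well over half the draws fall below 0.3
```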
Priors in practice: examples & pitfalls
1) The log-uniform salvation
If you sample a learning rate from Uniform(0.0001, 0.1), roughly 90% of draws land above 0.01, even though the tiny values are often where the good models live. Use LogUniform(1e-6, 1e-1) instead: sample u ~ Uniform(log(a), log(b)) and set x = exp(u).
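In code, the trick is two lines — sample in log space, then exponentiate back:

```python
import math
import random

def log_uniform(a: float, b: float, rng: random.Random) -> float:
    """Sample x in [a, b] so that log(x) is uniform: equal probability
    per order of magnitude rather than per additive interval."""
    u = rng.uniform(math.log(a), math.log(b))
    return math.exp(u)

rng = random.Random(42)
draws = [log_uniform(1e-6, 1e-1, rng) for _ in range(100_000)]

# Each of the five decades [1e-6, 1e-5), ..., [1e-2, 1e-1) gets ~20% of the
# draws; under Uniform(1e-6, 1e-1), almost everything would land above 1e-5.
small = sum(d < 1e-5 for d in draws) / len(draws)
print(small)  # ≈ 0.2 (one decade out of five)
```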
2) Beware of naive integer encoding
Don't encode categorical choices as integers (0, 1, 2) and feed them to samplers or surrogate models that assume an ordering. Doing so implies an ordinal relationship that doesn't exist — 'adam' is not one step larger than 'sgd'.
3) Conditional spaces and wasted compute
If a hyperparameter only matters for one model or one stage of a pipeline, make it conditional. Running expensive evaluations for irrelevant parameters is a crime against the compute budget.
4) Your prior should reflect scale knowledge
If feature scaling or dimensionality reduction changed the signal (e.g., you reduced to 10 PCA components), that affects sensible ranges for parameters like max_features, hidden_layer sizes, or regularization — shrink ranges accordingly.
How priors interact with tuning methods you already know
Random Search: No model of the objective, so the prior is literally the sampling distribution. If you use log-uniform, random search will actually explore multiplicative scales properly.
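To make this concrete, here is a minimal pure-Python random-search sketch, with a hypothetical one-dimensional "validation loss" that is best near lr = 1e-3; the sampler function literally is the prior:

```python
import math
import random

rng = random.Random(0)

def objective(lr: float) -> float:
    # Hypothetical validation loss, minimized at lr = 1e-3.
    return (math.log10(lr) + 3.0) ** 2

def sample_uniform() -> float:
    # Additive prior: most mass lands in [1e-2, 1e-1].
    return rng.uniform(1e-6, 1e-1)

def sample_log_uniform() -> float:
    # Multiplicative prior: each decade gets equal mass.
    return math.exp(rng.uniform(math.log(1e-6), math.log(1e-1)))

def random_search(sampler, n_trials: int = 50) -> float:
    # Random search has no model of the objective; the sampler IS the prior.
    return min(objective(sampler()) for _ in range(n_trials))

print("uniform prior, best loss:    ", random_search(sample_uniform))
print("log-uniform prior, best loss:", random_search(sample_log_uniform))
```

With the log-uniform prior, trials spread evenly across the five decades, so some land near 1e-3; the additive prior wastes roughly 90% of its trials above 1e-2.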
Bayesian Optimization (BO): BO builds a surrogate model. The prior (initial distribution for random trials + any prior mean for the surrogate) influences where BO starts exploring. Use a few random draws that reflect your beliefs before BO goes greedy.
Successive Halving / Hyperband: These need brackets and resource allocation. If your prior thinks small models are good (e.g., low n_estimators / small hidden sizes), Hyperband will often keep those early and scale up promising ones. If your prior always samples huge models, you blow budget fast.
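The budget arithmetic behind those brackets is easy to sketch. This follows the bracket formulas from the Hyperband paper, with max resource R and reduction factor eta (R = 81 epochs is an illustrative choice):

```python
import math

def hyperband_brackets(R: int, eta: int = 3):
    """Enumerate Hyperband's brackets: each starts n configurations with
    budget r each, then repeatedly keeps the best 1/eta of configs while
    multiplying the per-config budget by eta."""
    s_max = int(math.log(R, eta) + 1e-9)   # guard against float rounding
    B = (s_max + 1) * R                    # budget assigned to each bracket
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil((B / R) * (eta ** s) / (s + 1))
        r = R // (eta ** s)                # assumes R is a power of eta
        brackets.append((s, n, r))
    return brackets

# R = 81 epochs, eta = 3: the aggressive bracket starts 81 configs at 1 epoch
# each; the conservative bracket runs just 5 configs at the full 81 epochs.
for s, n, r in hyperband_brackets(81):
    print(f"bracket s={s}: start {n} configs at {r} epochs each")
```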
Warm Starts: If your model supports warm-starting (e.g., incremental estimators or continuing boosting rounds), structure the search so that parameter changes that do not require retraining from scratch can be explored more cheaply. For example, increase n_estimators incrementally and warm-start to try different learning rates — but be careful: some hyperparams (like max_depth) cannot be trivially warm-started.
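Here is a sketch of that incremental pattern using scikit-learn's warm_start flag (shown with GradientBoostingRegressor on synthetic data; the same flag exists on several other estimators):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# With warm_start=True, raising n_estimators and refitting only adds the new
# trees instead of retraining the whole ensemble from scratch.
model = GradientBoostingRegressor(n_estimators=50, warm_start=True, random_state=0)
model.fit(X, y)
scores = {50: model.score(X, y)}

for n in (100, 200):
    model.set_params(n_estimators=n)
    model.fit(X, y)               # fits only the trees beyond the current count
    scores[n] = model.score(X, y)

# Caution: changing a parameter like max_depth this way would NOT reuse the
# earlier trees sensibly; warm-starting only suits params that extend the model.
print(scores)
```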
Practical recipe: design a sensible hyperparameter space (step-by-step)
- Start with domain knowledge: what ranges made sense during manual dev? Use that as center.
- Transform scale parameters into log-space (lr, alpha). Use log-uniform sampling.
- Use informative priors if you have evidence (e.g., prior runs, literature). Otherwise use weak but sensible priors (e.g., Beta for probabilities favoring small dropout).
- Express conditionals explicitly (model-specific params). Keep the global space compact.
- Run a short random-search pilot (20–50 trials). Inspect results with experiment tracking; update priors.
- Move to BO or bandit methods with the updated priors and leverage warm-starts if valid.
Example: a compact Optuna search space (Python)

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    model = trial.suggest_categorical('model', ['rf', 'xgb', 'mlp'])
    if model == 'rf':
        n_estimators = trial.suggest_int('rf__n_estimators', 50, 2000, log=True)
        max_depth = trial.suggest_int('rf__max_depth', 3, 40)
        max_features = trial.suggest_float('rf__max_features', 0.1, 1.0)
    elif model == 'xgb':
        lr = trial.suggest_float('xgb__lr', 1e-5, 1e-1, log=True)
        max_depth = trial.suggest_int('xgb__max_depth', 3, 12)
    else:  # 'mlp'
        lr = trial.suggest_float('mlp__lr', 1e-6, 1e-2, log=True)
        hidden = trial.suggest_int('mlp__hidden_units', 16, 1024, log=True)
        dropout = trial.suggest_float('mlp__dropout', 0.0, 0.9)
    ...  # build the chosen model, train, and return a validation score
```

Note: in Optuna, suggest_int and suggest_float both accept log=True (the older suggest_loguniform and suggest_uniform are deprecated). There is no built-in suggest_beta; to impose a Beta prior on dropout, draw the value yourself (e.g., with NumPy) or approximate the skew with suggest_float over a narrowed range.
Experiment tracking & reproducibility: record your priors
If your experiment tracking stores only hyperparameter values but not the prior/space definition, you won't be able to reproduce the search behavior later. Log:
- The entire search space definition (bounds, transforms)
- Seed(s) for pseudo-random samplers
- Sampling strategy (random, BO, Hyperband) and settings (brackets, eta)
This ties back into the earlier module on experiment tracking: priors are part of your experiment design and must be versioned.
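A minimal sketch of such a record as a plain JSON document (the field names here are illustrative, not a standard schema):

```python
import json

# Illustrative record: log the space and sampler settings, not just the winner.
experiment_record = {
    "search_space": {
        "lr": {"type": "float", "low": 1e-5, "high": 1e-1, "transform": "log"},
        "max_depth": {"type": "int", "low": 3, "high": 12},
        "optimizer": {"type": "categorical", "choices": ["sgd", "adam"]},
    },
    "sampler": {"strategy": "hyperband", "eta": 3, "max_resource": 81, "seed": 42},
    "best_params": {"lr": 3.2e-4, "max_depth": 6, "optimizer": "adam"},  # hypothetical result
}

# In practice this would be written next to the run's metrics; round-tripping
# through JSON shows the full search design is recoverable, not just best_params.
record_json = json.dumps(experiment_record, indent=2)
restored = json.loads(record_json)
print(restored["sampler"])
```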
Final takeaways — TL;DR for the lazy (and brilliant)
- Use log-scale for multiplicative parameters (learning rates, regularization). Treat probabilities with Beta, counts with integer ranges.
- Make your hyperparameter space conditional and compact. No free-floating meaningless knobs.
- Run a small pilot to learn a prior; then refine and escalate to BO/Hyperband. Record everything.
- Remember: pruning strategies (Successive Halving, Hyperband) and warm starts can massively change which priors are efficient. Think about budget when designing priors.
Parting wisdom: designing hyperparameter spaces is half science, half game design. If your space is a chaotic monster, even the best optimizer will learn to be afraid of you. Be kind, be informed, and log everything — your future self will thank you.