Supervised Machine Learning: Regression and Classification
Chapters

1. Foundations of Supervised Learning
2. Data Wrangling and Feature Engineering
3. Exploratory Data Analysis for Predictive Modeling
4. Train/Validation/Test and Cross-Validation Strategies
5. Regression I: Linear Models
6. Regression II: Regularization and Advanced Techniques
7. Classification I: Logistic Regression and Probabilistic View
8. Classification II: Thresholding, Calibration, and Metrics
9. Distance- and Kernel-Based Methods
10. Tree-Based Models and Ensembles
11. Handling Real-World Data Issues
12. Dimensionality Reduction and Feature Selection
13. Model Tuning, Pipelines, and Experiment Tracking
  • Grid Search and Random Search
  • Bayesian Optimization Basics
  • Successive Halving and Hyperband
  • Early Stopping and Warm Starts
  • Hyperparameter Spaces and Priors
  • Pipeline Composition and Caching
  • ColumnTransformers for Heterogeneous Data
  • Custom Transformers and Estimators
  • Cross-Validated Pipelines
  • Refit Strategies and Model Persistence
  • Reproducible Experiment Tracking
  • Logging and Metadata Management
  • Parallel and Distributed Tuning
  • Budget-Aware Optimization
  • Reusing and Sharing Artifacts
14. Model Interpretability and Responsible AI
15. Deployment, Monitoring, and Capstone Project


Model Tuning, Pipelines, and Experiment Tracking


Automate workflows, search hyperparameters, and track experiments reproducibly.


Bayesian Optimization Basics — The Smart Hyperparameter Whisperer

"Grid search is a ritual. Random search is a party. Bayesian optimization is the friend who tells you what drink you actually like after two sips." — Your sarcastic TA

You already know the lay of the land: we tried Grid Search (painfully exhaustive) and Random Search (surprisingly effective) to tune models. You also learned how dimensionality reduction and feature selection can reduce redundancy and highlight signal. Now we upgrade: instead of blindly sampling the hyperparameter wilderness, we model the landscape and pick the most promising trails. That’s Bayesian optimization (BO) in a nutshell.


What is Bayesian Optimization? (Short, because you’re busy)

Bayesian optimization is a strategy for optimizing expensive, noisy black-box functions — like model validation accuracy as a function of hyperparameters — by building a cheap probabilistic surrogate model of the objective and using an acquisition function to decide the next hyperparameters to try.

  • Surrogate model: a probabilistic approximation (often a Gaussian Process) of how hyperparameters map to performance.
  • Acquisition function: an informed rule that balances exploration (try uncertain areas) and exploitation (try promising areas).

Why it’s useful: you get better results using far fewer model evaluations compared to grid or random search — ideal when training is costly (deep models, huge datasets, or nested CV).


Quick anatomy of the BO loop (aka how the magic happens)

  1. Choose a hyperparameter search space (continuous, integer, categorical, conditional).
  2. Evaluate the objective at a few initial points (random or Latin hypercube).
  3. Fit a surrogate model on the observed (params → performance) points.
  4. Use the acquisition function to pick the next hyperparameters.
  5. Train & evaluate the model with those hyperparameters; add result to dataset.
  6. Repeat until budget exhausted (time, iterations, or performance target).
# Pseudocode
D = []  # observed (hyperparams, score) pairs
for i in range(n_initial_points):
    x = sample_random()            # step 2: e.g. uniform or Latin hypercube
    y = expensive_eval(x)          # train & validate with these hyperparameters
    D.append((x, y))
while budget_remaining():
    surrogate.fit(D)               # step 3: refit the surrogate on all observations
    x_next = argmax_over_space(lambda x: acquisition(x, surrogate))  # step 4
    y_next = expensive_eval(x_next)
    D.append((x_next, y_next))     # step 5
best_x, best_y = max(D, key=lambda pair: pair[1])
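
To make the loop concrete, here is a minimal, self-contained sketch in pure NumPy (no BO library): a hand-rolled Gaussian-process surrogate with an RBF kernel and a UCB acquisition, maximizing a toy 1-D objective over a candidate grid. The function names and constants are illustrative, not any library's API.

```python
import numpy as np

def objective(x):
    """Toy 'expensive' objective with its peak at x = 0.3."""
    return -(x - 0.3) ** 2

def gp_posterior(X_obs, y_obs, X_cand, length=0.1, jitter=1e-6):
    """Exact GP posterior mean/std with an RBF kernel (noise-free observations)."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = k(X_obs, X_obs) + jitter * np.eye(len(X_obs))
    K_s = k(X_obs, X_cand)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y_obs
    var = 1.0 - np.einsum("ij,ik,kj->j", K_s, K_inv, K_s)
    return mu, np.sqrt(np.clip(var, 0.0, None))

rng = np.random.default_rng(0)
cand = np.linspace(0.0, 1.0, 201)        # candidate grid (the "search space")
X = list(rng.uniform(0, 1, size=3))      # step 2: a few random initial points
Y = [objective(x) for x in X]

for _ in range(10):                      # BO loop with a budget of 10 evaluations
    mu, sigma = gp_posterior(np.array(X), np.array(Y), cand)
    ucb = mu + 2.0 * sigma               # acquisition: mean + k * uncertainty
    ucb[np.isin(cand, X)] = -np.inf      # don't re-evaluate observed points
    x_next = cand[np.argmax(ucb)]
    X.append(x_next)
    Y.append(objective(x_next))

best_x = X[int(np.argmax(Y))]
print(f"best x: {best_x:.3f}")           # lands near the optimum at 0.3
```

Note how the loop spends its 10 evaluations: early picks chase high uncertainty, later picks cluster around the emerging peak. That is the exploration/exploitation trade-off the acquisition function encodes.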

Surrogate models — the nerdy heart

  • Gaussian Processes (GPs) — the classic choice. They give mean and variance predictions and are great for low-dimensional spaces (< ~20 dims). Elegant, but scale poorly with many observations (O(n^3)).
  • Random Forests / Tree-structured Parzen Estimators (TPE) — more robust for categorical and conditional spaces and scale better for many observations.
  • Neural network surrogates — e.g., Bayesian neural networks; more expressive and scalable for very large problems, at the cost of extra complexity.

Table: Surrogate at a glance

Surrogate            | Strengths                                     | Weaknesses
Gaussian Process     | Principled uncertainty quantification         | Scales poorly; struggles with high-dim categorical spaces
TPE (Hyperopt) / RF  | Handles categorical & conditional; scales     | Less principled, heuristic-ish uncertainty
NN-based             | Scales well; expressive                       | Complex; needs lots of data

Acquisition functions — choosing adventure vs. safety

  • EI (Expected Improvement) — picks points expected to beat the best-so-far by the most. Pretty common.
  • PI (Probability of Improvement) — greedy; picks points most likely to beat the best-so-far.
  • UCB (Upper Confidence Bound) — trades off mean + k * uncertainty; tunable exploration weight.
  • Thompson Sampling — sample from surrogate posterior then optimize that sample; naturally balances exploration/exploitation and is easy to parallelize.

Think of acquisition functions as your party-planning algorithm: do you try a drink that might be better (EI), pick the safest sure-thing (PI), meet new drinks because you’re curious (UCB), or randomly taste-test by following your fickle mood (Thompson)?
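
For concreteness, here are the three closed-form acquisitions from the list above, written with only the standard library. Here mu and sigma are the surrogate's posterior mean and standard deviation at a candidate point, and best is the best observed value so far (assuming maximization):

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI: how much we expect to beat the incumbent by."""
    if sigma == 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)

def probability_of_improvement(mu, sigma, best):
    """PI: the chance of beating the incumbent at all (greedier than EI)."""
    if sigma == 0.0:
        return float(mu > best)
    return norm_cdf((mu - best) / sigma)

def ucb(mu, sigma, kappa=2.0):
    """UCB: optimism in the face of uncertainty; kappa tunes exploration."""
    return mu + kappa * sigma

# A safe bet vs. a long shot: EI rewards the long shot's upside via sigma.
print(expected_improvement(0.9, 0.01, 0.88))
print(expected_improvement(0.7, 0.30, 0.88))
```

Notice that PI ignores how much you improve by, which is exactly why it is greedy; EI weights the improvement by its magnitude, and UCB lets you dial the curiosity with kappa.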


Practical considerations and gotchas

  • Start with a good search space. Bad priors (e.g., log-scale vs linear-scale mismatch) will waste budget. Use domain knowledge from feature selection/dimensionality reduction — e.g., fewer features may mean different regularization scales.
  • Conditional parameters. In pipelines you might have choices like: if model = X then tune these params; else tune those. Use BO frameworks that support conditional spaces (Optuna, SMAC, Hyperopt, scikit-optimize).
  • Categorical encoding. Treat categories explicitly, and use one-hot encoding only if your surrogate handles it well; GPs prefer continuous spaces.
  • Noisy evaluations. Use replicates or model noise in the surrogate. Consider smoothing via cross-validation or nested CV (BEWARE: expensive).
  • Parallel evaluations. Use batch BO or asynchronous strategies (Thompson sampling, batch EI). Classic GP-BO is inherently sequential, but many libraries support batching.
  • Budget & stopping. Predefine budget (time or evaluations). BO can overfit to noisy validation signals — use a holdout test set for final evaluation.
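
Conditional spaces are easy to get wrong, so here is a minimal, library-free sketch of the pattern (frameworks like Optuna express the same idea with suggest_* calls inside ordinary if branches; the parameter names below are made up for illustration):

```python
import random

def sample_config(rng):
    """Sample one configuration; some parameters only exist conditionally."""
    cfg = {"model": rng.choice(["logreg", "random_forest"])}
    if cfg["model"] == "logreg":
        # log-uniform C: regularization strength varies multiplicatively
        cfg["C"] = 10 ** rng.uniform(-4, 2)
    else:
        cfg["n_estimators"] = rng.randint(50, 500)
        cfg["max_depth"] = rng.choice([None, 4, 8, 16])
    if rng.random() < 0.5:          # preprocessing is itself a tunable choice
        cfg["pca_components"] = rng.randint(2, 20)
    return cfg

rng = random.Random(42)
configs = [sample_config(rng) for _ in range(100)]
```

A BO framework does the same thing, except the draws come from the surrogate-guided acquisition instead of plain random sampling; the conditional structure of the space is identical.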

Pipelines & BO — how to keep your life tidy

You learned pipeline design earlier — great. Treat the whole pipeline as part of the search space: preprocessing choices, dimensionality reduction steps, feature selection thresholds, and model hyperparameters can all be tuned jointly.

Tips:

  • Use conditional parameters: only tune PCA components when PCA is selected.
  • Keep deterministic pipeline steps consistent (seed random states) for reproducibility.
  • If you use feature selection under imbalance, include class-weight or sampling strategy as tunable parameters, not hard-coded.
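
As a concrete sketch of "only tune PCA components when PCA is selected": scikit-learn's GridSearchCV can swap an entire pipeline step for "passthrough" via a list of sub-grids (a BO framework would drive the same pipeline with a conditional space; the data here is synthetic):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = (X[:, 0] + 0.1 * rng.normal(size=80) > 0).astype(int)

pipe = Pipeline([
    ("reduce", PCA()),                      # placeholder; the grid decides
    ("clf", LogisticRegression(max_iter=1000)),
])

# Two sub-grids: with PCA (and its n_components tuned) or without it entirely.
param_grid = [
    {"reduce": [PCA()], "reduce__n_components": [2, 4], "clf__C": [0.1, 1.0]},
    {"reduce": ["passthrough"], "clf__C": [0.1, 1.0]},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The same joint-tuning principle applies to feature selection thresholds and class weights: make them steps or parameters of the pipeline, then let the search (grid, random, or BO) decide.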

Experiment tracking — the boring but heroic step

Log everything. Seriously.

What to store per trial:

  • Hyperparameter values
  • Validation metric(s) and training curves
  • Random seed, dataset split identifiers
  • Timing (train and eval time) and resource usage
  • Surrogate model metadata and acquisition function used
  • Pipeline configuration (preprocessing, feature selection choices)

Why: you’ll want to reproduce the best trial, analyze failed runs, and detect data leakage or overfitting. Tools: MLflow, Weights & Biases, Sacred, or even a proper database table if you love SQL.
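
Even without a tracking service, the per-trial fields listed above fit naturally into an append-only JSON-lines log. This is a minimal stand-in for MLflow/W&B with illustrative field names, using an in-memory buffer to stay self-contained:

```python
import io
import json
import time

def log_trial(fh, params, metric, seed, split_id):
    """Append one trial record as a single JSON line."""
    record = {
        "params": params,
        "val_metric": metric,
        "seed": seed,
        "split_id": split_id,
        "timestamp": time.time(),
    }
    fh.write(json.dumps(record) + "\n")

# In practice fh would be open("trials.jsonl", "a").
buf = io.StringIO()
log_trial(buf, {"C": 0.1, "pca": 4}, 0.91, seed=0, split_id="cv-fold-3")
log_trial(buf, {"C": 1.0, "pca": None}, 0.89, seed=0, split_id="cv-fold-3")

# Reading the log back makes "reproduce the best trial" a one-liner.
trials = [json.loads(line) for line in buf.getvalue().splitlines()]
best = max(trials, key=lambda t: t["val_metric"])
print(best["params"])   # → {'C': 0.1, 'pca': 4}
```

Append-only logs are also robust to crashed runs: every completed trial survives, which matters when a tuning job dies at iteration 47 of 50.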


Quick comparison: Grid vs Random vs Bayesian

Method                | Efficiency | Good for                                     | Notes
Grid Search           | Low        | Very low-dim, interpretable sweeps           | Explodes combinatorially
Random Search         | Medium     | Many dims with sparse important params       | Simple, surprisingly strong
Bayesian Optimization | High       | Expensive evaluations, few-to-moderate dims  | Best when each model eval is costly

Recommended workflow (practical cheat sheet)

  1. Define search space carefully (log-scale where needed; conditional parameters for pipelines).
  2. Warm-start with a few random trials (5–20) or previous experiment results.
  3. Choose surrogate (GP for continuous small dims; TPE/RF for mixed/large).
  4. Pick acquisition function (EI/UCB or Thompson for parallel).
  5. Run BO with a sensible budget; enable early-stopping to save time.
  6. Log everything to your experiment tracker and snapshot the pipeline code.
  7. Validate best candidates with nested CV or a fresh holdout.

Final kicker (why you’ll actually use BO)

Bayesian optimization turns hyperparameter tuning from guesswork into a data-informed exploration. It’s not magic; it’s smart resource allocation. When used with disciplined pipelines and rigorous experiment tracking, BO saves compute, reduces developer sweat, and makes your models genuinely better — especially when training is expensive and the search space is messy.

Takeaway: If grid search is a metronome and random search is a roulette wheel, Bayesian optimization is the detective who interrogates previous results and then picks the best suspect to test next.

Version notes: build on your grid/random intuition and your pipeline/feature-selection habits — BO is the logical upgrade once training runs cost real time and money.


Happy optimizing. Go run one experiment and then go outside — your computer deserves a break and so do you.
