Model Tuning, Pipelines, and Experiment Tracking
Automate workflows, search hyperparameters, and track experiments reproducibly.
Pipeline Composition and Caching — Compose, Cache, Repeat (But Don’t Break Reproducibility)
"Pipelines are the choreography; caching is the memory of the dancer. Both must be in sync, or you get a very confused performance." — Your slightly dramatic ML TA
You've already seen how dimensionality reduction and feature selection reduce redundancy and highlight signal, and you've wrestled with hyperparameter spaces and early stopping/warm starts. Now we stitch those ideas together into production-ready pipelines that are fast, composable, and not cursed by repeated expensive computations.
Why this matters (and why you'll care)
- When you chain feature engineering, selection, dimensionality reduction, and a learner, you want the result to be: reproducible, efficient, and safe from data leakage.
- When hyperparameter tuning or cross-validation runs the pipeline dozens or hundreds of times, expensive steps (text vectorization, PCA, kernel computations, large imputation, or complex featurizers) can dominate runtime.
- Caching saves time — but only if used thoughtfully. Misused caching can create stale artifacts, non-reproducible experiments, or misleading results.
Imagine running a grid search that takes 12 hours. You add caching and it becomes 20 minutes. That’s the kind of sorcery we’re aiming for—minus the hubris.
Pipeline composition: the rules of the road
Keep transformations inside the pipeline
- Why: Prevents data leakage. Feature selection or PCA must be fitted only on training folds during CV.
- Example: Use sklearn.pipeline.Pipeline and ColumnTransformer so selection and scaling are part of the CV process.
Make transformers deterministic or seedable
- Why: Caching relies on reproducible inputs; randomness without a fixed seed causes cache misses and nondeterministic behavior.
Small, single-purpose transformers win
- They are easier to test, cache, and parallelize.
Prefer stateless transforms where reasonable
- Stateless transforms (e.g., simple math formulas) are trivially cache-hit-friendly.
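The rules above can be sketched in a few lines. This is a minimal example (toy synthetic data, arbitrary `k` and fold count) showing why the transforms live inside the pipeline: each CV fold refits the scaler and selector on its own training split, so no information leaks from validation data.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),              # fitted per fold, not on all of X
    ('select', SelectKBest(f_classif, k=10)),  # ditto: selection sees only train folds
    ('clf', LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline inside each fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Had we fit `StandardScaler` or `SelectKBest` on the full dataset before splitting, the CV estimate would be optimistically biased.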
Connect with earlier lessons
- If you used feature selection or PCA from the previous topic, ensure they live in the pipeline—not applied to the whole dataset before CV.
- Hyperparameter tuning (priors/space) and warm-start techniques interact with caching: warm_start reduces recomputation across sequential fits; caching reduces redoing identical transforms.
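As a quick sketch of the warm-start side of that interaction (toy data, arbitrary tree counts): with `warm_start=True`, a random forest keeps its existing trees and only fits the new ones when `n_estimators` grows. Caching avoids redoing identical transforms; warm starting avoids redoing identical training work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X, y)                  # trains 50 trees
clf.n_estimators = 100
clf.fit(X, y)                  # trains only the 50 additional trees
print(len(clf.estimators_))    # 100
```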
Caching strategies (what to use, when, and why)
| Strategy | Typical use-case | Pros | Cons |
|---|---|---|---|
| Pipeline.memory (sklearn + joblib) | Cache transform outputs and fitted transformers on disk | Easy to plug in; picks up repeated computations | Disk I/O overhead; stale caches if code changes |
| joblib.Memory.cache decorator | Cache specific expensive functions (custom transform) | Very explicit control; works outside pipeline | Need hashable inputs; manage cache lifecycle |
| In-memory memoization | Small fast ephemeral caches | Fast, no disk I/O | Memory pressure; not persistent between processes |
| Manual on-disk snapshots | Save fitted transformers or features | Reproducible snapshots | Manual bookkeeping, risk of path/format mismatches |
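The in-memory memoization row can be sketched with `functools.lru_cache` (the function here is a stand-in for real work): fast and with no disk I/O, but process-local and gone when the process exits.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_score(n: int) -> int:
    return sum(i * i for i in range(n))  # stand-in for an expensive computation

expensive_score(10_000)                    # computed on the first call
expensive_score(10_000)                    # served from the in-memory cache
print(expensive_score.cache_info().hits)   # 1
```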
Hands-on: a canonical sklearn example
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from joblib import Memory
mem = Memory(location='cache_dir', verbose=1)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # expensive: tokenization + n-grams
    ('svd', TruncatedSVD(n_components=50, random_state=0)),  # PCA rejects sparse input; TruncatedSVD handles TF-IDF matrices
    ('clf', LogisticRegression())
], memory=mem)  # pipeline-level caching
# When you run GridSearchCV with this pipeline, the costly tfidf/svd results can be cached
Notes:
- Pipeline's memory caches fitted transformers and their outputs, keyed on each step's parameters and input data. It uses joblib under the hood.
- joblib hashes parameters and inputs, not transformer source code. If you change a transformer's implementation, it will happily reuse stale cached results, so clear cache_dir after code changes.
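Here is a self-contained sketch of that payoff (toy corpus and an arbitrary `C` grid): because only `clf__C` varies across candidates, the TF-IDF and SVD steps hit the disk cache instead of being recomputed for every fit.

```python
import tempfile
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

texts = ["good movie", "bad film", "great plot", "terrible acting"] * 10
labels = [1, 0, 1, 0] * 10

mem = Memory(location=tempfile.mkdtemp(), verbose=0)
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=2, random_state=0)),
    ('clf', LogisticRegression()),
], memory=mem)

# Only the classifier's C changes, so the cached tfidf/svd outputs are reused.
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```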
Important caveats — what trips people up
- Data leakage vs caching: Caching a fitted transformer computed on the entire dataset outside the pipeline is a leak. Keep fitting inside the pipeline.
- Different folds = different fitted transformers: In CV, fitting on different subsets means results generally differ. Cache hits across folds are rare for fitted objects unless the transform is deterministic and solely a function of the input array contents (rare across different train sets). So don’t expect magical cross-fold speedups for fit-dependent steps.
- Randomness breaks cache hits: If a transformer uses randomness (e.g., randomized SVD), set the seed. Otherwise caching sees different outputs and keeps recomputing.
- Warm starts vs caching: Warm starting a model (e.g., incremental warm_start=True) reduces retraining time across parameter sweeps for the same model. Caching is complementary but be cautious: a warm-started object might have internal state that makes caching its outputs unsafe.
- Parallelism & file locks: Using GridSearchCV with n_jobs>1 plus pipeline memory can lead to multiple processes hitting the disk cache. joblib manages locks, but I/O overhead can still be a bottleneck. Measure and tune.
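The randomness caveat is easy to demonstrate (random data, arbitrary component count): pinning `random_state` makes randomized SVD deterministic, so identical calls produce identical outputs and caching can actually hit.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.RandomState(0).rand(100, 40)

# Same seed, same input: identical outputs, so a cache lookup can succeed.
a = TruncatedSVD(n_components=5, random_state=42).fit_transform(X)
b = TruncatedSVD(n_components=5, random_state=42).fit_transform(X)
assert np.allclose(a, b)
```

Without `random_state`, the two calls would generally differ, and a content-addressed cache would recompute every time.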
When to cache vs when not to
- Cache when:
- A transform is expensive and deterministic given inputs + parameters (e.g., TF-IDF of the same raw texts).
- Multiple experiments reuse identical feature computation (e.g., feature pipelines reused across many models).
- Avoid caching when:
- The transformation depends on the training split in CV (you want fresh fits).
- The step mutates global state or is nondeterministic without seeds.
Quick question: Why do many people think caching always helps? Because they forget I/O overhead and freshness. Caching is a tool — not a miracle.
Advanced: caching custom transformers
If you have a heavy custom function, wrap it:
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin

memory = Memory('cache_custom')

@memory.cache
def heavy_featurize(X, param1):
    # expensive science here (placeholder for the real computation)
    transformed_X = X * param1
    return transformed_X

class HeavyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param1=1):
        self.param1 = param1

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return heavy_featurize(X, self.param1)
Notes: Inputs to cached functions must be hashable by joblib (numpy arrays are supported, but be careful with mutable objects). joblib tracks the decorated function's own source and invalidates when it changes, but it does not track helper functions it calls, so after deeper code changes delete the cache directory.
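For programmatic invalidation, joblib's `Memory.clear` wipes the cache directory contents; a minimal sketch (temporary cache location, trivial stand-in function):

```python
import tempfile
from joblib import Memory

memory = Memory(location=tempfile.mkdtemp(), verbose=0)

@memory.cache
def square(x):
    return x * x

square(3)                  # computed and stored on disk
memory.clear(warn=False)   # drop all cached results
print(square(3))           # recomputed after the clear
```

Calling this after a code change is more convenient than hunting down the cache directory by hand.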
Tie-in with experiment tracking and reproducibility
- Log your cache directory path in your experiment metadata.
- When using cached artifacts, save code version or git commit hash alongside cache to avoid stale re-use across code changes.
- If a cached result is reused in a run tracked by your experiment logger (MLflow, Weights & Biases), record the artifact origin so others can reproduce.
TL;DR — The Cheat-Sheet
- Put transforms inside pipelines to avoid leakage.
- Cache expensive, deterministic, stateless steps (or ones that are identical across runs).
- Seed randomness and be mindful of warm starts — they complicate caching.
- Clear caches when code changes and record cache metadata in your experiment logs.
Final micro-psycho-philosophical note: Pipelines are trust contracts. Caching is the memory that enforces speed — but memory that lies is worse than no memory. Always make your caching honest: document it, version it, and invalidate it when your code or data structures change.