Machine Learning with scikit-learn
Build, tune, and evaluate models using scikit-learn pipelines with reproducible ML workflows.
Data Splits and CV Strategies — Practical scikit-learn Guide
"Good cross-validation is like a good exam: it tests for real understanding, not memorized answers."
You've already seen ML workflows and Pipelines in scikit-learn (nice — that means you know not to leak your scaler into the test set). You've also built statistical intuition for power, sample size, and confounding. Now let's connect those ideas to how we split data and validate models so your evaluation is honest, reproducible, and useful.
Why this matters (short answer)
- A bad split or CV strategy gives you overconfident performance estimates.
- A wrong strategy wastes time (tuning on leakage) or misses structure (groups, time-dependence).
- The right split respects the data-generating process, and the right CV reduces variance in estimates while controlling bias.
Key concepts (quick glossary)
- Train / Validation / Test: Train fits the model, validation (or CV) tunes hyperparameters, test is the final unseen evaluation. Keep test sacred.
- Cross-Validation (CV): splitting training data into folds to estimate generalization.
- Stratification: keeps class proportions similar across folds (important for imbalanced classes).
- Group splits: ensure data from the same group (user, hospital, city) doesn't leak across folds.
- Time-series split: respect temporal ordering; no future -> past leakage.
Practical scikit-learn strategies (what to use and when)
1) Always keep a holdout test set
- Use train_test_split(..., test_size=0.1 or 0.2, random_state=42) to create a final test set.
- Do not touch this test set until the end. It is your unbiased exam.
Code micro-example:
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.15, stratify=y, random_state=42
)
Why: hyperparameter tuning and repeated evaluation on the same set leads to optimistic estimates — remember our pipeline lesson on leakage.
2) Pick the CV flavor to match data structure
- Classification, balanced: StratifiedKFold(n_splits=5)
- Classification, imbalanced: StratifiedKFold or RepeatedStratifiedKFold; consider class-weighted models or resampling
- Grouped data (patients, users): GroupKFold(n_splits=k)
- Time series: TimeSeriesSplit(n_splits=k) (no shuffle!)
- Small n: LeaveOneOut (LOO) — high-variance, expensive; use cautiously
Examples:
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_group = GroupKFold(n_splits=5)
cv_time = TimeSeriesSplit(n_splits=5)
Micro explanation: If you have groups (e.g. multiple rows per patient), using simple KFold will let the model see other samples from the same patient in training — inflating performance.
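A quick sketch of what GroupKFold guarantees, using hypothetical per-patient data (the patient IDs and labels below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 rows from 4 patients, 2 rows per patient.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups):
    # No patient appears in both train and test within a fold.
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
    print("test patients:", sorted(set(groups[test_idx])))
```

With plain KFold the same assertion would fail whenever a patient's rows landed on both sides of a split.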
3) Nested CV for honest hyperparameter selection
- Outer loop: estimate generalization.
- Inner loop: pick hyperparameters.
Use nested CV when you need an unbiased estimate of the performance of the fully tuned model. GridSearchCV inside cross_val_score (or use cross_validate over an outer split) is the common pattern.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
pipeline = Pipeline([...])
param_grid = {...}
inner_cv = StratifiedKFold(5)
outer_cv = StratifiedKFold(5, shuffle=True, random_state=0)
grid = GridSearchCV(pipeline, param_grid, cv=inner_cv)
scores = cross_val_score(grid, X_trainval, y_trainval, cv=outer_cv)
Why: If you use CV to choose parameters and then report the CV score from that same procedure (without a nested structure or a held-out test set), the estimate is optimistically biased.
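Here is a minimal end-to-end sketch of the nested pattern on synthetic data (the dataset, pipeline steps, and grid values are illustrative, not from the original snippet):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}  # illustrative grid

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(pipe, param_grid, cv=inner_cv)
# Each outer fold re-runs the inner search from scratch, so the outer
# score reflects the whole tuning procedure, not one lucky split.
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

Note the `clf__C` naming convention: pipeline step name, double underscore, parameter name.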
4) Use Pipelines to avoid leakage
- Put scaling, imputation, encoding inside Pipeline before CV.
- Never fit transformers on the full dataset before splitting.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
('imputer', SimpleImputer()),
('scaler', StandardScaler()),
('clf', LogisticRegression())
])
Remember: preprocessing inside pipeline is fitted on the training fold only. This is where your pipeline knowledge ties directly into CV integrity.
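To see this in action, here is a runnable sketch (synthetic data with injected missing values, purely for illustration) showing the whole pipeline being cross-validated as one unit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject ~5% missing values

pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# cross_val_score clones and refits the whole pipeline on each fold,
# so imputation means and scaling statistics come from that fold's
# training rows only -- never from the validation rows.
scores = cross_val_score(pipeline, X, y, cv=cv)
print(scores.mean())
```

Fitting the imputer or scaler on all of X before splitting would leak validation-fold statistics into training, which is exactly the bug the pipeline prevents.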
5) Repeated CV and number of folds
- Typical choices: 5 or 10 folds. 5 is faster; 10 gives slightly lower bias at higher compute cost.
- RepeatedKFold or RepeatedStratifiedKFold can reduce variance of the estimate by averaging across random splits.
- For very small datasets, LOO is an option but be aware of high variance and model overfitting risk.
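A short sketch of repeated CV (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, random_state=0)

# 5 folds x 3 repeats = 15 scores; averaging over several random
# partitions damps the split-to-split noise of a single 5-fold run.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rskf)
print(len(scores), scores.mean())
```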
Evaluation & uncertainty: mean, std, and confidence
- Report mean CV score ± std. But beware: fold scores share training data and are therefore correlated, so the std across folds is not a valid standard error for the generalization estimate.
- For better uncertainty quantification, use repeated CV or bootstrap on the CV scores.
- Connect to your previous work on power and sample size: smaller datasets -> higher variance in CV estimates. If your CV scores jump around a lot, you need more data (or simpler models).
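The sample-size point can be checked directly. This toy sketch (synthetic data, arbitrary sizes) runs 5-fold CV at increasing n; the spread of fold scores tends to shrink as the dataset grows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative only: compare CV score spread at different sample sizes.
for n in (50, 200, 800):
    X, y = make_classification(n_samples=n, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(f"n={n}: {scores.mean():.2f} +/- {scores.std():.2f}")
```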
Common mistakes (and how to avoid them)
- Tuning on the test set → set aside a final holdout
- Preprocessing before splitting → always pipeline
- Using KFold for grouped data → use GroupKFold
- Shuffling time series → use TimeSeriesSplit
- Ignoring class imbalance → stratify or use proper metrics (AUPRC for rare positives)
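For the imbalance point, scoring matters as much as splitting. A minimal sketch (synthetic data with roughly 5% positives, an assumed setup) scoring with average precision, which approximates AUPRC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem: weights=[0.95] makes ~5% of labels positive.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# 'average_precision' summarizes the precision-recall curve and is far
# more informative than accuracy when positives are rare: a model that
# always predicts "negative" scores ~95% accuracy here but near-zero AP.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='average_precision')
print(scores.mean())
```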
"CV doesn't fix bad data or fundamental confounding — it just helps you measure generalization correctly."
Quick decision map (cheat-sheet)
- Do you have time dependence? → TimeSeriesSplit
- Do you have groups that must be kept together? → GroupKFold
- Binary classification with imbalance? → StratifiedKFold or repeated stratified
- Doing hyperparameter tuning and want unbiased estimate? → Nested CV
Closing takeaways
- Always keep a final held-out test set. Treat it like the final exam.
- Choose the CV strategy that reflects how new data will appear in the real world (time, groups, class ratios).
- Use Pipelines to prevent leakage of preprocessing into validation.
- Use nested CV for honest hyperparameter evaluation and repeated CV for more stable estimates.
Imagine presenting a model's performance at a review meeting. You want your number to be believable, reproducible, and defensible. Use proper splits and CV strategies, and you won't be the person apologizing for a model that collapsed in production.
Further reading / next steps
- Implement a Pipeline + GridSearchCV with StratifiedKFold.
- Try GroupKFold on a dataset with subjects or locations.
- Revisit power/sample-size lessons: simulate how CV variance shrinks as n increases.
Happy validating — go make your model honest (and slightly less smug).