Machine Learning with scikit-learn
Build, tune, and evaluate models using scikit-learn pipelines with reproducible ML workflows.
Data Splits and CV Strategies — Practical scikit-learn Guide
"Good cross-validation is like a good exam: it tests for real understanding, not memorized answers."
You've already seen ML workflows and Pipelines in scikit-learn (nice — that means you know not to leak your scaler into the test set). You've also built statistical intuition for power, sample size, and confounding. Now let's connect those ideas to how we split data and validate models so your evaluation is honest, reproducible, and useful.
Why this matters (short answer)
- A bad split or CV strategy gives you overconfident performance estimates.
- A wrong strategy wastes time (tuning on leakage) or misses structure (groups, time-dependence).
- The right split respects the data-generating process, and the right CV reduces variance in estimates while controlling bias.
Key concepts (quick glossary)
- Train / Validation / Test: Train fits the model, validation (or CV) tunes hyperparameters, test is the final unseen evaluation. Keep test sacred.
- Cross-Validation (CV): splitting training data into folds to estimate generalization.
- Stratification: keeps class proportions similar across folds (important for imbalanced classes).
- Group splits: ensure data from the same group (user, hospital, city) doesn't leak across folds.
- Time-series split: respect temporal ordering; no future -> past leakage.
Practical scikit-learn strategies (what to use and when)
1) Always keep a holdout test set
- Use train_test_split(..., test_size=0.1 or 0.2, random_state=42) to create a final test set.
- Do not touch this test set until the end. It is your unbiased exam.
Code micro-example:
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.15, stratify=y, random_state=42
)
Why: hyperparameter tuning and repeated evaluation on the same set leads to optimistic estimates — remember our pipeline lesson on leakage.
2) Pick the CV flavor to match data structure
- Classification, balanced: StratifiedKFold(n_splits=5)
- Classification, imbalanced: StratifiedKFold or RepeatedStratifiedKFold; consider class-weighted models or resampling
- Grouped data (patients, users): GroupKFold(n_splits=k)
- Time series: TimeSeriesSplit(n_splits=k) (no shuffle!)
- Small n: LeaveOneOut (LOO) — high-variance, expensive; use cautiously
Examples:
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_group = GroupKFold(n_splits=5)
cv_time = TimeSeriesSplit(n_splits=5)
Micro explanation: If you have groups (e.g. multiple rows per patient), using simple KFold will let the model see other samples from the same patient in training — inflating performance.
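A quick sketch of what GroupKFold guarantees, using hypothetical per-patient data (the patient IDs and labels below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 rows from 4 patients, 2 rows per patient.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups):
    # No patient appears in both train and test within a fold.
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
    print("test patients:", sorted(set(groups[test_idx])))
```

With plain KFold the same assertion would fail whenever a patient's rows landed on both sides of a split.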
3) Nested CV for honest hyperparameter selection
- Outer loop: estimate generalization.
- Inner loop: pick hyperparameters.
Use nested CV when you need an unbiased estimate of the performance of the fully tuned model. GridSearchCV inside cross_val_score (or use cross_validate over an outer split) is the common pattern.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
pipeline = Pipeline([...])
param_grid = {...}
inner_cv = StratifiedKFold(5)
outer_cv = StratifiedKFold(5, shuffle=True, random_state=0)
grid = GridSearchCV(pipeline, param_grid, cv=inner_cv)
scores = cross_val_score(grid, X_trainval, y_trainval, cv=outer_cv)
Why: If you use CV to choose parameters and then report the CV score from that same procedure (without a nested structure or a held-out test set), the estimate is optimistically biased.
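Here is a minimal end-to-end sketch of the nested pattern on synthetic data (the dataset, pipeline steps, and grid values are illustrative, not from the original snippet):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}  # illustrative grid

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(pipe, param_grid, cv=inner_cv)
# Each outer fold re-runs the inner search from scratch, so the outer
# score reflects the whole tuning procedure, not one lucky split.
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

Note the `clf__C` naming convention: pipeline step name, double underscore, parameter name.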
4) Use Pipelines to avoid leakage
- Put scaling, imputation, encoding inside Pipeline before CV.
- Never fit transformers on the full dataset before splitting.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
('imputer', SimpleImputer()),
('scaler', StandardScaler()),
('clf', LogisticRegression())
])
Remember: preprocessing inside pipeline is fitted on the training fold only. This is where your pipeline knowledge ties directly into CV integrity.
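To see this in action, here is a runnable sketch (synthetic data with injected missing values, purely for illustration) showing the whole pipeline being cross-validated as one unit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject ~5% missing values

pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# cross_val_score clones and refits the whole pipeline on each fold,
# so imputation means and scaling statistics come from that fold's
# training rows only -- never from the validation rows.
scores = cross_val_score(pipeline, X, y, cv=cv)
print(scores.mean())
```

Fitting the imputer or scaler on all of X before splitting would leak validation-fold statistics into training, which is exactly the bug the pipeline prevents.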
5) Repeated CV and number of folds
- Typical choices: 5 or 10 folds. 5 is faster; 10 gives slightly lower bias at higher compute cost.
- RepeatedKFold or RepeatedStratifiedKFold can reduce variance of the estimate by averaging across random splits.
- For very small datasets, LOO is an option but be aware of high variance and model overfitting risk.
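A short sketch of repeated CV (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, random_state=0)

# 5 folds x 3 repeats = 15 scores; averaging over several random
# partitions damps the split-to-split noise of a single 5-fold run.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rskf)
print(len(scores), scores.mean())
```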
Evaluation & uncertainty: mean, std, and confidence
- Report mean CV score ± std. But beware: fold scores share training data and are therefore correlated, so the std across folds is not a valid standard error for the generalization estimate.
- For better uncertainty quantification, use repeated CV or bootstrap on the CV scores.
- Connect to your previous work on power and sample size: smaller datasets -> higher variance in CV estimates. If your CV scores jump around a lot, you need more data (or simpler models).
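The sample-size point can be checked directly. This toy sketch (synthetic data, arbitrary sizes) runs 5-fold CV at increasing n; the spread of fold scores tends to shrink as the dataset grows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative only: compare CV score spread at different sample sizes.
for n in (50, 200, 800):
    X, y = make_classification(n_samples=n, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(f"n={n}: {scores.mean():.2f} +/- {scores.std():.2f}")
```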
Common mistakes (and how to avoid them)
- Tuning on the test set → set aside a final holdout
- Preprocessing before splitting → always pipeline
- Using KFold for grouped data → use GroupKFold
- Shuffling time series → use TimeSeriesSplit
- Ignoring class imbalance → stratify or use proper metrics (AUPRC for rare positives)
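For the imbalance point, scoring matters as much as splitting. A minimal sketch (synthetic data with roughly 5% positives, an assumed setup) scoring with average precision, which approximates AUPRC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem: weights=[0.95] makes ~5% of labels positive.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# 'average_precision' summarizes the precision-recall curve and is far
# more informative than accuracy when positives are rare: a model that
# always predicts "negative" scores ~95% accuracy here but near-zero AP.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='average_precision')
print(scores.mean())
```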
"CV doesn't fix bad data or fundamental confounding — it just helps you measure generalization correctly."
Quick decision map (cheat-sheet)
- Do you have time dependence? → TimeSeriesSplit
- Do you have groups that must be kept together? → GroupKFold
- Binary classification with imbalance? → StratifiedKFold or repeated stratified
- Doing hyperparameter tuning and want unbiased estimate? → Nested CV
Closing takeaways
- Always keep a final held-out test set. Treat it like the final exam.
- Choose the CV strategy that reflects how new data will appear in the real world (time, groups, class ratios).
- Use Pipelines to prevent leakage of preprocessing into validation.
- Use nested CV for honest hyperparameter evaluation and repeated CV for more stable estimates.
Imagine presenting a model's performance at a review meeting. You want your number to be believable, reproducible, and defensible. Use proper splits and CV strategies, and you won't be the person apologizing for a model that collapsed in production.
Further reading / next steps
- Implement a Pipeline + GridSearchCV with StratifiedKFold.
- Try GroupKFold on a dataset with subjects or locations.
- Revisit power/sample-size lessons: simulate how CV variance shrinks as n increases.
Happy validating — go make your model honest (and slightly less smug).