Train/Validation/Test and Cross-Validation Strategies
Design robust evaluation schemes and prevent leakage with correct resampling and learning curves.
Grouped and Blocked CV
Grouped and Blocked Cross-Validation — The No‑Leakage Drill Sergeant
You did EDA and found weird clusters, repeat customers, or time drift. Now what? Do not randomly shuffle and call it a day. That is how models learn to cheat.
You already met K‑Fold and Stratified K‑Fold. Those are the well‑meaning party guests who mix everyone up evenly. But sometimes your data is sitting in cliques: the same customer appears 10 times, patients have multiple visits, sensors live on the same device, or timestamps march relentlessly forward. In those cases you need the tougher, smarter guest: Grouped and Blocked cross‑validation.
Why grouped and blocked CV exist (aka the leak that keeps on leaking)
- Grouped CV prevents information from the same entity (a group) leaking between train and validation. If you split at the sample level while the same user appears in both train and validation, your model memorizes user id patterns and performance is overoptimistic.
- Blocked CV prevents leakage due to ordering (time) or spatial proximity. For time series, training on future data to predict the past is a crime. For spatial data, nearby locations share signals and should be blocked together.
Quick EDA checks to trigger these methods:
- Count distinct group ids and tabulate frequency per group. If many samples per group, suspect dependence.
- Plot target distribution by group; look for low within‑group variance.
- For time: plot target time series, compute autocorrelation, and check for distribution shift across time windows.
- For space: plot residuals or target over map; look for spatial clustering.
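The first two checks can be sketched in a few lines of pandas. Everything here is toy data: the `user_id` column is a hypothetical stand-in for whatever group key your EDA surfaced.

```python
import numpy as np
import pandas as pd

# Toy data: "user_id" stands in for your real group key.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "user_id": rng.integers(0, 20, size=200),
    "target": rng.normal(size=200),
})

# Check 1: samples per group. Many repeats per id suggests dependence.
group_sizes = df["user_id"].value_counts()
print(group_sizes.describe())

# Check 2: within-group vs overall target variance. A ratio well below 1
# means the group explains much of the signal, so sample-level splits leak.
ratio = df.groupby("user_id")["target"].var().mean() / df["target"].var()
print(ratio)
```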
Grouped CV: Leave the family outside the door
When to use
- Repeated measurements: health records with multiple visits per patient
- User behavior: multiple interactions per user
- Hierarchical data: items nested in stores, students in classrooms
Common strategies
- GroupKFold: split so that each fold has whole groups, never splitting a group across train/val.
- LeaveOneGroupOut (LOGO): extreme form, hold out one group at a time. Good when number of groups is moderate and you want robust generalization to unseen groups.
Practical tips
- If groups have very uneven sizes, folds get imbalanced. Consider grouping at a coarser level or using group‑aware stratification.
- If you need class balance inside groups, try StratifiedGroupKFold (scikit-learn ≥ 1.0) or implement a heuristic that balances labels per fold at the group level.
Code sketch (scikit‑learn style)
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

model = LogisticRegression()  # any estimator; X, y, group_ids come from your data
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=group_ids, cv=cv)
```
Ask yourself: do I care about per‑sample accuracy or per‑group fairness? If the latter, average metrics per group rather than by sample.
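A toy illustration of the difference (labels here are made up): per-sample accuracy rewards the big group, while per-group averaging weights each entity equally.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 1, 0])
groups = np.array(["a", "a", "a", "a", "b", "b"])  # group a is twice as big

# Per-sample: 4/6 correct. Per-group: mean(1.0 for a, 0.0 for b) = 0.5.
per_group = [accuracy_score(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)]
print(accuracy_score(y_true, y_pred), float(np.mean(per_group)))
```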
Blocked CV: Time and space are not random
When to use
- Time series forecasting and any temporally ordered problem
- Spatial modeling where nearby samples are correlated
Patterns
- TimeSeriesSplit: forward chaining / rolling window splits that respect temporal order. Train on earlier times, validate on later times.
- Expanding window: train expands with time; validation moves forward.
- Sliding window: train window slides forward to focus on recent behavior.
- Spatial blocking: break space into tiles/blocks, then cross‑validate across blocks to avoid spatial leakage.
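scikit-learn's TimeSeriesSplit implements the forward-chaining pattern above; a minimal sketch on toy, time-sorted data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # rows already sorted by time

# Expanding-window splits: train grows, validation always lies in the future.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    print("train", train_idx, "-> val", val_idx)
```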
Why the naive random split fails for time
Randomly shuffling breaks causality. Your model learns signals that only exist because future data leaked into training. That is not real forecasting ability — it is a magic trick.
Example time CV pseudocode
```python
# assume df is sorted by timestamp
train_end_idx = initial_train_size
val_end_idx = initial_train_size + val_size
for fold in range(n_folds):
    train = df.iloc[:train_end_idx]
    val = df.iloc[train_end_idx:val_end_idx]
    # fit on train, score on val, then roll both windows forward
    train_end_idx += step
    val_end_idx += step
```
Pro tip: include a gap between the train and validation windows when short-term leakage is possible (e.g. label generation uses future information, or there is measurement lag).
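TimeSeriesSplit supports this directly via its gap parameter (scikit-learn ≥ 0.24); a quick sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)

# gap=2 discards the two samples just before each validation window,
# guarding against labels built from near-future information.
tscv = TimeSeriesSplit(n_splits=3, gap=2)
for train_idx, val_idx in tscv.split(X):
    print("train ends at", train_idx.max(), "| val starts at", val_idx.min())
```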
Grouped vs Blocked: a cheat sheet
| Situation | Use | Typical scikit‑learn tool |
|---|---|---|
| Same user appears many times | Grouped CV | GroupKFold, LeaveOneGroupOut |
| Temporal dependence | Blocked CV | TimeSeriesSplit, custom rolling window |
| Spatial dependence | Blocked CV | Spatial blocking (tile + K‑Fold) |
| Need class balance and group safety | Grouped + stratify | StratifiedGroupKFold or custom solver |
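The spatial-blocking row has no single built-in helper, but the tile-then-cross-validate idea can be sketched by bucketing coordinates into coarse tiles and reusing GroupKFold (the 2.5-unit grid size here is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
lat = rng.uniform(0.0, 10.0, size=100)
lon = rng.uniform(0.0, 10.0, size=100)

# Bucket coordinates into 2.5-unit tiles; the tile id becomes the group,
# so nearby points never end up on opposite sides of a split.
tile = (lat // 2.5).astype(int) * 10 + (lon // 2.5).astype(int)

gkf = GroupKFold(n_splits=4)
X = np.column_stack([lat, lon])
for train_idx, val_idx in gkf.split(X, groups=tile):
    print("fold validation size:", len(val_idx))
```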
Practical gotchas and how to handle them
- Unequal group sizes
  - Small groups: collapse into a coarser grouping or drop if meaningless
  - Huge groups: they dominate folds; consider stratifying groups by size or using LOGO
- Class imbalance across groups
  - Try group-level stratification
  - If that is impossible, resample at the group level, not the sample level
- Hyperparameter tuning
  - Use the same grouping logic in the inner CV of your search: pass groups to GridSearchCV so the inner folds respect grouping
- Metric averaging
  - If fairness across groups matters, compute per-group metrics and average them; otherwise large groups will dominate
- When groups correlate with time or space
  - Combine approaches: e.g., block by time within groups, or leave one group out while also keeping temporal order in the training set
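For the hyperparameter-tuning gotcha in particular, here is a minimal sketch of a group-aware search on toy data (Ridge is just a placeholder estimator):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
groups = np.repeat(np.arange(12), 5)  # 12 groups of 5 samples each

# cv=GroupKFold makes the inner folds group-aware;
# groups must be passed to fit() so the splitter can see it.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      cv=GroupKFold(n_splits=4))
search.fit(X, y, groups=groups)
print(search.best_params_)
```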
A short workflow checklist (because chaos is not a strategy)
- Did EDA show repeated IDs, time drift, or spatial clustering? If yes, do not use plain K‑Fold.
- Decide grouping variable(s) and validate group counts and sizes.
- Choose CV strategy: GroupKFold, LOGO, TimeSeriesSplit, spatial blocks, or hybrid.
- If tuning, nest grouped/blocked CV inside hyperparameter search and pass groups.
- Report evaluation with the right averaging (per sample vs per group) and include uncertainty (fold std).
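One way to sketch the hybrid idea from the gotchas above (blocking by time within groups) is a small helper that holds out the most recent slice of each group's timeline. This is a hypothetical function written for illustration, not a library API:

```python
import numpy as np

def group_time_holdout(timestamps, groups, frac=0.2):
    """Hold out the latest `frac` of each group's timeline as validation."""
    val_mask = np.zeros(len(timestamps), dtype=bool)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        idx = idx[np.argsort(timestamps[idx])]  # temporal order within group
        n_val = max(1, int(len(idx) * frac))
        val_mask[idx[-n_val:]] = True           # most recent samples -> val
    return ~val_mask, val_mask

ts = np.arange(12)
grp = np.repeat(np.arange(3), 4)  # 3 groups, 4 time-ordered samples each
train_mask, val_mask = group_time_holdout(ts, grp)
print(int(val_mask.sum()), "validation samples")
```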
Expert takeaway: The point of grouped and blocked CV is to make your validation mimic the real world. If your production setting will see new users, future timestamps, or new locations, your CV must hold those same kinds of unknowns out of training.
Final pep talk
You did EDA, found the cracks, and now you are sealing them with the right CV mortar. Grouped and blocked cross‑validation are not optional niceties — they are the difference between a model that performs well in a carefully curated lab and one that survives the wild. Use them, and your reported metrics will stop being lies you tell yourself and start being honest signals you can trust.
Key takeaways
- Use grouped CV when samples are linked by entity; use blocked CV when order or proximity matters.
- Respect grouping in both outer evaluation and inner tuning loops.
- Do EDA specific to grouping and blocking before choosing a strategy.
Go forth and cross‑validate like a careful, slightly paranoid scientist. Your future self (and your stakeholders) will thank you.