Train/Validation/Test and Cross-Validation Strategies
Design robust evaluation schemes and prevent leakage with correct resampling and learning curves.
Stratified K-Fold for Classification — Because Class Imbalance Hates Surprises
"If your folds are drama queens — wildly different class mixes — your CV score will be a soap opera. Stratify and calm the chaos." — The Voice of Good Validation Habits
You already know the drill from earlier modules: we looked at holdout validation (the single-split who thinks it can judge everything) and plain K-Fold cross-validation (the crowd-pleaser that reduces variance). You also practiced EDA to sniff out class imbalance, distribution shifts, and the little modeling landmines that make models cry. Now we build on that: Stratified K-Fold — the version of K-Fold that respects class proportions so your model evaluation doesn't throw a tantrum when it meets the real world.
Quick reminder (no rehashing, promise)
- From Holdout: single split is fast but high-variance and sensitive to unlucky splits.
- From K-Fold: averaging across K folds gives more stable estimates, but naïve K-Fold can accidentally create folds with very different class proportions.
- From EDA: you should already know if classes are imbalanced, rare, or shifting. Stratification is a natural follow-up step once you see those imbalances.
What is Stratified K-Fold? (The elevator pitch)
Stratified K-Fold is K-Fold cross-validation where each fold preserves (approximately) the same class proportions as the full dataset. If your dataset is 90% class A and 10% class B, each fold should be roughly 90/10. It reduces the chance that a fold will have too few or no examples of a class — which is a one-way ticket to misleading metrics.
Why does this matter? Because classification metrics (accuracy, precision, recall, AUC) are sensitive to class distribution. If a fold lacks positive cases, the model may appear perfectly terrible or suspiciously perfect — neither is useful.
When to use Stratified K-Fold (and when not)
- Use it if: you have classification tasks (binary or multiclass) and class proportions are imbalanced or non-uniform. Essential for rare-event problems (fraud, disease detection, etc.).
- Don't use it if: you're doing regression (target is continuous) — stratification on a continuous target is meaningless unless you bin the target (see later). Also avoid stratifying on leak-prone labels or labels derived from future info.
- Be careful: if you also need to respect groups (e.g., multiple rows per user), use GroupKFold or a combined strategy (StratifiedGroupKFold or manual grouping) — stratifying alone doesn't prevent leakage across groups.
How it works (simple algorithmic intuition)
- For each class, split its examples into K nearly-equal subsets.
- Construct K folds by taking one subset from each class.
- Each fold now has roughly the same class mix as the whole dataset.
This keeps class representation consistent and yields more stable estimates of model performance.
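You can verify this intuition directly. The sketch below builds a synthetic 90/10 dataset (make_classification here is just a stand-in for your real data) and compares the per-fold positive rate under plain KFold versus StratifiedKFold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic 90/10 imbalanced dataset, matching the elevator-pitch example
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

for name, cv in [("KFold", KFold(5, shuffle=True, random_state=42)),
                 ("StratifiedKFold", StratifiedKFold(5, shuffle=True, random_state=42))]:
    # Fraction of positives in each test fold
    rates = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, [f"{r:.3f}" for r in rates])
```

The stratified rates sit within a sample or two of the overall positive rate in every fold, while plain KFold's rates drift fold to fold.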
Practical tips and pitfalls (listen up)
- Small class counts: if a rare class has fewer than K examples, stratification will fail outright or produce folds with zero examples of that class. Fixes:
  - Reduce K.
  - Use RepeatedStratifiedKFold with fewer splits.
  - Combine rare classes sensibly if domain-appropriate.
- Multiclass complexity: StratifiedKFold works naturally for multiclass; however, very skewed multiclass distributions still suffer from small class counts.
- Shuffling: Usually set shuffle=True with a fixed random_state to avoid deterministic order issues (e.g., sorted data causing weird splits).
- Metrics matter: Use class-aware metrics (precision/recall/F1/AUC) not accuracy-only for imbalanced problems.
- Do not stratify on engineered labels that leak future info. Stratify on the true, clean label column.
- For regression: if you must stratify, bin the target into quantiles (e.g., pd.qcut) — but evaluate carefully: binning can hide nuances.
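The regression tip above can be sketched in a few lines: bin a skewed continuous target into quantiles with pd.qcut, then feed the bins (not the raw target) to StratifiedKFold. The target and features here are synthetic stand-ins; you would still train and evaluate on the original continuous target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y_cont = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # skewed continuous target
X = rng.normal(size=(500, 4))

# Quintile bins drive the split only; models still see the raw y_cont
y_binned = pd.qcut(y_cont, q=5, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y_binned):
    # Each fold now spans the full target range, not just one tail
    print(f"test-fold target mean: {y_cont[test_idx].mean():.2f}")
```

Remember the caveat from the bullet above: binning is a blunt instrument, and five quantiles is an arbitrary choice, not a recommendation.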
Quick code cheat-sheet (scikit-learn)
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy imbalanced data so the snippet runs as-is; substitute your own X, y
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=skf, scoring='roc_auc')
print('AUC mean:', scores.mean())
```
If you prefer a single split that keeps proportions (train/test):
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
```
Short comparison table — which CV to pick?
| Method | Keeps class balance? | Keeps groups? | Good for imbalanced? |
|---|---|---|---|
| KFold | No | No | No (risk of imbalance) |
| StratifiedKFold | Yes | No | Yes |
| GroupKFold | No | Yes | No (unless groups happen to match classes) |
| StratifiedShuffleSplit | Yes (for splits) | No | Yes (randomized) |
Example scenarios (imagine the nightmares avoided)
- Fraud detection (0.5% positives): plain K-Fold will likely produce some folds with zero fraud cases, so a model that never predicts fraud looks flawless on them. Use StratifiedKFold.
- Multi-hospital patient data: you need to avoid leaking patients between folds — use GroupKFold, possibly combined with stratification on outcome if you can (StratifiedGroupKFold or a manual stratified split within groups).
- Ordinal target (mild/moderate/severe): stratify by the ordinal labels, but check counts — small 'severe' class might need merging or class-weighting.
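The fraud scenario is easy to simulate: with exactly 10 positives in 2,000 rows (a hypothetical 0.5% rate), stratification guarantees 2 frauds in every test fold, while plain KFold scatters them wherever the shuffle lands.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical fraud setup: 2,000 transactions, exactly 10 frauds (0.5%)
rng = np.random.default_rng(7)
y = np.zeros(2000, dtype=int)
y[rng.choice(2000, size=10, replace=False)] = 1
X = rng.normal(size=(2000, 5))

for name, cv in [("KFold", KFold(5, shuffle=True, random_state=7)),
                 ("StratifiedKFold", StratifiedKFold(5, shuffle=True, random_state=7))]:
    counts = [int(y[test].sum()) for _, test in cv.split(X, y)]
    print(name, "frauds per test fold:", counts)
```

With 10 positives and 5 folds, the stratified counts are exactly [2, 2, 2, 2, 2]; the unstratified counts depend entirely on luck.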
A bite-sized validation recipe (practical checklist)
- During EDA, check class counts and rare classes. If any class < K, reduce K or adjust.
- Decide folds: StratifiedKFold(n_splits=K, shuffle=True, random_state=seed).
- Use appropriate scoring (precision/recall/F1/AUC) and class-aware thresholding analysis.
- Beware of leakage: ensure that temporal/group constraints are honored; if both group and stratify are needed, employ StratifiedGroupKFold or do a careful manual split.
- Aggregate fold-level metrics and inspect per-class performance across folds (not just averages).
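The last two checklist items can be sketched with cross_validate, which returns per-fold scores for several class-aware metrics at once. The dataset is synthetic; the scorer names are standard scikit-learn scoring strings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic 90/10 stand-in for a real imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=skf,
    scoring={"auc": "roc_auc", "recall": "recall", "precision": "precision"},
)

# Inspect the fold-level spread, not just the mean
for metric in ("auc", "recall", "precision"):
    s = results[f"test_{metric}"]
    print(f"{metric}: mean={s.mean():.3f} min={s.min():.3f} max={s.max():.3f}")
```

A large min-to-max gap on recall or precision across folds is exactly the kind of instability that a single averaged number would hide.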
Final flourish (the mic-drop summary)
Stratified K-Fold is the pragmatic, slightly smug cousin of K-Fold that demands class balance in every fold. It's essential when classes are imbalanced or rare — because you want your cross-validation to reflect the real world, not a lens-blurred fantasy. Use it after EDA tells you about skew, but remember: it doesn't replace group-aware splitting or careful leakage checks. Treat it like a strong suggestion with a seatbelt: safe, sensible, and dramatically reduces evaluation surprises.
TL;DR: If your classes aren't evenly distributed, stratify your folds — your metrics will thank you, and your future self will send you a gratitude card.