Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Addressing Class Imbalance
Addressing Class Imbalance — Make Your Rare Class Stop Being Ignored
"If your model was a high school party, class imbalance is that tiny group in the corner who never get asked to dance." — Your friendly (and slightly dramatic) TA
Why this matters (and why your accuracy is lying)
You've parsed dates, cleaned messy text, and wrestled with joins in pandas. Great — now imagine you train a classifier on a dataset where 98% of labels are "no" and 2% are "yes." Your model can predict "no" forever and still score 98% accuracy. But in fraud detection, medical diagnosis, or rare-event prediction, that 2% is the whole point.
In short: class imbalance breaks naive training and naive evaluation. You must handle it during cleaning and feature engineering if you want models that actually learn the minority signal.
Quick-check with pandas (remember the data analysis chapter)
Use pandas to spot imbalance immediately:
# assume df has column 'label'
import pandas as pd
counts = df['label'].value_counts(normalize=True)
print(counts)
# visual check
counts.plot(kind='bar')
If one bar dominates, that's your cue to act. Also check imbalance over time if your data is temporal (you parsed datetimes earlier):
# after datetime parsing from the earlier module
df.groupby(pd.Grouper(key='event_time', freq='M'))['label'].value_counts(normalize=True)
Why? Because imbalance can drift — and a model trained on one era might fail later.
Options to address class imbalance (the toolbox)
I'll list the approaches, when to use them, and their key pitfalls.
Stratified splitting (always do this first)
- Use train_test_split(..., stratify=y) or StratifiedKFold.
- Prevents accidentally creating a train or test set with zero minority examples.
- Pitfall: For time-series, use time-based validation instead and ensure minority events exist across time folds.
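A minimal sketch of stratified splitting on synthetic data (the dataset here is made up for illustration) shows that both splits preserve the minority rate:

```python
# Minimal sketch: stratified split on a synthetic imbalanced dataset.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.02).astype(int)  # ~2% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
# Both splits keep roughly the same positive rate
print(y_tr.mean(), y_te.mean())
```

Without `stratify=y`, a small test set can easily end up with zero minority examples, making recall undefined.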
Resampling
- Oversampling: duplicate minority rows (simple) or create synthetic samples (SMOTE).
- Undersampling: drop majority rows to balance proportions.
- Use when model performance is harmed and data quantity is manageable.
- Pitfall: Oversampling before data splitting causes data leakage. Always resample inside CV folds or in a Pipeline after splitting.
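To make the split-then-resample order concrete, here is a plain-pandas sketch of random oversampling applied only to the training rows (the toy DataFrame is invented for illustration):

```python
# Sketch: random oversampling applied only to the training rows.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(100), 'label': [1] * 5 + [0] * 95})
train, test = train_test_split(df, stratify=df['label'],
                               test_size=0.2, random_state=0)

minority = train[train['label'] == 1]
majority = train[train['label'] == 0]
# Duplicate minority rows with replacement until classes match;
# the test set is never touched, so no leakage.
balanced = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])
print(balanced['label'].value_counts())
```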
Synthetic oversampling (SMOTE, ADASYN)
- SMOTE creates synthetic minority samples by interpolating neighbors.
- Great when minority class examples are similar in feature space.
- Pitfall: For high-dimensional sparse text features (e.g., TF-IDF) SMOTE can produce unrealistic samples. Instead, consider generating synthetic text (hard) or use class weights.
Class weights / cost-sensitive learning
- Set model-specific weights (e.g., LogisticRegression(class_weight='balanced'), RandomForest with class_weight, XGBoost scale_pos_weight).
- No resampling — model penalizes misclassifying minority more heavily.
- Works well for many tree and linear models and avoids data duplication.
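To see what `class_weight='balanced'` actually does, you can inspect the implied weights with scikit-learn's `compute_class_weight` on a toy label vector:

```python
# Sketch: the per-class weights that class_weight='balanced' implies.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 98 + [1] * 2)  # 98% / 2% toy labels
weights = compute_class_weight('balanced',
                               classes=np.array([0, 1]), y=y)
# weight = n_samples / (n_classes * class_count)
print(dict(zip([0, 1], weights)))  # minority errors cost far more
```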
Ensemble and hybrid methods
- BalancedBaggingClassifier, EasyEnsemble (undersampling ensembles) — combine multiple weak models on balanced subsets.
- Useful when both undersampling and variance are concerns.
Threshold tuning and probability calibration
- Move decision threshold or calibrate predicted probabilities (CalibratedClassifierCV) rather than sticking with 0.5.
- Use precision-recall trade-offs to choose operational thresholds.
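A sketch of threshold tuning: fit a model on synthetic imbalanced data, then pick the highest threshold that still meets a recall target (the 0.8 target here is an arbitrary example policy):

```python
# Sketch: pick an operating threshold from the precision-recall curve
# instead of the default 0.5 cut-off.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, proba)
# Example policy: the highest threshold that still achieves >= 0.8 recall
ok = recall[:-1] >= 0.8
threshold = thresholds[ok][-1]
preds = (proba >= threshold).astype(int)
```

In practice you would tune the threshold on a validation set, not on the training data as this toy sketch does.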
Anomaly detection / one-class models
- If minority class is extremely rare and qualitatively different, treat it as anomaly detection.
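A sketch of the anomaly-detection framing, using scikit-learn's `IsolationForest` on synthetic data where the rare events are injected far from the normal cluster:

```python
# Sketch: treat the rare class as anomalies with an unsupervised detector.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(980, 2))
rare = rng.normal(6, 0.5, size=(20, 2))   # far-away rare events
X = np.vstack([normal, rare])

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
flags = iso.predict(X)  # -1 = anomaly, 1 = normal
print((flags == -1).sum())
```

Note the detector never sees labels; `contamination` encodes your prior belief about the rare-event rate.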
Practical recipes (code + pipeline patterns)
Always combine resampling inside pipelines so transformations are fit only on training data.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

pipeline = ImbPipeline([
    ('tfidf', TfidfVectorizer()),       # if text features exist (text cleaning module)
    ('smote', SMOTE(random_state=42)),  # mind the TF-IDF caveat above; fit only on training folds
    ('clf', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
If using numeric features and tree models, try class weights instead:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced', n_estimators=200)
clf.fit(X_train, y_train)
For XGBoost:
import xgboost as xgb

# scale_pos_weight = n_negative / n_positive
scale = (y_train == 0).sum() / (y_train == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=scale)
Evaluate correctly: prefer precision, recall, and PR-AUC over accuracy
When classes are imbalanced, prefer:
- Precision (how many predicted positives are true)
- Recall / Sensitivity (how many true positives did we catch)
- F1-score (harmonic mean of precision & recall)
- Precision-Recall AUC (better than ROC-AUC when positive class is rare)
Plot PR curves, confusion matrices, and use threshold tuning to pick operating points that match business needs.
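The accuracy trap from the introduction is easy to reproduce: a classifier that always predicts the majority class scores high accuracy but zero F1 (toy data, invented for illustration):

```python
# Sketch: compare accuracy with imbalance-aware metrics on a majority-class
# baseline model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

X, y = make_classification(n_samples=1000, weights=[0.98, 0.02],
                           random_state=0)
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
preds = dummy.predict(X)

acc = accuracy_score(y, preds)                          # looks great (~0.98)
f1 = f1_score(y, preds, zero_division=0)                # 0.0 — the truth
ap = average_precision_score(y, dummy.predict_proba(X)[:, 1])
print(acc, f1, ap)
```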
Special considerations — tiebacks to datetime & text cleaning
- If you're using datetime-derived features (hour, day, trend), check whether the minority events cluster in certain periods. If they do, performing time-aware resampling or stratification is critical.
- For text features (from the earlier text cleaning section), avoid naive SMOTE on TF-IDF matrices. Instead:
- Use class weights, or
- Generate new synthetic text examples via data augmentation (back-translation, synonym replacement), or
- Use embedding-based SMOTE on dense vector representations (e.g., sentence embeddings).
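A sketch of the temporal check described above, on simulated data where the minority events spike at night (the `event_time` and `label` column names follow the earlier examples):

```python
# Sketch: check whether minority events cluster in certain hours.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
times = pd.date_range('2024-01-01', periods=1000, freq='h')
# Simulate rare events that occur only at night (hours 0-5)
label = ((times.hour < 6) & (rng.random(1000) < 0.10)).astype(int)
df = pd.DataFrame({'event_time': times, 'label': label})

by_hour = df.groupby(df['event_time'].dt.hour)['label'].mean()
print(by_hour.idxmax(), by_hour.max())  # night hours dominate
```

If a plot of `by_hour` shows strong clustering, random resampling across the whole training set can distort the time structure; prefer time-aware splits.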
Quick decision flow (cheat sheet)
- Stratify your splits (or use time-based splits).
- If minority > ~5%: try class weights first.
- If minority between ~1%–5% and numeric features: try SMOTE inside CV/Pipeline.
- If minority < 1% or noisy: consider anomaly detection or ensembles (EasyEnsemble).
- Always evaluate with PR-AUC, recall, and confusion matrices.
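The cheat sheet above can be encoded as a toy helper; the thresholds are rules of thumb from this section, not hard laws, so always validate the chosen strategy empirically:

```python
# Toy helper encoding the cheat-sheet thresholds (rules of thumb only).
def suggest_strategy(minority_fraction: float) -> str:
    if minority_fraction > 0.05:
        return 'class weights'
    if minority_fraction >= 0.01:
        return 'SMOTE inside CV/pipeline'
    return 'anomaly detection or undersampling ensembles'

print(suggest_strategy(0.10))   # class weights
print(suggest_strategy(0.02))   # SMOTE inside CV/pipeline
print(suggest_strategy(0.001))  # anomaly detection or undersampling ensembles
```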
Common mistakes to avoid
- Oversampling before splitting → data leakage.
- Using accuracy as the main metric.
- Applying SMOTE to sparse high-dimensional text without care.
- Ignoring temporal drift in class balance.
Key takeaways
- Class imbalance is a data problem and an evaluation problem. Fix both.
- Use stratified splits, resample inside pipelines/CV, and prefer class weights for many models.
- Evaluate with precision/recall/F1 and PR-AUC — accuracy is often meaningless.
"Balancing classes is less about making your dataset pretty and more about making your model honest." — There, I said it.