
Python for Data Science, AI & Development
Chapters

1. Python Foundations for Data Work

2. Data Structures and Iteration

3. Numerical Computing with NumPy

4. Data Analysis with pandas

5. Data Cleaning and Feature Engineering

  • Detecting and Handling Outliers
  • Imputation Strategies
  • Scaling and Normalization
  • Encoding Categorical Variables
  • Feature Binning and Discretization
  • Feature Interactions and Polynomials
  • Text Cleaning Basics
  • Datetime Parsing and Features
  • Addressing Class Imbalance
  • Target Leakage Avoidance
  • Train–Validation Splits
  • Pipeline-Friendly Transforms
  • Feature Selection Methods
  • Dimensionality Reduction
  • Multicollinearity and Correlation

6. Data Visualization and Storytelling

7. Statistics and Probability for Data Science

8. Machine Learning with scikit-learn

9. Deep Learning Foundations

10. Data Sources, Engineering, and Deployment


Data Cleaning and Feature Engineering


Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.

Lesson 9 of 15: Addressing Class Imbalance

Addressing Class Imbalance in Python: Resampling & Weights

Addressing Class Imbalance — Make Your Rare Class Stop Being Ignored

"If your model was a high school party, class imbalance is that tiny group in the corner who never get asked to dance." — Your friendly (and slightly dramatic) TA


Why this matters (and why your accuracy is lying)

You've parsed dates, cleaned messy text, and wrestled with joins in pandas. Great — now imagine you train a classifier on a dataset where 98% of labels are "no" and 2% are "yes." Your model can predict "no" forever and still score 98% accuracy. But in fraud detection, medical diagnosis, or rare-event prediction, that 2% is the whole point.
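
To feel the pain concretely, here's a minimal sketch (toy data, illustrative names) where a do-nothing baseline "wins" on accuracy while catching zero positives:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# toy dataset: 1000 rows, 2% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# a "model" that always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))  # 0.98 — looks impressive
print(recall_score(y, pred))    # 0.0  — misses every single positive
```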

In short: class imbalance breaks naive training and naive evaluation. You must handle it during cleaning and feature engineering if you want models that actually learn the minority signal.


Quick-check with pandas (remember the data analysis chapter)

Use pandas to spot imbalance immediately:

import pandas as pd
# assume df has a column named 'label'
counts = df['label'].value_counts(normalize=True)
print(counts)
# visual check (needs matplotlib installed)
counts.plot(kind='bar')

If one bar dominates, that's your cue to act. Also check imbalance over time if your data is temporal (you parsed datetimes earlier):

# after datetime parsing from the earlier module
monthly = df.groupby(pd.Grouper(key='event_time', freq='M'))['label']
monthly.value_counts(normalize=True)

Why? Because imbalance can drift — and a model trained on one era might fail later.


Options to address class imbalance (the toolbox)

I'll list the approaches, when to use them, and their key pitfalls.

  1. Stratified splitting (always do this first)

    • Use train_test_split(..., stratify=y) or StratifiedKFold.
    • Prevents accidentally creating a train or test set with zero minority examples.
    • Pitfall: For time-series, use time-based validation instead and ensure minority events exist across time folds.
  2. Resampling

    • Oversampling: duplicate minority rows (simple) or create synthetic samples (SMOTE).
    • Undersampling: drop majority rows to balance proportions.
    • Use when model performance is harmed and data quantity is manageable.
    • Pitfall: Oversampling before data splitting causes data leakage. Always resample inside CV folds or in a Pipeline after splitting.
  3. Synthetic oversampling (SMOTE, ADASYN)

    • SMOTE creates synthetic minority samples by interpolating neighbors.
    • Great when minority class examples are similar in feature space.
    • Pitfall: For high-dimensional sparse text features (e.g., TF-IDF) SMOTE can produce unrealistic samples. Instead, consider generating synthetic text (hard) or use class weights.
  4. Class weights / cost-sensitive learning

    • Set model-specific weights (e.g., LogisticRegression(class_weight='balanced'), RandomForest with class_weight, XGBoost scale_pos_weight).
    • No resampling — model penalizes misclassifying minority more heavily.
    • Works well for many tree and linear models and avoids data duplication.
  5. Ensemble and hybrid methods

    • BalancedBaggingClassifier, EasyEnsemble (undersampling ensembles) — combine multiple weak models on balanced subsets.
    • Useful when both undersampling and variance are concerns.
  6. Threshold tuning and probability calibration

    • Move decision threshold or calibrate predicted probabilities (CalibratedClassifierCV) rather than sticking with 0.5.
    • Use precision-recall trade-offs to choose operational thresholds.
  7. Anomaly detection / one-class models

    • If minority class is extremely rare and qualitatively different, treat it as anomaly detection.
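
Technique 6 deserves a quick sketch: pick an operating threshold from the precision–recall curve instead of defaulting to 0.5. Everything below is a toy example on synthetic data, and the 80% recall target is an arbitrary stand-in for whatever your business actually requires:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, recall_score

# imbalanced toy data: ~5% positives
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

prec, rec, thresh = precision_recall_curve(y_te, proba)
# among thresholds that keep recall >= 0.80, take the most precise one
ok = rec[:-1] >= 0.80  # prec/rec have one more entry than thresh
best = thresh[ok][np.argmax(prec[:-1][ok])]
pred = (proba >= best).astype(int)
print(f"threshold={best:.3f}, recall={recall_score(y_te, pred):.3f}")
```

In real life you'd tune the threshold on a validation set, not the test set; it's done on one split here purely for brevity.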

Practical recipes (code + pipeline patterns)

Always combine resampling inside pipelines so transformations are fit only on training data.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

pipeline = ImbPipeline([
    ('tfidf', TfidfVectorizer()),  # if text features exist (but see the SMOTE-on-TF-IDF caveat later)
    ('smote', SMOTE(random_state=42)),  # resampling runs inside the pipeline, after the split
    ('clf', LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)

If using numeric features and tree models, try class weights instead:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced', n_estimators=200)
clf.fit(X_train, y_train)

For XGBoost:

import xgboost as xgb

# scale_pos_weight = n_negative / n_positive
scale = (y_train == 0).sum() / (y_train == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=scale)

Evaluate correctly: precision, recall, and PR-AUC > accuracy

When classes are imbalanced, prefer:

  • Precision (how many predicted positives are true)
  • Recall / Sensitivity (how many true positives did we catch)
  • F1-score (harmonic mean of precision & recall)
  • Precision-Recall AUC (better than ROC-AUC when positive class is rare)

Plot PR curves, confusion matrices, and use threshold tuning to pick operating points that match business needs.
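
The same toy setup can show these metrics in action; `average_precision_score` is scikit-learn's PR-AUC summary. The dataset below is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix)

# ~3% positives
X, y = make_classification(n_samples=4000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

pr_auc = average_precision_score(y_te, proba)  # PR-AUC; chance level = positive rate
print(f"PR-AUC: {pr_auc:.3f}")
print(confusion_matrix(y_te, clf.predict(X_te)))
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

Note the chance-level comparison: a random scorer gets a PR-AUC equal to the positive rate, so always report that baseline next to your number.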


Special considerations — tiebacks to datetime & text cleaning

  • If you're using datetime-derived features (hour, day, trend), check whether the minority events cluster in certain periods. If they do, performing time-aware resampling or stratification is critical.
  • For text features (from the earlier text cleaning section), avoid naive SMOTE on TF-IDF matrices. Instead:
    • Use class weights, or
    • Generate new synthetic text examples via data augmentation (back-translation, synonym replacement), or
    • Use embedding-based SMOTE on dense vector representations (e.g., sentence embeddings).

Quick decision flow (cheat sheet)

  1. Stratify your splits (or use time-based splits).
  2. If minority > ~5%: try class weights first.
  3. If minority between ~1%–5% and numeric features: try SMOTE inside CV/Pipeline.
  4. If minority < 1% or noisy: consider anomaly detection or ensembles (EasyEnsemble).
  5. Always evaluate with PR-AUC, recall, and confusion matrices.
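
If you're wondering what `class_weight='balanced'` (step 2) actually computes, it's just `n_samples / (n_classes * class_count)`: rare classes get big weights. A quick sketch with made-up counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95/5 imbalance
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
# class 0: 100 / (2 * 95) ≈ 0.526;  class 1: 100 / (2 * 5) = 10.0
```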

Common mistakes to avoid

  • Oversampling before splitting → data leakage.
  • Using accuracy as the main metric.
  • Applying SMOTE to sparse high-dimensional text without care.
  • Ignoring temporal drift in class balance.

Key takeaways

  • Class imbalance is a data problem and an evaluation problem. Fix both.
  • Use stratified splits, resample inside pipelines/CV, and prefer class weights for many models.
  • Evaluate with precision/recall/F1 and PR-AUC — accuracy is often meaningless.

"Balancing classes is less about making your dataset pretty and more about making your model honest." — There, I said it.

