Python for Data Science, AI & Development

Data Cleaning and Feature Engineering


Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.


Imputation Strategies — Fill the Holes Without Breaking Your Model

"This is the moment where the concept finally clicks." — when you stop treating missing data like a problem to ignore and start treating it like information to respect.


If you just finished cleaning outliers and wrangling data from SQL tables with pandas (and maybe used regex to unearth sketchy placeholder strings), congratulations — you’re ready for the next battlefield: missing values. Imputation is the art (and science) of filling gaps in your dataset so models can do their work without tripping over NaNs.

This guide builds on those earlier lessons: use your pandas indexing and joins knowledge to prepare groups for group-wise imputation, and your string-methods/regex toolkit to detect placeholder values ("N/A", "none", "-999") before you start imputing.

Why imputation matters

  • Many ML algorithms can't accept NaNs.
  • Improper imputation leaks information or biases estimates.
  • Different missingness mechanisms (MCAR/MAR/MNAR) demand different strategies.

Quick definitions (the three musketeers of missingness)

  • MCAR (Missing Completely At Random): missingness unrelated to data. Rare but easiest.
  • MAR (Missing At Random): missingness depends on observed data (e.g., older patients more likely to skip a test).
  • MNAR (Missing Not At Random): missingness depends on the missing values themselves (e.g., high incomes hide). Hard to fix—requires modeling the missingness itself.
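A quick way to feel the difference: simulate MCAR and MAR missingness on invented data and compare observed means. The ages, incomes, and the missingness model below are all made up for illustration; the point is that MAR missingness biases naive statistics while MCAR does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(50, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)  # income rises with age (by construction)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: 10% of incomes vanish completely at random
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR: older respondents are more likely to withhold income
mar = df.copy()
p_miss = 0.4 / (1 + np.exp(-(df["age"] - 50) / 5))
mar.loc[rng.random(n) < p_miss, "income"] = np.nan

# under MCAR the observed mean stays unbiased; under MAR it is pulled down,
# because the missing values were disproportionately the high ones
print(df["income"].mean(), mcar["income"].mean(), mar["income"].mean())
```

Filling the MAR column with its observed mean would bake that downward bias into every imputed row.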

Basic imputation techniques (start here)

1) Drop rows/columns

  • Use only when missingness is tiny or entire column is useless.
  • Fast, but potentially wasteful.

2) Constant replacement

  • Numeric -> 0 or some sentinel; categorical -> "Unknown".
  • Useful for tree models and when missingness itself is informative.

3) Mean / Median / Mode

  • Mean: good for symmetric numeric distributions but sensitive to outliers.
  • Median: robust to outliers (useful after your outlier-handling step!).
  • Mode: for categorical variables.

Example (pandas):

# numeric median imputation
df['age'] = df['age'].fillna(df['age'].median())
# categorical mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

Tip: If you previously detected and fixed outliers, prefer median after that step — outliers can skew the mean.
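A tiny illustration of why (values invented; 480 stands in for an unhandled outlier):

```python
import pandas as pd

s = pd.Series([25, 30, 28, None, 27, 480])  # 480 is an un-handled outlier

print(s.fillna(s.mean()).iloc[3])    # 118.0 -- the outlier drags the fill value up
print(s.fillna(s.median()).iloc[3])  # 28.0  -- the median stays representative
```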


Time-series specific: forward/backward fill and interpolation

  • forward-fill (ffill) and backward-fill (bfill) preserve local continuity.
  • interpolation (linear, time) can be powerful for continuous sensor data.
# forward fill within groups
df.sort_values(['id', 'timestamp'], inplace=True)
df['value'] = df.groupby('id')['value'].ffill()
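The bullets above mention interpolation without showing it; here is a small sketch of the difference between positional and time-aware interpolation (dates and values invented). Note that `method="time"` requires a DatetimeIndex.

```python
import pandas as pd

ts = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]),
    "value": [10.0, None, None, 40.0],
}).set_index("timestamp")

# linear interpolation treats rows as evenly spaced...
linear = ts["value"].interpolate(method="linear")
# ...while time-based interpolation weights by elapsed time between stamps
timed = ts["value"].interpolate(method="time")

print(linear.tolist())  # [10.0, 20.0, 30.0, 40.0]
print(timed.tolist())   # [10.0, 17.5, 32.5, 40.0]
```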

Group-wise imputation

When values differ by subgroup, impute with group statistics.

# fill missing income with median within each city
df['income'] = df.groupby('city')['income'].transform(
    lambda x: x.fillna(x.median())
)

This uses your pandas grouping and joins knowledge — it's safer than global imputation when subpopulations differ.
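One edge case worth knowing: a group with no observed values stays NaN after the group-wise step, so a global fallback is a common follow-up. A toy sketch (data invented; the two-step pattern is the point):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "income": [50.0, None, 70.0, 80.0, None],  # city C has no observed income at all
})

# step 1: group-wise median; city C remains NaN because its group median is NaN
df["income"] = df.groupby("city")["income"].transform(lambda x: x.fillna(x.median()))
# step 2: global-median fallback catches the all-missing groups
df["income"] = df["income"].fillna(df["income"].median())

print(df["income"].tolist())  # [50.0, 50.0, 70.0, 80.0, 60.0]
```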


Advanced model-based imputation

kNN Imputer

  • Imputes based on nearest neighbors in feature space.
  • Good when relationships exist; sensitive to scaling and high dimensionality.

Iterative Imputer / MICE (Multiple Imputation by Chained Equations)

  • Builds a model per feature (e.g., BayesianRidge) and iteratively predicts missing values.
  • Produces more realistic imputations and can propagate uncertainty when repeated multiple times.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- must precede the IterativeImputer import
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# KNN example: scale first, since distances drive the imputation
knn = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=5))
X_knn = knn.fit_transform(X_train)

# Iterative (MICE-style) example
mi = IterativeImputer(random_state=0)
X_iter = mi.fit_transform(X_train)

Important: Always fit imputers only on the training data inside cross-validation to avoid data leakage.
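A minimal sketch of the leak-free pattern: put the imputer inside a Pipeline so cross-validation refits it on each training fold. The data is synthetic (`make_classification`) and the model choice is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan  # knock out 10% of values

# the imputer is refit on the training portion of every split, so the
# test fold never influences the fill values -- no leakage
pipe = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Calling `imputer.fit_transform(X)` on the full dataset before splitting would leak test-fold statistics into training.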


Categorical variables: don’t forget the weird ones

  • Replace placeholders ("NA", "none", "-", "missing") using regex before imputing: your previous string-methods and regex skills are a weapon here.
# normalize string placeholders (case-insensitive, tolerating stray whitespace,
# so "NA", "N/A", and "none" are all caught)
df['comment'] = df['comment'].replace(r'(?i)^\s*(none|n/?a|missing|-)\s*$', pd.NA, regex=True)
# then fill
df['comment'] = df['comment'].fillna('Unknown')
  • Another option: target encoding with caution — you must avoid leakage.

Capture missingness as a signal

Frequently, the fact that a value is missing is predictive. Create binary indicators:

df['age_missing'] = df['age'].isna().astype(int)

Use these indicators along with your imputed values to give models both the filled value and the missing-flag.
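scikit-learn can produce these flags for you: `SimpleImputer(add_indicator=True)` appends one indicator column per feature that had missing values. A toy sketch (array invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [40.0]])
imp = SimpleImputer(strategy="median", add_indicator=True)
Xt = imp.fit_transform(X)

# column 0: imputed values; column 1: 1.0 where the original was missing
print(Xt)
```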


Practical pipeline: sklearn-friendly approach

Use ColumnTransformer and Pipelines so imputations are repeatable and leak-free.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['age','income']
cat_cols = ['city','gender']

num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

model_pipeline = Pipeline([
    ('pre', preprocessor),
    ('clf', SomeModel())  # placeholder: substitute your estimator, e.g. LogisticRegression()
])

This keeps imputation reproducible and safe inside cross-validation.


When to prefer multiple imputation

  • If missingness is substantial and you need honest uncertainty estimates, run MICE several times with different random seeds to produce multiple imputed datasets, then pool model estimates across them (Rubin's rules).
  • Useful for inference and when you're not just chasing predictive accuracy.
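One way to sketch the "generate m plausible datasets" step with scikit-learn (this is not a full Rubin's-rules implementation): `sample_posterior=True` makes each run draw imputed values from the predictive distribution instead of using point predictions, so the runs differ. Data here is synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.15] = np.nan  # 15% missing, scattered at random

# m = 5 imputed datasets, each a different posterior draw
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# downstream you would fit one model per dataset and pool the estimates;
# here we just pool the column means as a stand-in
pooled_means = np.mean([Xi.mean(axis=0) for Xi in imputed_sets], axis=0)
print(pooled_means.shape)  # (3,)
```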

Pitfalls & practical checks

  • Don’t leak: fit imputer only on training folds.
  • If imputing with mean/median, check distribution shifts between train/test.
  • Watch for unrealistic imputations (e.g., negative ages). Clip or constrain if needed.
  • Document which columns had imputation and how — this matters for reproducibility and debugging.
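For the unrealistic-values check above, a simple impute-then-constrain pattern (values invented; -4.0 plays the part of a bad upstream fill):

```python
import pandas as pd

ages = pd.Series([30.0, None, -4.0, 45.0])

# impute with the median, then enforce the valid range for the column
ages = ages.fillna(ages.median()).clip(lower=0)
print(ages.tolist())  # [30.0, 30.0, 0.0, 45.0]
```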

Quick decision checklist

  1. Is missingness tiny? -> Drop rows.
  2. Is it time-series? -> ffill/bfill or interpolation.
  3. Is subgroup variation large? -> group-wise median/mode.
  4. Is missingness informative? -> add missing indicator.
  5. Do you need uncertainty? -> multiple imputation (MICE).
  6. Are you in cross-validation? -> Always fit imputers inside training folds.

Key takeaways

  • Imputation is a balancing act: simplicity vs. realism.
  • Use domain knowledge: sometimes 0 or "Unknown" makes sense; sometimes modeling is required.
  • Always avoid leakage by fitting imputers only on training data.
  • Combine imputed values with missingness indicators when appropriate.
  • Use pipelines so your preprocessing is reproducible and robust.

Final memorable insight: Missing data isn’t just a nuisance — it’s a feature of your dataset. Treat it like evidence, not an annoyance.


