Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Imputation Strategies — Fill the Holes Without Breaking Your Model
"This is the moment where the concept finally clicks." — when you stop treating missing data like a problem to ignore and start treating it like information to respect.
If you just finished cleaning outliers and wrangling data from SQL tables with pandas (and maybe used regex to unearth sketchy placeholder strings), congratulations — you’re ready for the next battlefield: missing values. Imputation is the art (and science) of filling gaps in your dataset so models can do their work without tripping over NaNs.
This guide builds on those earlier lessons: use your pandas indexing and joins knowledge to prepare groups for group-wise imputation, and your string-methods/regex toolkit to detect placeholder values ("N/A", "none", "-999") before you start imputing.
Why imputation matters
- Many ML algorithms can't accept NaNs.
- Improper imputation leaks information or biases estimates.
- Different missingness mechanisms (MCAR/MAR/MNAR) demand different strategies.
Quick definitions (the three musketeers of missingness)
- MCAR (Missing Completely At Random): missingness unrelated to data. Rare but easiest.
- MAR (Missing At Random): missingness depends on observed data (e.g., older patients more likely to skip a test).
- MNAR (Missing Not At Random): missingness depends on the missing values themselves (e.g., high incomes hide). Hard to fix—requires modeling the missingness itself.
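A quick way to gather evidence about the mechanism is to compare an observed column across rows where another column is missing versus present. This sketch uses hypothetical `age` and `test_score` columns; a large gap between the two group means suggests MAR with respect to `age` rather than MCAR.

```python
import pandas as pd

# Toy data: 'test_score' tends to be missing for older patients (MAR-like)
df = pd.DataFrame({
    'age': [25, 30, 70, 75, 80, 28],
    'test_score': [88.0, 92.0, None, None, 75.0, 90.0],
})

# Mean age among rows where test_score is missing vs. present.
# A large difference hints the missingness depends on observed age.
summary = df.groupby(df['test_score'].isna())['age'].mean()
print(summary)
```

This is a heuristic, not a test: MNAR can never be confirmed from the observed data alone.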
Basic imputation techniques (start here)
1) Drop rows/columns
- Use only when missingness is tiny or entire column is useless.
- Fast, but potentially wasteful.
2) Constant replacement
- Numeric -> 0 or some sentinel; categorical -> "Unknown".
- Useful for tree models and when missingness itself is informative.
3) Mean / Median / Mode
- Mean: good for symmetric numeric distributions but sensitive to outliers.
- Median: robust to outliers (useful after your outlier-handling step!).
- Mode: for categorical variables.
Example (pandas):
# numeric median imputation
df['age'] = df['age'].fillna(df['age'].median())
# categorical mode
df['city'] = df['city'].fillna(df['city'].mode()[0])
Tip: If you previously detected and fixed outliers, prefer median after that step — outliers can skew the mean.
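For completeness, the first two techniques (dropping and constant replacement) look like this in pandas; the column names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'income': [50_000, None, 62_000],
    'city': ['Paris', None, 'Lyon'],
})

# 1) Drop rows with any missing value -- fast, but loses a third of this data
dropped = df.dropna()

# 2) Constant replacement: sentinel for numeric, "Unknown" for categorical
filled = df.fillna({'income': 0, 'city': 'Unknown'})
```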
Time-series specific: forward/backward fill and interpolation
- forward-fill (ffill) and backward-fill (bfill) preserve local continuity.
- interpolation (linear, time) can be powerful for continuous sensor data.
# forward fill within groups
df.sort_values(['id', 'timestamp'], inplace=True)
df['value'] = df.groupby('id')['value'].ffill()
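The interpolation option mentioned above can be sketched on a toy series; linear interpolation fills each gap on a straight line between its neighbors.

```python
import pandas as pd

# Hypothetical sensor readings with internal gaps
s = pd.Series([1.0, None, 3.0, None, None, 6.0])

# Linear interpolation fills values along the line between known points
filled = s.interpolate(method='linear')
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

For series indexed by timestamps, `method='time'` weights the fill by the actual time gaps instead of row positions.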
Group-wise imputation
When values differ by subgroup, impute with group statistics.
# fill missing income with median within each city
df['income'] = df.groupby('city')['income'].transform(
    lambda x: x.fillna(x.median())
)
This uses your pandas grouping and joins knowledge — it's safer than global imputation when subpopulations differ.
Advanced model-based imputation
kNN Imputer
- Imputes based on nearest neighbors in feature space.
- Good when relationships exist; sensitive to scaling and high dimensionality.
Iterative Imputer / MICE (Multiple Imputation by Chained Equations)
- Builds a model per feature (e.g., BayesianRidge) and iteratively predicts missing values.
- Produces more realistic imputations and can propagate uncertainty when repeated multiple times.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- required before importing IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# KNN example: scale first, since nearest-neighbor distances are scale-sensitive
knn = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=5))
X_knn = knn.fit_transform(X_train)
# Iterative (MICE-style) example
mi = IterativeImputer(random_state=0)
X_iter = mi.fit_transform(X_train)
Important: Always fit imputers only on the training data inside cross-validation to avoid data leakage.
Categorical variables: don’t forget the weird ones
- Replace placeholders ("NA", "none", "-", "missing") using regex before imputing: your previous string-methods and regex skills are a weapon here.
# normalize string placeholders
df['comment'] = df['comment'].replace(r'(?i)^\s*(none|n/a|na|missing|-)\s*$', pd.NA, regex=True)
# then fill
df['comment'] = df['comment'].fillna('Unknown')
- Another option: target encoding with caution — you must avoid leakage.
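One leak-safe pattern for target encoding is out-of-fold mean encoding: each row's encoding is computed from the *other* folds, so its own target never feeds back into its feature. A minimal sketch, assuming a binary target `y` and categorical column `city` (names illustrative):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B', 'A', 'B'],
    'y':    [1,   0,   1,   1,   1,   0],
})

df['city_te'] = float('nan')
global_mean = df['y'].mean()  # fallback for categories unseen in a fold

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    # Category means computed only on the training fold...
    fold_means = df.iloc[train_idx].groupby('city')['y'].mean()
    # ...then applied to the held-out fold
    df.loc[df.index[val_idx], 'city_te'] = (
        df.iloc[val_idx]['city'].map(fold_means).fillna(global_mean).values
    )
```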
Capture missingness as a signal
Frequently, the fact that a value is missing is predictive. Create binary indicators:
df['age_missing'] = df['age'].isna().astype(int)
Use these indicators along with your imputed values to give models both the filled value and the missing-flag.
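Inside scikit-learn pipelines, `SimpleImputer(add_indicator=True)` does both steps at once, appending a binary missing-flag column next to the imputed values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# add_indicator=True appends a missing-flag column alongside the fill
imp = SimpleImputer(strategy='median', add_indicator=True)
Xt = imp.fit_transform(X)
# column 0: imputed values (median of [1, 3] = 2.0); column 1: missingness flag
```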
Practical pipeline: sklearn-friendly approach
Use ColumnTransformer and Pipelines so imputations are repeatable and leak-free.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
num_cols = ['age','income']
cat_cols = ['city','gender']
num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])
model_pipeline = Pipeline([
    ('pre', preprocessor),
    ('clf', SomeModel())  # placeholder: substitute your estimator
])
This keeps imputation reproducible and safe inside cross-validation.
When to prefer multiple imputation
- If missingness is large and you need proper uncertainty estimates, use multiple imputation (MICE repeated) and combine model estimates across imputed datasets.
- Useful for inference and when you're not just chasing predictive accuracy.
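A minimal sketch of the idea, assuming synthetic data: rerun `IterativeImputer` with `sample_posterior=True` under several seeds, so each run draws different plausible imputations, then pool the results. (Proper MICE pooling uses Rubin's rules on downstream model estimates; averaging here is only illustrative.)

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of entries

# m imputed datasets from m seeds; sample_posterior=True makes each run
# draw from the posterior, so imputations vary between runs
imputed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

X_pooled = np.mean(imputed, axis=0)          # pooled point estimates
between_var = np.var(imputed, axis=0)        # spread reflects imputation uncertainty
```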
Pitfalls & practical checks
- Don’t leak: fit imputer only on training folds.
- If imputing with mean/median, check distribution shifts between train/test.
- Watch for unrealistic imputations (e.g., negative ages). Clip or constrain if needed.
- Document which columns had imputation and how — this matters for reproducibility and debugging.
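Constraining unrealistic imputations is a one-liner with `clip`:

```python
import pandas as pd

# Suppose a model-based imputer produced -3.2 for a missing age
age = pd.Series([34.0, -3.2, 41.0])
age = age.clip(lower=0, upper=120)  # constrain to a plausible range
print(age.tolist())  # [34.0, 0.0, 41.0]
```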
Quick decision checklist
- Is missingness tiny? -> Drop rows.
- Is it time-series? -> ffill/bfill or interpolation.
- Is subgroup variation large? -> group-wise median/mode.
- Is missingness informative? -> add missing indicator.
- Do you need uncertainty? -> multiple imputation (MICE).
- Are you in cross-validation? -> Always fit imputers inside training folds.
Key takeaways
- Imputation is a balancing act: simplicity vs. realism.
- Use domain knowledge: sometimes 0 or "Unknown" makes sense; sometimes modeling is required.
- Always avoid leakage by fitting imputers only on training data.
- Combine imputed values with missingness indicators when appropriate.
- Use pipelines so your preprocessing is reproducible and robust.
Final memorable insight: Missing data isn’t just a nuisance — it’s a feature of your dataset. Treat it like evidence, not an annoyance.