Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Imputation Strategies — Fill the Holes Without Breaking Your Model
"This is the moment where the concept finally clicks." — when you stop treating missing data like a problem to ignore and start treating it like information to respect.
If you just finished cleaning outliers and wrangling data from SQL tables with pandas (and maybe used regex to unearth sketchy placeholder strings), congratulations — you’re ready for the next battlefield: missing values. Imputation is the art (and science) of filling gaps in your dataset so models can do their work without tripping over NaNs.
This guide builds on those earlier lessons: use your pandas indexing and joins knowledge to prepare groups for group-wise imputation, and your string-methods/regex toolkit to detect placeholder values ("N/A", "none", "-999") before you start imputing.
Why imputation matters
- Many ML algorithms can't accept NaNs.
- Improper imputation leaks information or biases estimates.
- Different missingness mechanisms (MCAR/MAR/MNAR) demand different strategies.
Quick definitions (the three musketeers of missingness)
- MCAR (Missing Completely At Random): missingness unrelated to data. Rare but easiest.
- MAR (Missing At Random): missingness depends on observed data (e.g., older patients more likely to skip a test).
- MNAR (Missing Not At Random): missingness depends on the missing values themselves (e.g., high incomes hide). Hard to fix—requires modeling the missingness itself.
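A quick way to gather evidence about the mechanism is to compare an observed column across rows where another column is missing versus present. This sketch uses hypothetical `age` and `test_score` columns; a large gap between the two group means suggests MAR with respect to `age` rather than MCAR.

```python
import pandas as pd

# Toy data: 'test_score' tends to be missing for older patients (MAR-like)
df = pd.DataFrame({
    'age': [25, 30, 70, 75, 80, 28],
    'test_score': [88.0, 92.0, None, None, 75.0, 90.0],
})

# Mean age among rows where test_score is missing vs. present.
# A large difference hints the missingness depends on observed age.
summary = df.groupby(df['test_score'].isna())['age'].mean()
print(summary)
```

This is a heuristic, not a test: MNAR can never be confirmed from the observed data alone.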
Basic imputation techniques (start here)
1) Drop rows/columns
- Use only when missingness is tiny or entire column is useless.
- Fast, but potentially wasteful.
2) Constant replacement
- Numeric -> 0 or some sentinel; categorical -> "Unknown".
- Useful for tree models and when missingness itself is informative.
3) Mean / Median / Mode
- Mean: good for symmetric numeric distributions but sensitive to outliers.
- Median: robust to outliers (useful after your outlier-handling step!).
- Mode: for categorical variables.
Example (pandas):
# numeric median imputation
df['age'] = df['age'].fillna(df['age'].median())
# categorical mode
df['city'] = df['city'].fillna(df['city'].mode()[0])
Tip: If you previously detected and fixed outliers, prefer median after that step — outliers can skew the mean.
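For completeness, the first two techniques (dropping and constant replacement) look like this in pandas; the column names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'income': [50_000, None, 62_000],
    'city': ['Paris', None, 'Lyon'],
})

# 1) Drop rows with any missing value -- fast, but loses a third of this data
dropped = df.dropna()

# 2) Constant replacement: sentinel for numeric, "Unknown" for categorical
filled = df.fillna({'income': 0, 'city': 'Unknown'})
```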
Time-series specific: forward/backward fill and interpolation
- forward-fill (ffill) and backward-fill (bfill) preserve local continuity.
- interpolation (linear, time) can be powerful for continuous sensor data.
# forward fill within groups
df.sort_values(['id', 'timestamp'], inplace=True)
df['value'] = df.groupby('id')['value'].ffill()
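The interpolation option mentioned above can be sketched on a toy series; linear interpolation fills each gap on a straight line between its neighbors.

```python
import pandas as pd

# Hypothetical sensor readings with internal gaps
s = pd.Series([1.0, None, 3.0, None, None, 6.0])

# Linear interpolation fills values along the line between known points
filled = s.interpolate(method='linear')
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

For series indexed by timestamps, `method='time'` weights the fill by the actual time gaps instead of row positions.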
Group-wise imputation
When values differ by subgroup, impute with group statistics.
# fill missing income with median within each city
df['income'] = df.groupby('city')['income'].transform(
    lambda x: x.fillna(x.median())
)
This uses your pandas grouping and joins knowledge — it's safer than global imputation when subpopulations differ.
Advanced model-based imputation
kNN Imputer
- Imputes based on nearest neighbors in feature space.
- Good when relationships exist; sensitive to scaling and high dimensionality.
Iterative Imputer / MICE (Multiple Imputation by Chained Equations)
- Builds a model per feature (e.g., BayesianRidge) and iteratively predicts missing values.
- Produces more realistic imputations and can propagate uncertainty when repeated multiple times.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- required before importing IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# KNN example: scale first, since nearest-neighbor distances are scale-sensitive
knn = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=5))
X_knn = knn.fit_transform(X_train)
# Iterative (MICE-style) example
mi = IterativeImputer(random_state=0)
X_iter = mi.fit_transform(X_train)
Important: Always fit imputers only on the training data inside cross-validation to avoid data leakage.
Categorical variables: don’t forget the weird ones
- Replace placeholders ("NA", "none", "-", "missing") using regex before imputing: your previous string-methods and regex skills are a weapon here.
# normalize string placeholders
df['comment'] = df['comment'].replace(r'(?i)^\s*(none|n/a|na|missing|-)\s*$', pd.NA, regex=True)
# then fill
df['comment'] = df['comment'].fillna('Unknown')
- Another option: target encoding with caution — you must avoid leakage.
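One leak-safe pattern for target encoding is out-of-fold mean encoding: each row's encoding is computed from the *other* folds, so its own target never feeds back into its feature. A minimal sketch, assuming a binary target `y` and categorical column `city` (names illustrative):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B', 'A', 'B'],
    'y':    [1,   0,   1,   1,   1,   0],
})

df['city_te'] = float('nan')
global_mean = df['y'].mean()  # fallback for categories unseen in a fold

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    # Category means computed only on the training fold...
    fold_means = df.iloc[train_idx].groupby('city')['y'].mean()
    # ...then applied to the held-out fold
    df.loc[df.index[val_idx], 'city_te'] = (
        df.iloc[val_idx]['city'].map(fold_means).fillna(global_mean).values
    )
```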
Capture missingness as a signal
Frequently, the fact that a value is missing is predictive. Create binary indicators:
df['age_missing'] = df['age'].isna().astype(int)
Use these indicators along with your imputed values to give models both the filled value and the missing-flag.
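Inside scikit-learn pipelines, `SimpleImputer(add_indicator=True)` does both steps at once, appending a binary missing-flag column next to the imputed values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# add_indicator=True appends a missing-flag column alongside the fill
imp = SimpleImputer(strategy='median', add_indicator=True)
Xt = imp.fit_transform(X)
# column 0: imputed values (median of [1, 3] = 2.0); column 1: missingness flag
```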
Practical pipeline: sklearn-friendly approach
Use ColumnTransformer and Pipelines so imputations are repeatable and leak-free.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
num_cols = ['age','income']
cat_cols = ['city','gender']
num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])
model_pipeline = Pipeline([
    ('pre', preprocessor),
    ('clf', SomeModel())  # placeholder: substitute your estimator
])
This keeps imputation reproducible and safe inside cross-validation.
When to prefer multiple imputation
- If missingness is large and you need proper uncertainty estimates, use multiple imputation (MICE repeated) and combine model estimates across imputed datasets.
- Useful for inference and when you're not just chasing predictive accuracy.
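A minimal sketch of the idea, assuming synthetic data: rerun `IterativeImputer` with `sample_posterior=True` under several seeds, so each run draws different plausible imputations, then pool the results. (Proper MICE pooling uses Rubin's rules on downstream model estimates; averaging here is only illustrative.)

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of entries

# m imputed datasets from m seeds; sample_posterior=True makes each run
# draw from the posterior, so imputations vary between runs
imputed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

X_pooled = np.mean(imputed, axis=0)          # pooled point estimates
between_var = np.var(imputed, axis=0)        # spread reflects imputation uncertainty
```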
Pitfalls & practical checks
- Don’t leak: fit imputer only on training folds.
- If imputing with mean/median, check distribution shifts between train/test.
- Watch for unrealistic imputations (e.g., negative ages). Clip or constrain if needed.
- Document which columns had imputation and how — this matters for reproducibility and debugging.
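Constraining unrealistic imputations is a one-liner with `clip`:

```python
import pandas as pd

# Suppose a model-based imputer produced -3.2 for a missing age
age = pd.Series([34.0, -3.2, 41.0])
age = age.clip(lower=0, upper=120)  # constrain to a plausible range
print(age.tolist())  # [34.0, 0.0, 41.0]
```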
Quick decision checklist
- Is missingness tiny? -> Drop rows.
- Is it time-series? -> ffill/bfill or interpolation.
- Is subgroup variation large? -> group-wise median/mode.
- Is missingness informative? -> add missing indicator.
- Do you need uncertainty? -> multiple imputation (MICE).
- Are you in cross-validation? -> Always fit imputers inside training folds.
Key takeaways
- Imputation is a balancing act: simplicity vs. realism.
- Use domain knowledge: sometimes 0 or "Unknown" makes sense; sometimes modeling is required.
- Always avoid leakage by fitting imputers only on training data.
- Combine imputed values with missingness indicators when appropriate.
- Use pipelines so your preprocessing is reproducible and robust.
Final memorable insight: Missing data isn’t just a nuisance — it’s a feature of your dataset. Treat it like evidence, not an annoyance.