Data Cleaning and Feature Engineering
Prepare high-quality datasets and craft informative features using robust, repeatable pipelines.
Imputation Strategies
Imputation Strategies — The Patch-Up Party Your Dataset Secretly Needs
"Data doesn't go missing because it's shy — it goes missing because something in the process broke. Your job: make the data whole enough for the model to stop crying." — Your wildly dramatic TA
You've already learned how to inspect data quality (remember: missingness patterns, weird dtypes) and spot outliers (the boundary-throwing rebels). You can manipulate arrays and tables like a sorcerer with NumPy and Pandas. Now we level up: when values are missing, what do you do besides pleading with the dataset? Welcome to imputation — the art of filling holes without creating statistical Frankenstein monsters.
Why imputation matters (and why deletion is not always the hero)
- Dropping rows with missing values is easy, but often wasteful — you may lose valuable signal, introduce bias, or shrink your sample to uselessness.
- Imputation aims to restore data so downstream models can learn without being derailed by NaNs.
Quick link-back: from Data Quality Assessment you should already know whether missingness looks random or structured. That informs which imputation strategy won't lie to your model.
Know thy enemy: Missingness mechanisms (short & spicy)
- MCAR (Missing Completely At Random): missingness is independent of data. Treatable with simpler methods.
- MAR (Missing At Random): missingness depends on observed data (e.g., younger people are more likely to skip the income question). Use conditional or model-based methods.
- MNAR (Missing Not At Random): missingness depends on the missing value itself (sneaky — e.g., people hide high incomes). Requires careful domain work or explicit modeling of the missingness process.
Ask: "Does the pattern of missing values correlate with other columns?" If yes → likely MAR.
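One quick way to probe that question in pandas (a sketch on an invented frame; the `age` and `income` columns are hypothetical) is to encode missingness as 0/1 and correlate it with the observed columns:

```python
import numpy as np
import pandas as pd

# Hypothetical survey: younger respondents tend to skip the income question
df = pd.DataFrame({
    "age": [22, 25, 31, 40, 52, 60],
    "income": [np.nan, np.nan, 48000, 61000, 75000, 80000],
})

# Encode missingness as 0/1 and correlate it with an observed column
income_missing = df["income"].isna().astype(int)
corr_with_age = income_missing.corr(df["age"])
# A clearly nonzero correlation suggests MAR rather than MCAR
print(corr_with_age)
```

A correlation near zero does not prove MCAR (the dependence could be nonlinear or involve several columns), but a strong one is a cheap red flag.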
The Imputation Arsenal (what to try, when, and why)
1) Do nothing tactically
- Add a missing indicator column (e.g., `col_is_missing`) to capture the fact that something was missing. Useful with model-based imputation.
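A minimal sketch (the `age` column is made up): record the indicator before overwriting the NaNs, because the fill destroys that information:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan]})

# Capture the indicator first; after fillna the missingness is gone
df["age_is_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())
```

The model can now learn from both the filled value and the fact that it was filled.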
2) Simple statistic imputation (mean/median/mode)
- Code (Pandas):

```python
# numeric
df['age'] = df['age'].fillna(df['age'].median())
# categorical
df['city'] = df['city'].fillna(df['city'].mode()[0])
```
- Use when: MCAR or quick baselines; prefer the mean for roughly symmetric distributions and the median for skewed ones.
- Watch out: shrinks variance, can bias estimates if missingness not MCAR.
3) Group-wise imputation
- Use aggregate values within groups (e.g., the median within `occupation`):

```python
df['salary'] = df.groupby('occupation')['salary'].transform(lambda x: x.fillna(x.median()))
```
- Use when different segments have different distributions (builds on your DataFrame manipulation skills).
4) Forward/backward fill & interpolation (time-series friendly)
```python
df['value'] = df['value'].ffill()                        # forward fill
df['value'] = df['value'].interpolate(method='linear')   # or linear interpolation
```

- Use when data points are ordered (time series) and missing values form short gaps.
5) KNN Imputation
- Uses nearest neighbors in feature space to infer missing values.
- From scikit-learn:
```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```
- Good when local structure matters; requires feature scaling and is sensitive to irrelevant features.
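Because the distance metric is scale-sensitive, it helps to standardize before imputing and map back afterwards. A sketch on a toy array (the values are invented; scikit-learn's scalers ignore NaNs when fitting):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

X = np.array([
    [1.0, 100.0],
    [2.0, np.nan],
    [3.0, 300.0],
    [np.nan, 400.0],
])

# Scale first so both columns contribute comparably to the distance metric
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # NaNs are disregarded in fit, kept in transform

# Impute in the scaled space, then map back to the original units
imputer = KNNImputer(n_neighbors=2)
X_imputed = scaler.inverse_transform(imputer.fit_transform(X_scaled))
```

Observed cells round-trip unchanged; only the NaNs are replaced by neighbor averages expressed on the original scale.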
6) Iterative / Model-based imputation (MICE, IterativeImputer)
- Build predictive models for each feature with missing data, iteratively filling in values.
```python
# this experimental import must run before IterativeImputer can be imported
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```
- Powerful, preserves multivariate relationships, but computationally heavier and can overfit if not careful.
7) Domain-driven / custom imputation
- e.g., replace missing `temperature` with a sensor-specific historical mean, or mark with sentinel values if missingness is informative.
- Use when: business logic dictates the substitution.
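For instance (the sensor IDs and historical means below are invented for illustration), a lookup-table fill might look like:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; per-sensor historical means come from domain knowledge
readings = pd.DataFrame({
    "sensor_id": ["A", "A", "B", "B"],
    "temperature": [21.5, np.nan, 18.0, np.nan],
})
historical_mean = {"A": 21.0, "B": 18.5}  # assumed lookup table

# Fill each gap with the mean recorded for that particular sensor
readings["temperature"] = readings["temperature"].fillna(
    readings["sensor_id"].map(historical_mean)
)
```

The same `map`-then-`fillna` pattern works for any business-rule substitution table.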
A compact comparison table
| Strategy | Pros | Cons | When to use |
|---|---|---|---|
| Drop rows | Simple | Wastes data, can bias | Tiny % missing, MCAR |
| Mean/Median/Mode | Fast, simple | Reduces variance | Baseline, MCAR |
| Group-wise | Respects segment differences | Needs good grouping | MAR by group |
| Interpolation | Keeps temporal continuity | Not for non-temporal | Time series |
| KNN | Nonlinear local patterns | Sensitive to scaling | Local structure |
| Iterative/MICE | Preserves multivariate links | Compute heavy, risk of leakage | Complex MAR situations |
Practical workflow: How I decide (step-by-step)
- Assess missingness (from Data Quality Assessment): fraction missing, patterns, correlations.
- Decide if dropping is acceptable: if <1% and MCAR — maybe drop. Otherwise, impute.
- Choose simple first: mean/median or group-wise to baseline model performance.
- Add missing indicators for columns you impute — missingness itself can be predictive.
- Try smarter methods: KNN or Iterative if baseline performs poorly or if relationships matter.
- Validate: use cross-validation and compare models trained on different imputations. Check downstream metric and distributional changes.
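One way to run that comparison (a sketch on synthetic data; Ridge, the 10% MCAR mask, and the two imputers are arbitrary choices) is to keep each imputer inside a pipeline and cross-validate the whole thing:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy regression data with ~10% of values knocked out at random (MCAR)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Keeping the imputer inside the pipeline means each CV fold
# learns its fill statistics from its own training split only
for name, imputer in [("median", SimpleImputer(strategy="median")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    pipe = make_pipeline(imputer, Ridge())
    scores = cross_val_score(pipe, X, y, cv=5)
    print(name, scores.mean().round(3))
```

Compare the mean scores, but also re-plot the imputed feature distributions before declaring a winner.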
Ask yourself: "Does this imputation change the distribution or relationships in ways that would mislead my model?"
Interaction with Outliers
Outliers (you learned earlier) can ruin mean imputation. Use robust statistics (median, trimmed mean) or cap outliers before computing imputation statistics. Conversely, imputation can create outliers — always re-check distributions after filling.
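A tiny sketch of why this matters (the salary figures are invented): one extreme value drags the mean far from anything typical, while the median barely moves:

```python
import numpy as np
import pandas as pd

# One extreme salary distorts the mean; the median stays representative
s = pd.Series([40_000, 45_000, 50_000, np.nan, 1_000_000])
print(s.mean())    # 283750.0, inflated by the outlier
print(s.median())  # 47500.0, robust

s_filled = s.fillna(s.median())
```

Filling with the mean here would invent a salary no real row resembles; the median keeps the filled value inside the bulk of the distribution.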
Mini check-list before you ship your dataset
- Did I add missing indicators where appropriate?
- Did I choose an imputation method consistent with missingness mechanism?
- Did I scale and feature-engineer before model-based imputation when required?
- Did I validate via cross-val and inspect distributions after imputation?
- Did I avoid leaking target information into imputation (no peeking)?
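A leakage-safe sketch for that last point (toy arrays, invented values): fit the imputer on the training split only, then apply the learned statistics to the test split:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan]])
y = np.array([0, 0, 1, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training fold only; test rows are filled with train statistics,
# so no information flows from the test set back into preprocessing
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
```

Calling `fit_transform` on the full dataset before splitting is the classic mistake: the test rows would then have influenced the fill statistics.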
Closing: The emotional arc of imputation
Imputation is part stat, part empathy: you're guessing what the data would've said if it hadn’t ghosted you. Start simple, validate thoroughly, and remember: more complex imputation is not always better. Use domain knowledge as your north star — models will forgive clever math, but they still prefer truth.
Imputation isn't about pretending missing values never happened; it's about giving your model enough honest, defensible information to stop flailing.
Key takeaways
- Identify missingness type (MCAR/MAR/MNAR). Use that to pick strategy.
- Start with simple methods, add missingness indicators, then escalate to model-based imputation if needed.
- Watch out for outliers, leakage, and altered distributions.
Now go forth: patch your dataset, outsmart the NaNs, and then treat your cleaned data to a nice visualization — you've earned it. (And if a model still misbehaves? You may have to interrogate the data collection process — or drink more coffee.)
Version note: This builds directly on your prior work with Pandas/NumPy and earlier modules on Data Quality and Outlier Detection — use those skills to inspect and pre-process before you impute.