Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Handling Missing Values
Handling Missing Values — The Emotional and Practical Makeover for Your Dataset
"Missing data isn't broken data — it's data with feelings. Treat it kindly, or your model will ghost you at deployment." — Your slightly dramatic TA
You're past the awkward stage where we talked about tidy structure and data types (remember that thrilling saga?), and you know what labels are and whether you're doing regression or classification. Now we face a reality check: real datasets have holes. Lots of holes. Some are innocent, some are lying, and some are screaming useful information at you through the void.
This guide gives you the who/why/how of missing values: how to detect them, when to impute, when to engineer missingness as a feature, and how to do all of it without leaking your validation data or accidentally teaching your model to be a fortune teller.
Quick taxonomy: Why values are missing (this matters)
- MCAR — Missing Completely At Random: The missingness has no relationship to observed or unobserved data. Example: a sensor randomly dropped a reading during transmission.
- MAR — Missing At Random: The missingness depends on observed data. Example: income is missing more often for younger respondents (age observed).
- MNAR — Missing Not At Random: Missingness depends on the unobserved value itself. Example: people with very high incomes are less likely to report income.
Why care? Because strategy changes. Imputing blindly assumes something about the missingness mechanism.
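To make that concrete, here's a tiny simulation (purely synthetic numbers, for illustration only) showing why mean imputation is roughly harmless under MCAR but quietly biased under MNAR:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.normal(50_000, 15_000, size=10_000))

# MCAR: drop 20% of values uniformly at random
mcar = income.mask(rng.random(10_000) < 0.2)

# MNAR: the highest earners are the ones who don't report
mnar = income.mask(income > income.quantile(0.8))

# Mean imputation "works" under MCAR but is badly biased under MNAR
mcar_imputed_mean = mcar.fillna(mcar.mean()).mean()
mnar_imputed_mean = mnar.fillna(mnar.mean()).mean()
print(f"true mean: {income.mean():,.0f}")
print(f"MCAR + mean impute: {mcar_imputed_mean:,.0f}")  # lands close to the truth
print(f"MNAR + mean impute: {mnar_imputed_mean:,.0f}")  # clearly too low
```

Same imputer, same code, very different honesty — the mechanism is what changed.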
First things first: detect & diagnose
- Get counts and percents
import pandas as pd
missing = df.isnull().sum().sort_values(ascending=False)
missing_percent = (missing / len(df) * 100).round(2)
pd.concat([missing, missing_percent], axis=1, keys=["n_missing", "%"])
- Visualize patterns
- Heatmaps (sns.heatmap(df.isnull()...))
- Missingness matrix (missingno.matrix)
- Pairwise patterns (missingno.heatmap or seaborn clustermap)
- Correlate missingness with target or other columns
# create a missing indicator and check correlation with target
df['age_missing'] = df['age'].isnull().astype(int)
df.groupby('age_missing')['target'].mean()
If missingness correlates with the target, you just found a feature.
Decision tree: Drop? Impute? Feature engineer?
- If a column has >~50% missing and little predictive power: consider dropping (unless domain says otherwise).
- If rows with missingness are a tiny fraction and appear MCAR: dropping rows is okay for many models.
- If missingness seems informative (MAR/MNAR): don't drop. Engineer it.
Rule of thumb: weigh the cost of losing data against the risk of a wrong imputation.
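The heuristics above can be sketched as a toy triage helper (the function name and thresholds are illustrative assumptions — tune them per domain, don't treat them as canon):

```python
import numpy as np
import pandas as pd

def triage(df, col, target, corr_cut=0.1, high=0.5, tiny=0.02):
    """Toy triage of a column's missingness; thresholds are assumptions."""
    frac = df[col].isna().mean()
    if frac == 0:
        return "nothing to do"
    # Does missingness track the target? (crude check via correlation)
    corr = df[col].isna().astype(int).corr(df[target])
    if not np.isnan(corr) and abs(corr) > corr_cut:
        return "engineer indicator"
    if frac > high:
        return "consider dropping column"
    if frac < tiny:
        return "dropping rows may be fine (if MCAR)"
    return "impute"

demo = pd.DataFrame({
    "target": [0] * 5 + [1] * 5,
    "col_a": [1, 2, 3, 4, 5] + [np.nan] * 5,                  # missing exactly when target == 1
    "col_b": [np.nan] * 3 + [4, 5] + [np.nan] * 3 + [9, 10],  # 60% missing, unrelated to target
})
print(triage(demo, "col_a", "target"))  # engineer indicator
print(triage(demo, "col_b", "target"))  # consider dropping column
```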
Basic imputation methods (fast, explainable)
- Mean/Median (numeric): good baseline, median is robust to outliers — use for MCAR or when computational simplicity matters.
- Mode (categorical): common-sense fill for categories.
- Constant fill (e.g., -999, "Unknown"): handy for tree models; beware scaling issues for linear models.
- Forward/backward fill (time-series): fill using previous/next value; only for temporally-ordered data.
Pros: simple, fast, reproducible. Cons: underestimates variance, may bias relationships.
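A minimal sketch of these baselines on a made-up frame (column names are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "city":   ["NY", "LA", None, "NY", "NY"],
    "visits": [1.0, np.nan, 3.0, np.nan, 5.0],  # pretend rows are in time order
})

# Median for numerics — robust to outliers
df["age_median"] = df["age"].fillna(df["age"].median())

# Mode for categoricals
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# Constant fill — fine for tree models, dangerous for linear models
df["age_const"] = df["age"].fillna(-999)

# Forward fill — only sensible for temporally ordered data
df["visits_ffill"] = df["visits"].ffill()
```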
Advanced imputation (for when you care about quality)
- KNN Imputer: fills a missing value using the values of the nearest neighboring rows (good for data with local structure; sensitive to feature scaling, so scale first).
- Multiple Imputation by Chained Equations (MICE / IterativeImputer): fits models for each feature conditional on others, iteratively. Preserves relationships better.
- Matrix factorization / SVD / SoftImpute: good for high-dimensional structured data (e.g., recommender systems).
- Model-based imputation: train a regression/classifier to predict the missing feature from others.
These preserve correlations but are computationally heavier and can leak if not done inside proper CV pipelines.
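A small sketch of the first two on a toy matrix whose second column is exactly twice the first (synthetic data, illustration only):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

# Second column is 2x the first, with one value knocked out
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

# KNN: average the missing feature over the nearest rows
# (scale features first in real use — distances are scale-sensitive)
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style: regress each feature on the others and iterate
X_mice = IterativeImputer(random_state=0).fit_transform(X)
# Both should land near 4.0, because they exploit the 2x relationship —
# a mean fill would have given 16/3 ≈ 5.3 and broken it
```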
Important: avoid leakage — always impute inside the pipeline
This is non-negotiable if you want honest evaluation.
- Fit your imputer (mean, iterative, etc.) on the training folds only.
- Use sklearn Pipelines/ColumnTransformer so transformations occur within cross-validation and during deployment.
Example (scikit-learn):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to use IterativeImputer)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
num_cols = ['age', 'income']
cat_cols = ['gender', 'region']
num_pipe = Pipeline([('impute', IterativeImputer()), ('scale', StandardScaler())])
cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preproc = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
pipe = Pipeline([('preproc', preproc), ('clf', RandomForestClassifier())])
scores = cross_val_score(pipe, X, y, cv=5)
If you run imputation before cross-validation, your model gets to peek at validation data — and that's how you accidentally build a cheat code.
Treat missingness as a feature — a surprisingly sexy move
Often, missingness tells a story:
- A null lab result could mean the doctor didn't order the test because they judged it unnecessary — that’s predictive.
- A blank address might indicate homelessness — also predictive for certain outcomes.
Create indicators:
- Binary flags (is_missing_age)
- Aggregated counters (n_missing_features)
- Time-since-last-observed for time series
These let your model learn that "missing" itself is meaningful instead of being an awkward patch.
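A quick sketch of the first two indicators (column names invented; `MissingIndicator` is scikit-learn's pipeline-friendly version of the same idea):

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

df = pd.DataFrame({
    "lab_result": [4.1, np.nan, 5.0, np.nan],
    "address":    ["12 Elm St", None, None, "9 Oak Ave"],
})

# Binary flag per column
df["lab_result_missing"] = df["lab_result"].isna().astype(int)

# Aggregated counter across the raw columns
df["n_missing"] = df[["lab_result", "address"]].isna().sum(axis=1)

# The sklearn equivalent of the binary flag, usable inside a Pipeline
flags = MissingIndicator().fit_transform(df[["lab_result"]])
```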
Categorical variables: special considerations
- Don't impute categorical variables with mean. Use mode, or a new category like "Missing".
- If you one-hot encode, keep the missing category separate so the model can use it.
- For rare categories, consider grouping into "Other" before imputation.
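A small illustration of the last two points (category names made up; here "rare" means seen fewer than twice):

```python
import pandas as pd

s = pd.Series(["gold", "silver", None, "gold", "platinum", None])

# Explicit "Missing" category — the model can learn from it directly
filled = s.fillna("Missing")

# Group rare categories into "Other" before encoding
counts = filled.value_counts()
rare = counts[counts < 2].index
grouped = filled.where(~filled.isin(rare), "Other")
# "Missing" survives as its own level, so one-hot encoding keeps it separate
```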
Practical heuristics & checklist
- Inspect: amounts, patterns, relation to target.
- Decide: drop variable / drop rows / impute / engineer indicator.
- Implement: use pipelines; fit imputers only on training data.
- Validate: do sensitivity analysis (try multiple strategies & compare). If results swing wildly, investigate why.
- Document: record assumptions (MCAR vs MAR vs MNAR), because future you will need that explanation.
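A sensitivity analysis can be as simple as cross-validating the same model over several imputers — synthetic data below, just to show the shape of the loop:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic classification data with 10% of cells knocked out at random
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

strategies = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "iterative": IterativeImputer(random_state=0),
}
scores = {}
for name, imputer in strategies.items():
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
# If these disagree wildly, investigate the missingness mechanism
# before trusting any single number
```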
Quick comparison table
| Method | Pros | Cons | When to use |
|---|---|---|---|
| Drop rows | Simple, unbiased if MCAR | Wastes data, biased if MAR/MNAR | Very small % missing & MCAR |
| Mean/Median | Fast, interpretable | Shrinks variance, biases relationships | Baseline, numeric MCAR |
| Constant fill | Works with tree models | Can create outliers; breaks linear models | Tree models; when missingness is informative |
| KNN | Preserves local structure | Slow, sensitive to scaling | Small/medium datasets with structure |
| MICE | Preserves multivariate relationships | Complex, iterative, heavy | When relationships matter (regression tasks) |
Closing rant (quick & useful)
Missing values are not just a nuisance — they're a diagnostic tool and potential signal. Treat them like clues in a detective novel, not ants to be stomped with a mean imputer. Use pipelines to avoid leakage, consider missingness indicators, and pick an imputation method that matches your missingness assumptions and computational budget.
Remember: whether you're doing regression predicting house prices or classification predicting churn, sloppy handling of missing data will quietly turn your evaluation metrics into fantasy. Handle carefully, validate robustly, and keep a log of the choices (future you will be thankful).
Go forth and heal those datasets. Your model (and your future self) will thank you.
Version note: This builds on tidy data, correct types, and the basics of supervised learning — now we make your data whole enough to be useful without making it a liar.