Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Handling Missing Values
Handling Missing Values in pandas — Clean Your Data Like a Pro
Ever opened a dataset and felt like you’re reading tea leaves because half the values are NaN? Welcome to the party. Missing data is the awkward roommate of any real-world dataset: unavoidable, sometimes useful, mostly annoying — and if you ignore it your analysis will throw shade (and wrong answers).
You’ve already learned how to select rows and columns (Indexing & Selection) and filter/query your DataFrame — two skills that are essential here. Also recall your NumPy lessons: NaNs are special floating-point values (np.nan), and vectorized ops + boolean masks are your friends when fixing data at scale.
What this guide covers
- How to detect and quantify missing values
- Practical strategies: drop, fill, interpolate, flag, or model missingness
- Implementation patterns using pandas + NumPy, with code you can copy-paste
- Tips that prevent subtle bugs (dtype changes, data leakage, performance)
"Handling missing values is not just math — it's judgement. Know the data, then choose the method."
1) Find the holes: detect and measure missingness
Start by asking: Where are the NaNs, and how many?
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': [1,2,3,4],
'age': [25, np.nan, 35, 40],
'income': [50000, 60000, np.nan, 85000],
'group': ['A', 'A', 'B', 'B']
})
# Count missing per column
print(df.isna().sum())
# Quick overview
print(df.info())
Useful checks:
- df.isna().sum() — missing counts per column
- df.isna().mean() — fraction missing (nice for thresholds)
- df[df['col'].isna()] or df.query('col != col') — filter missing rows (NaN never equals itself, so col != col is true only where col is missing); use .loc for assignments
Relate to previous topics: use .loc and boolean masks from Indexing & Selection, or df.query from Filtering & query, to isolate missing rows and inspect patterns.
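As a quick sketch of the fraction-based check, the pattern below uses df.isna().mean() to list columns whose missingness exceeds a cutoff (the 40% threshold and column names here are illustrative, not from any fixed rule):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35, 40],
    'income': [50000, np.nan, np.nan, 85000],
})

# fraction of missing values per column (isna() gives booleans; mean() averages them)
frac = df.isna().mean()

# columns exceeding a 40% missingness cutoff
bad_cols = frac[frac > 0.4].index.tolist()
print(bad_cols)  # only 'income' (50% missing) crosses the cutoff
```

The same frac Series feeds naturally into reporting or into a drop decision later on.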
2) Decide: drop, fill, or model? (short checklist)
- Drop rows/columns if missingness is small or column is useless: df.dropna()
- Fill (impute) when you need to keep rows: df.fillna(value)
- Interpolate when there's continuity/time-series structure: df.interpolate()
- Groupwise impute when values depend on categories: df.groupby(...).transform(...)
- Model-based imputation for advanced cases (kNN, regression, iterative imputer)
Ask: Is missingness random (MCAR), depends on observed data (MAR), or depends on the missing value itself (MNAR)? The choice matters: blind mean-imputation can bias results.
3) Common patterns with code (practical recipes)
Drop rows or columns
# drop rows with any NaN
df_clean = df.dropna()
# drop columns with >50% missing: thresh is the minimum number of
# non-missing values a column must have to be KEPT, and must be an int
threshold = int(len(df) * 0.5)
df = df.dropna(axis=1, thresh=threshold)
Fill with a constant or statistic
# fill numeric with mean, categorical with 'missing'
df['age'] = df['age'].fillna(df['age'].mean())
df['group'] = df['group'].fillna('missing')
Caveat: mean is sensitive to outliers and can bias downstream metrics.
Forward/backward fill (time series and ordered data)
df = df.sort_values('id')
# forward fill (carry the last valid observation forward)
df['income'] = df['income'].ffill()
# or, instead, backward fill
df['income'] = df['income'].bfill()
Interpolate numeric sequences
# linear interpolation
df['age'] = df['age'].interpolate(method='linear')
Group-wise imputation (useful and powerful)
# fill missing income by group mean
df['income'] = (
    df.groupby('group')['income']
      .transform(lambda x: x.fillna(x.mean()))
)
This uses grouping (recall Filtering & query and Indexing skills) to preserve within-group structure.
Conditional fill with NumPy for vectorized speed
# replace missing ages with median for efficient vectorized operation
median_age = df['age'].median()
df['age'] = np.where(df['age'].isna(), median_age, df['age'])
This leverages NumPy's vectorized np.where for speed (hello NumPy background!).
Flag missing values (create a sentinel feature)
# add boolean feature: was age missing?
df['age_was_missing'] = df['age'].isna().astype(int)
# then impute
df['age'] = df['age'].fillna(df['age'].median())
Flagging can preserve information about missingness itself, which is often predictive.
4) Dtype gotchas and modern pandas types
- Numeric columns with NaN become float64. If you want integers, use pandas' nullable integer dtype: 'Int64'. Example:
# after imputation
df['some_int'] = df['some_int'].fillna(0).astype('Int64')
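To see the dtype promotion concretely, here is a small sketch contrasting the default behavior with the nullable dtype:

```python
import pandas as pd
import numpy as np

# a single NaN silently promotes an integer column to float64
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64

# the nullable 'Int64' dtype keeps integers and stores missing values as <NA>
s2 = pd.Series([1, 2, None], dtype='Int64')
print(s2.dtype)  # Int64
```

This matters downstream: IDs and counts stored as floats can pick up rounding surprises and confuse joins.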
Avoid using .apply row-wise for large DataFrames — it's slow. Prefer vectorized pandas/NumPy ops.
Be careful with inplace=True: plain assignment is usually clearer and safer, and the pandas team discourages inplace for most methods.
5) Advanced tips (short but powerful)
- Don’t leak: when preparing training/test splits, fit imputers only on training data to avoid leaking information from the test set.
- Use scikit-learn’s SimpleImputer or IterativeImputer for pipeline-friendly, reproducible imputations.
- For categorical features, a special category like 'MISSING' often works better than mode imputation.
- Visualize missingness patterns with missingno or seaborn heatmaps — patterns can reveal systematic problems.
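The no-leakage rule from the first tip can be shown with plain pandas, no scikit-learn required: compute the fill statistic on the training split only, then apply that same value to both splits. A minimal sketch with toy data:

```python
import pandas as pd
import numpy as np

train = pd.DataFrame({'age': [25, np.nan, 35, 40]})
test = pd.DataFrame({'age': [np.nan, 50]})

# "fit" the imputer on training data only: the statistic sees no test rows
train_median = train['age'].median()

# "transform" both splits with the training statistic (no leakage)
train['age'] = train['age'].fillna(train_median)
test['age'] = test['age'].fillna(train_median)
print(test['age'].tolist())  # the test NaN gets the TRAIN median, 35.0
```

scikit-learn's SimpleImputer formalizes exactly this fit/transform split, which is why it drops into Pipelines so cleanly.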
Quick checklist before modeling
- Did I quantify missingness per column and per row?
- Did I inspect whether missingness correlates with other variables? (possible bias)
- Did I avoid leaking test data when imputing?
- Did I choose imputation method that respects data type and distribution?
- Did I consider flagging missingness as a feature?
Key takeaways
- Missing values are common; detection (df.isna()) is the first step.
- Use vectorized pandas/NumPy operations — avoid per-row apply where possible.
- Choose strategy based on domain knowledge: drop, fill, interpolate, or model-based imputation.
- Preserve dtype when needed (pandas nullable dtypes) and avoid leakage in ML workflows.
"A dataset without NaNs is like a calm lake — but don’t ignore the rocks under the surface. Inspect, then act."
Go forth and clean! Natural next steps: integrating imputation into a scikit-learn Pipeline, and exploring model-based imputation in depth.