Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Handling Missing Values
Handling Missing Values in pandas — Clean Your Data Like a Pro
Ever opened a dataset and felt like you’re reading tea leaves because half the values are NaN? Welcome to the party. Missing data is the awkward roommate of any real-world dataset: unavoidable, sometimes useful, mostly annoying — and if you ignore it your analysis will throw shade (and wrong answers).
You’ve already learned how to select rows and columns (Indexing & Selection) and filter/query your DataFrame — two skills that are essential here. Also recall your NumPy lessons: NaNs are special floating-point values (np.nan), and vectorized ops + boolean masks are your friends when fixing data at scale.
What this guide covers
- How to detect and quantify missing values
- Practical strategies: drop, fill, interpolate, flag, or model missingness
- Implementation patterns using pandas + NumPy, with code you can copy-paste
- Tips that prevent subtle bugs (dtype changes, data leakage, performance)
"Handling missing values is not just math — it's judgement. Know the data, then choose the method."
1) Find the holes: detect and measure missingness
Start by asking: Where are the NaNs, and how many?
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': [1,2,3,4],
'age': [25, np.nan, 35, 40],
'income': [50000, 60000, np.nan, 85000],
'group': ['A', 'A', 'B', 'B']
})
# Count missing per column
print(df.isna().sum())
# Quick overview
print(df.info())
Useful checks:
- df.isna().sum() — missing counts per column
- df.isna().mean() — fraction missing (nice for thresholds)
- df[df['col'].isna()] or df.query('col != col') — filter missing rows (NaN never equals itself, so col != col is true only where col is missing); use .loc for assignments
Relate to previous topics: use .loc and boolean masks from Indexing & Selection, or df.query from Filtering & query, to isolate missing rows and inspect patterns.
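As a quick sketch of the fraction-based check, the pattern below uses df.isna().mean() to list columns whose missingness exceeds a cutoff (the 40% threshold and column names here are illustrative, not from any fixed rule):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35, 40],
    'income': [50000, np.nan, np.nan, 85000],
})

# fraction of missing values per column (isna() gives booleans; mean() averages them)
frac = df.isna().mean()

# columns exceeding a 40% missingness cutoff
bad_cols = frac[frac > 0.4].index.tolist()
print(bad_cols)  # only 'income' (50% missing) crosses the cutoff
```

The same frac Series feeds naturally into reporting or into a drop decision later on.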
2) Decide: drop, fill, or model? (short checklist)
- Drop rows/columns if missingness is small or column is useless: df.dropna()
- Fill (impute) when you need to keep rows: df.fillna(value)
- Interpolate when there's continuity/time-series structure: df.interpolate()
- Groupwise impute when values depend on categories: df.groupby(...).transform(...)
- Model-based imputation for advanced cases (kNN, regression, iterative imputer)
Ask: Is missingness random (MCAR), depends on observed data (MAR), or depends on the missing value itself (MNAR)? The choice matters: blind mean-imputation can bias results.
3) Common patterns with code (practical recipes)
Drop rows or columns
# drop rows with any NaN
df_clean = df.dropna()
# drop columns with >50% missing: thresh is the minimum number of
# non-missing values a column must have to be KEPT, and must be an int
threshold = int(len(df) * 0.5)
df = df.dropna(axis=1, thresh=threshold)
Fill with a constant or statistic
# fill numeric with mean, categorical with 'missing'
df['age'] = df['age'].fillna(df['age'].mean())
df['group'] = df['group'].fillna('missing')
Caveat: mean is sensitive to outliers and can bias downstream metrics.
Forward/backward fill (time series and ordered data)
df = df.sort_values('id')
# forward fill (carry the last valid observation forward)
df['income'] = df['income'].ffill()
# or, instead, backward fill
df['income'] = df['income'].bfill()
Interpolate numeric sequences
# linear interpolation
df['age'] = df['age'].interpolate(method='linear')
Group-wise imputation (useful and powerful)
# fill missing income by group mean
df['income'] = (
    df.groupby('group')['income']
      .transform(lambda x: x.fillna(x.mean()))
)
This uses grouping (recall Filtering & query and Indexing skills) to preserve within-group structure.
Conditional fill with NumPy for vectorized speed
# replace missing ages with median for efficient vectorized operation
median_age = df['age'].median()
df['age'] = np.where(df['age'].isna(), median_age, df['age'])
This leverages NumPy's vectorized np.where for speed (hello NumPy background!).
Flag missing values (create a sentinel feature)
# add boolean feature: was age missing?
df['age_was_missing'] = df['age'].isna().astype(int)
# then impute
df['age'] = df['age'].fillna(df['age'].median())
Flagging can preserve information about missingness itself, which is often predictive.
4) Dtype gotchas and modern pandas types
- Numeric columns with NaN become float64. If you want integers, use pandas' nullable integer dtype: 'Int64'. Example:
# after imputation
df['some_int'] = df['some_int'].fillna(0).astype('Int64')
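To see the dtype promotion concretely, here is a small sketch contrasting the default behavior with the nullable dtype:

```python
import pandas as pd
import numpy as np

# a single NaN silently promotes an integer column to float64
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64

# the nullable 'Int64' dtype keeps integers and stores missing values as <NA>
s2 = pd.Series([1, 2, None], dtype='Int64')
print(s2.dtype)  # Int64
```

This matters downstream: IDs and counts stored as floats can pick up rounding surprises and confuse joins.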
Avoid using .apply row-wise for large DataFrames — it's slow. Prefer vectorized pandas/NumPy ops.
Be careful with inplace=True: plain assignment is usually clearer and safer, and the pandas team discourages inplace for most methods.
5) Advanced tips (short but powerful)
- Don’t leak: when preparing training/test splits, fit imputers only on training data to avoid leaking information from the test set.
- Use scikit-learn’s SimpleImputer or IterativeImputer for pipeline-friendly, reproducible imputations.
- For categorical features, a special category like 'MISSING' often works better than mode imputation.
- Visualize missingness patterns with missingno or seaborn heatmaps — patterns can reveal systematic problems.
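The no-leakage rule from the first tip can be shown with plain pandas, no scikit-learn required: compute the fill statistic on the training split only, then apply that same value to both splits. A minimal sketch with toy data:

```python
import pandas as pd
import numpy as np

train = pd.DataFrame({'age': [25, np.nan, 35, 40]})
test = pd.DataFrame({'age': [np.nan, 50]})

# "fit" the imputer on training data only: the statistic sees no test rows
train_median = train['age'].median()

# "transform" both splits with the training statistic (no leakage)
train['age'] = train['age'].fillna(train_median)
test['age'] = test['age'].fillna(train_median)
print(test['age'].tolist())  # the test NaN gets the TRAIN median, 35.0
```

scikit-learn's SimpleImputer formalizes exactly this fit/transform split, which is why it drops into Pipelines so cleanly.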
Quick checklist before modeling
- Did I quantify missingness per column and per row?
- Did I inspect whether missingness correlates with other variables? (possible bias)
- Did I avoid leaking test data when imputing?
- Did I choose imputation method that respects data type and distribution?
- Did I consider flagging missingness as a feature?
Key takeaways
- Missing values are common; detection (df.isna()) is the first step.
- Use vectorized pandas/NumPy operations — avoid per-row apply where possible.
- Choose strategy based on domain knowledge: drop, fill, interpolate, or model-based imputation.
- Preserve dtype when needed (pandas nullable dtypes) and avoid leakage in ML workflows.
"A dataset without NaNs is like a calm lake — but don’t ignore the rocks under the surface. Inspect, then act."
Go forth and clean! Natural next steps: integrating imputation into a scikit-learn Pipeline, and exploring model-based imputation in depth.