

Handling Missing Values in pandas — Clean Your Data Like a Pro

Ever opened a dataset and felt like you’re reading tea leaves because half the values are NaN? Welcome to the party. Missing data is the awkward roommate of any real-world dataset: unavoidable, sometimes useful, mostly annoying — and if you ignore it your analysis will throw shade (and wrong answers).

You’ve already learned how to select rows and columns (Indexing & Selection) and filter/query your DataFrame — two skills that are essential here. Also recall your NumPy lessons: NaNs are special floating-point values (np.nan), and vectorized ops + boolean masks are your friends when fixing data at scale.


What this guide covers

  • How to detect and quantify missing values
  • Practical strategies: drop, fill, interpolate, flag, or model missingness
  • Implementation patterns using pandas + NumPy, with code you can copy-paste
  • Tips that prevent subtle bugs (dtype changes, data leakage, performance)

"Handling missing values is not just math — it's judgement. Know the data, then choose the method."


1) Find the holes: detect and measure missingness

Start by asking: Where are the NaNs, and how many?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1,2,3,4],
    'age': [25, np.nan, 35, 40],
    'income': [50000, 60000, np.nan, 85000],
    'group': ['A', 'A', 'B', 'B']
})

# Count missing per column
print(df.isna().sum())

# Quick overview
print(df.info())

Useful checks:

  • df.isna().sum() — missing counts per column
  • df.isna().mean() — fraction missing (nice for thresholds)
  • df[df['col'].isna()] or df.query('col != col') — filter missing rows; use .loc for assignments

Relate to previous topics: use .loc and boolean masks from Indexing & Selection, or df.query from Filtering & query, to isolate missing rows and inspect patterns.
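The fraction-missing check can drive a simple threshold rule for which columns to keep. A minimal sketch using a toy DataFrame (the 25% cutoff is illustrative; choose one that fits your data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'age': [25, np.nan, 35, 40],
    'income': [50000, 60000, np.nan, 85000],
    'group': ['A', 'A', 'B', 'B'],
})

# Fraction of missing values per column
missing_frac = df.isna().mean()
print(missing_frac)

# Keep only columns with at most 20% missing (illustrative cutoff)
keep = missing_frac[missing_frac <= 0.2].index
df_kept = df[keep]
print(df_kept.columns.tolist())  # 'age' and 'income' are 25% missing, so dropped
```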


2) Decide: drop, fill, or model? (short checklist)

  • Drop rows/columns if missingness is small or column is useless: df.dropna()
  • Fill (impute) when you need to keep rows: df.fillna(value)
  • Interpolate when there's continuity/time-series structure: df.interpolate()
  • Groupwise impute when values depend on categories: df.groupby(...).transform(...)
  • Model-based imputation for advanced cases (kNN, regression, iterative imputer)

Ask: is the missingness completely random (MCAR), dependent on observed data (MAR), or dependent on the unobserved value itself (MNAR)? The answer matters: blind mean imputation can bias results.
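One quick (and admittedly rough) MAR check: see whether a column's missingness rate differs across levels of another observed variable. A sketch with a toy DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'income': [50000, np.nan, np.nan, np.nan, 85000],
})

# Missingness rate of 'income' within each 'group': a large difference
# suggests missingness depends on an observed variable (MAR), not MCAR.
rate = df['income'].isna().groupby(df['group']).mean()
print(rate)
```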


3) Common patterns with code (practical recipes)

Drop rows or columns

# drop rows with any NaN
df_clean = df.dropna()

# drop columns with more than 50% missing
# (thresh = minimum number of non-NA values required to keep a column, must be an int)
threshold = int(len(df) * 0.5)
df = df.dropna(axis=1, thresh=threshold)

Fill with a constant or statistic

# fill numeric with mean, categorical with 'missing'
df['age'] = df['age'].fillna(df['age'].mean())
df['group'] = df['group'].fillna('missing')

Caveat: mean is sensitive to outliers and can bias downstream metrics.
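To see why, compare mean and median imputation on a column containing a single extreme outlier (toy numbers):

```python
import pandas as pd
import numpy as np

s = pd.Series([30, 32, 35, np.nan, 1_000_000])  # one extreme outlier

filled_mean = s.fillna(s.mean())
filled_median = s.fillna(s.median())

print(filled_mean[3])    # 250024.25 — dragged up by the outlier
print(filled_median[3])  # 33.5 — robust to the outlier
```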

Forward/backward fill (time series and ordered data)

df = df.sort_values('id')
# forward fill
df['income'] = df['income'].ffill()
# or backward fill
df['income'] = df['income'].bfill()

Interpolate numeric sequences

# linear interpolation
df['age'] = df['age'].interpolate(method='linear')

Group-wise imputation (useful and powerful)

# fill missing income by group mean
df['income'] = df.groupby('group')['income'].transform(
    lambda x: x.fillna(x.mean())
)

This uses grouping (recall Filtering & query and Indexing skills) to preserve within-group structure.

Conditional fill with NumPy for vectorized speed

# replace missing ages with median for efficient vectorized operation
median_age = df['age'].median()
df['age'] = np.where(df['age'].isna(), median_age, df['age'])

This leverages NumPy's vectorized np.where for speed (hello NumPy background!).

Flag missing values (create a sentinel feature)

# add boolean feature: was age missing?
df['age_was_missing'] = df['age'].isna().astype(int)
# then impute
df['age'] = df['age'].fillna(df['age'].median())

Flagging can preserve information about missingness itself, which is often predictive.


4) Dtype gotchas and modern pandas types

  • Numeric columns with NaN become float64. If you want integers, use pandas' nullable integer dtype 'Int64':

# after imputation
df['some_int'] = df['some_int'].fillna(0).astype('Int64')
  • Avoid using .apply row-wise for large DataFrames — it's slow. Prefer vectorized pandas/NumPy ops.

  • Be careful with inplace=True: plain assignment (df = df.method(...)) is clearer and safer, and pandas is moving away from inplace in many methods.
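The float coercion is easy to see on a toy Series, and the nullable 'Int64' dtype lets you keep integers and missing values together without filling first:

```python
import pandas as pd
import numpy as np

# An integer column with a NaN silently becomes float64
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64

# pandas' nullable integer dtype keeps pd.NA alongside real integers
s_int = s.astype('Int64')
print(s_int.dtype)        # Int64
print(s_int.isna().sum())  # 1 — missingness is preserved, no fill needed
```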


5) Advanced tips (short but powerful)

  • Don’t leak: when preparing training/test splits, fit imputers only on training data to avoid leaking information from the test set.
  • Use scikit-learn’s SimpleImputer or IterativeImputer for pipeline-friendly, reproducible imputations.
  • For categorical features, a special category like 'MISSING' often works better than mode imputation.
  • Visualize missingness patterns with missingno or seaborn heatmaps — patterns can reveal systematic problems.
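A leakage-safe pattern in plain pandas: compute the fill statistic on the training split only, then apply it to both splits. This sketch uses toy data; scikit-learn's SimpleImputer formalizes the same idea via fit (train) and transform (train and test):

```python
import pandas as pd
import numpy as np

train = pd.DataFrame({'age': [25, np.nan, 35, 40]})
test = pd.DataFrame({'age': [np.nan, 50]})

# Fit on train only: the test set must not influence the fill value
fill_value = train['age'].median()

train_imputed = train['age'].fillna(fill_value)
test_imputed = test['age'].fillna(fill_value)
print(fill_value)  # 35.0 — median of the training ages only
```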

Quick checklist before modeling

  • Did I quantify missingness per column and per row?
  • Did I inspect whether missingness correlates with other variables? (possible bias)
  • Did I avoid leaking test data when imputing?
  • Did I choose imputation method that respects data type and distribution?
  • Did I consider flagging missingness as a feature?

Key takeaways

  • Missing values are common; detection (df.isna()) is the first step.
  • Use vectorized pandas/NumPy operations — avoid per-row apply where possible.
  • Choose strategy based on domain knowledge: drop, fill, interpolate, or model-based imputation.
  • Preserve dtype when needed (pandas nullable dtypes) and avoid leakage in ML workflows.

"A dataset without NaNs is like a calm lake — but don’t ignore the rocks under the surface. Inspect, then act."

Go forth and clean! Natural next steps: integrating imputation into a scikit-learn Pipeline, and exploring model-based imputation with tools like IterativeImputer.
