
Supervised Machine Learning: Regression and Classification

Data Wrangling and Feature Engineering


Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.



Handling Missing Values — The Emotional and Practical Makeover for Your Dataset

"Missing data isn't broken data — it's data with feelings. Treat it kindly, or your model will ghost you at deployment." — Your slightly dramatic TA

You're past the awkward stage where we talked about tidy structure and data types (remember that thrilling saga?), and you know what labels are and whether you're doing regression or classification. Now we face a reality check: real datasets have holes. Lots of holes. Some are innocent, some are lying, and some are screaming useful information at you through the void.

This guide gives you the who/why/how of missing values: how to detect them, when to impute, when to engineer missingness as a feature, and how to do all of it without leaking your validation data or accidentally teaching your model to be a fortune teller.


Quick taxonomy: Why values are missing (this matters)

  • MCAR — Missing Completely At Random: The missingness has no relationship to observed or unobserved data. Example: a sensor randomly dropped a reading during transmission.
  • MAR — Missing At Random: The missingness depends on observed data. Example: income is missing more often for younger respondents (age observed).
  • MNAR — Missing Not At Random: Missingness depends on the unobserved value itself. Example: people with very high incomes are less likely to report income.

Why care? Because strategy changes. Imputing blindly assumes something about the missingness mechanism.
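To see why the mechanism matters, here is a tiny simulation (all numbers invented for illustration) where income is MAR given age: younger respondents skip the question more often, so naive statistics computed on the observed values come out biased high.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(18, 80, size=n)
# Income rises with age (plus noise) in this toy world
income = 20_000 + 1_000 * (age - 18) + rng.normal(0, 5_000, size=n)

# MAR: missingness depends on the *observed* age, not on income itself
p_missing = np.where(age < 30, 0.4, 0.05)
income_observed = np.where(rng.random(n) < p_missing, np.nan, income)

df = pd.DataFrame({"age": age, "income": income_observed})

# The observed mean overstates true income, because the values we lost
# came disproportionately from younger (lower-income) respondents
print(df["income"].mean(), income.mean())
```

Under MCAR the two means would agree up to noise; under MAR they systematically diverge, which is exactly what a blind mean-imputation would bake into your model.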


First things first: detect & diagnose

  1. Get counts and percents
import pandas as pd
missing = df.isnull().sum().sort_values(ascending=False)
missing_percent = (missing / len(df) * 100).round(2)
pd.concat([missing, missing_percent], axis=1, keys=["n_missing", "%"]) 
  2. Visualize patterns
  • Heatmaps (sns.heatmap(df.isnull()...))
  • Missingness matrix (missingno.matrix)
  • Pairwise patterns (missingno.heatmap or seaborn clustermap)
  3. Correlate missingness with target or other columns
# create a missing indicator and check correlation with target
df['age_missing'] = df['age'].isnull().astype(int)
df.groupby('age_missing')['target'].mean()

If missingness correlates with the target, you just found a feature.


Decision tree: Drop? Impute? Feature engineer?

  • If a column has >~50% missing and little predictive power: consider dropping (unless domain says otherwise).
  • If rows with missingness are a tiny fraction and appear MCAR: dropping rows is okay for many models.
  • If missingness seems informative (MAR/MNAR): don't drop. Engineer it.

Rule of thumb: the cost of losing data vs. the risk of wrong imputation.
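The bullets above can be sketched as a rough triage helper. `suggest_strategy` is a hypothetical function with illustrative thresholds, and domain knowledge always overrides it:

```python
import pandas as pd

def suggest_strategy(col: pd.Series, drop_threshold: float = 0.5) -> str:
    """Illustrative first-pass triage; thresholds are judgment calls, not law."""
    frac = col.isnull().mean()
    if frac == 0:
        return "complete: nothing to do"
    if frac > drop_threshold:
        return "mostly missing: consider dropping the column"
    if frac < 0.02:
        return "tiny fraction: dropping rows may be fine if MCAR"
    return "impute and/or add a missingness indicator"

s = pd.Series([1.0, None, 3.0, None, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(suggest_strategy(s))  # 20% missing
```

Treat the output as a conversation starter with the data, not a verdict.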


Basic imputation methods (fast, explainable)

  • Mean/Median (numeric): good baseline, median is robust to outliers — use for MCAR or when computational simplicity matters.
  • Mode (categorical): common-sense fill for categories.
  • Constant fill (e.g., -999, "Unknown"): handy for tree models; beware scaling issues for linear models.
  • Forward/backward fill (time series): fill using the previous/next value; only for temporally ordered data.

Pros: simple, fast, reproducible. Cons: underestimates variance, may bias relationships.
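A minimal sketch of these baselines, using SimpleImputer for the numeric column and plain pandas for the rest (the toy DataFrame is invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "city": ["NY", "LA", np.nan, "NY", "NY"],
    "temp": [20.1, np.nan, 19.8, np.nan, 21.0],  # assume time-ordered readings
})

# Median fill for a numeric column (robust to outliers)
df["age_filled"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Mode fill for a categorical column
df["city_filled"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill: only sensible because temp is temporally ordered
df["temp_filled"] = df["temp"].ffill()
```

Note how each column gets a strategy matched to its type; a single one-size-fits-all fill is usually the wrong call.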


Advanced imputation (for when you care about quality)

  • KNN Imputer: imputes using nearest neighbors (good for mixed missingness patterns; sensitive to scaling).
  • Multiple Imputation by Chained Equations (MICE / IterativeImputer): fits models for each feature conditional on others, iteratively. Preserves relationships better.
  • Matrix factorization / SVD / SoftImpute: good for high-dimensional structured data (e.g., recommender systems).
  • Model-based imputation: train a regression/classifier to predict the missing feature from others.

These preserve correlations but are computationally heavier and can leak if not done inside proper CV pipelines.
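A quick sketch of KNNImputer and IterativeImputer on a toy array where the second feature is exactly twice the first, so a good imputer should recover a value close to 6.0:

```python
import numpy as np
# IterativeImputer is still experimental; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],  # the true value would be 6.0
    [4.0, 8.0],
])

# KNN: average the feature over the nearest complete rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style: regress each feature on the others, iterate to convergence
X_mice = IterativeImputer(random_state=0).fit_transform(X)

print(X_knn[2, 1], X_mice[2, 1])  # both close to 6.0
```

Mean imputation would have filled in the column mean (about 4.7) and destroyed the perfect linear relationship; these methods preserve it.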


Important: avoid leakage — always impute inside the pipeline

This is non-negotiable if you want honest evaluation.

  • Fit your imputer (mean, iterative, etc.) on the training folds only.
  • Use sklearn Pipelines/ColumnTransformer so transformations occur within cross-validation and during deployment.

Example (scikit-learn):

from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — required before importing IterativeImputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

num_cols = ['age', 'income']
cat_cols = ['gender', 'region']

num_pipe = Pipeline([('impute', IterativeImputer()), ('scale', StandardScaler())])
cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                     ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preproc = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])

pipe = Pipeline([('preproc', preproc), ('clf', RandomForestClassifier())])

# X, y are your feature matrix and target; imputers are fit on the
# training folds only, inside each CV split
scores = cross_val_score(pipe, X, y, cv=5)

If you run imputation before cross-validation, your model gets to peek at validation data — and that's how you accidentally build a cheat code.


Treat missingness as a feature — a surprisingly sexy move

Often, missingness tells a story:

  • A null lab result could mean the doctor didn't order the test because they judged it unnecessary — that’s predictive.
  • A blank address might indicate homelessness — also predictive for certain outcomes.

Create indicators:

  • Binary flags (is_missing_age)
  • Aggregated counters (n_missing_features)
  • Time-since-last-observed for time series

These let your model learn that "missing" itself is meaningful instead of being an awkward patch.
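Scikit-learn can emit the binary flags for you via `add_indicator=True`, and an aggregated counter is one line of pandas (toy data invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50_000.0, 60_000.0, np.nan],
})

# Aggregated counter: how many features are missing per row
df["n_missing"] = df[["age", "income"]].isnull().sum(axis=1)

# add_indicator=True appends one binary flag column per feature
# that contained missing values
imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(df[["age", "income"]])
print(out.shape)  # (3, 4): 2 imputed columns + 2 indicator flags
```

Because the indicator is generated inside the imputer, it travels with your pipeline and stays leakage-safe.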


Categorical variables: special considerations

  • Don't impute categorical variables with the mean. Use the mode, or a new category like "Missing".
  • If you one-hot encode, keep the missing category separate so the model can use it.
  • For rare categories, consider grouping into "Other" before imputation.

Practical heuristics & checklist

  1. Inspect: amounts, patterns, relation to target.
  2. Decide: drop variable / drop rows / impute / engineer indicator.
  3. Implement: use pipelines; fit imputers only on training data.
  4. Validate: do sensitivity analysis (try multiple strategies & compare). If results swing wildly, investigate why.
  5. Document: record assumptions (MCAR vs MAR vs MNAR), because future you will need that explanation.
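Step 4, the sensitivity analysis, can be as simple as looping imputation strategies inside a pipeline and comparing cross-validated scores; here a synthetic dataset with values knocked out at random stands in for your data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[rng.random(X.shape) < 0.15] = np.nan  # simulate ~15% MCAR missingness

for strategy in ["mean", "median", "most_frequent"]:
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy=strategy)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{strategy:>13}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the strategies disagree wildly on your data, that is your cue to dig into the missingness mechanism rather than pick the winner blindly.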

Quick comparison table

| Method | Pros | Cons | When to use |
| --- | --- | --- | --- |
| Drop rows | Simple, unbiased if MCAR | Wastes data, biased if MAR/MNAR | Very small % missing & MCAR |
| Mean/Median | Fast, interpretable | Shrinks variance, biases relationships | Baseline, numeric MCAR |
| Constant fill | Works with tree models | Can create outliers; breaks linear models | Tree models; when missingness is informative |
| KNN | Preserves local structure | Slow, sensitive to scaling | Small/medium datasets with structure |
| MICE | Preserves multivariate relationships | Complex, iterative, heavy | When relationships matter (regression tasks) |

Closing rant (quick & useful)

Missing values are not just a nuisance — they're a diagnostic tool and potential signal. Treat them like clues in a detective novel, not ants to be stomped with a mean imputer. Use pipelines to avoid leakage, consider missingness indicators, and pick an imputation method that matches your missingness assumptions and computational budget.

Remember: whether you're doing regression predicting house prices or classification predicting churn, sloppy handling of missing data will quietly turn your evaluation metrics into fantasy. Handle carefully, validate robustly, and keep a log of the choices (future you will be thankful).

Go forth and heal those datasets. Your model (and your future self) will thank you.


Version note: This builds on tidy data, correct types, and the basics of supervised learning — now we make your data whole enough to be useful without making it a liar.
