Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Handling Missing Values
Handling Missing Values — The Emotional and Practical Makeover for Your Dataset
"Missing data isn't broken data — it's data with feelings. Treat it kindly, or your model will ghost you at deployment." — Your slightly dramatic TA
You're past the awkward stage where we talked about tidy structure and data types (remember that thrilling saga?), and you know what labels are and whether you're doing regression or classification. Now we face a reality check: real datasets have holes. Lots of holes. Some are innocent, some are lying, and some are screaming useful information at you through the void.
This guide gives you the who/why/how of missing values: how to detect them, when to impute, when to engineer missingness as a feature, and how to do all of it without leaking your validation data or accidentally teaching your model to be a fortune teller.
Quick taxonomy: Why values are missing (this matters)
- MCAR — Missing Completely At Random: The missingness has no relationship to observed or unobserved data. Example: a sensor randomly dropped a reading during transmission.
- MAR — Missing At Random: The missingness depends on observed data. Example: income is missing more often for younger respondents (age observed).
- MNAR — Missing Not At Random: Missingness depends on the unobserved value itself. Example: people with very high incomes are less likely to report income.
Why care? Because strategy changes. Imputing blindly assumes something about the missingness mechanism.
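To make that concrete, here's a tiny simulation (purely synthetic numbers, for illustration only) showing why mean imputation is roughly harmless under MCAR but quietly biased under MNAR:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.normal(50_000, 15_000, size=10_000))

# MCAR: drop 20% of values uniformly at random
mcar = income.mask(rng.random(10_000) < 0.2)

# MNAR: the highest earners are the ones who don't report
mnar = income.mask(income > income.quantile(0.8))

# Mean imputation "works" under MCAR but is badly biased under MNAR
mcar_imputed_mean = mcar.fillna(mcar.mean()).mean()
mnar_imputed_mean = mnar.fillna(mnar.mean()).mean()
print(f"true mean: {income.mean():,.0f}")
print(f"MCAR + mean impute: {mcar_imputed_mean:,.0f}")  # lands close to the truth
print(f"MNAR + mean impute: {mnar_imputed_mean:,.0f}")  # clearly too low
```

Same imputer, same code, very different honesty — the mechanism is what changed.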
First things first: detect & diagnose
- Get counts and percents
import pandas as pd
missing = df.isnull().sum().sort_values(ascending=False)
missing_percent = (missing / len(df) * 100).round(2)
pd.concat([missing, missing_percent], axis=1, keys=["n_missing", "%"])
- Visualize patterns
- Heatmaps (sns.heatmap(df.isnull()...))
- Missingness matrix (missingno.matrix)
- Pairwise patterns (missingno.heatmap or seaborn clustermap)
- Correlate missingness with target or other columns
# create a missing indicator and check correlation with target
df['age_missing'] = df['age'].isnull().astype(int)
df.groupby('age_missing')['target'].mean()
If missingness correlates with the target, you just found a feature.
Decision tree: Drop? Impute? Feature engineer?
- If a column has >~50% missing and little predictive power: consider dropping (unless domain says otherwise).
- If rows with missingness are a tiny fraction and appear MCAR: dropping rows is okay for many models.
- If missingness seems informative (MAR/MNAR): don't drop. Engineer it.
Rule of thumb: weigh the cost of losing data against the risk of a wrong imputation.
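The heuristics above can be sketched as a toy triage helper (the function name and thresholds are illustrative assumptions — tune them per domain, don't treat them as canon):

```python
import numpy as np
import pandas as pd

def triage(df, col, target, corr_cut=0.1, high=0.5, tiny=0.02):
    """Toy triage of a column's missingness; thresholds are assumptions."""
    frac = df[col].isna().mean()
    if frac == 0:
        return "nothing to do"
    # Does missingness track the target? (crude check via correlation)
    corr = df[col].isna().astype(int).corr(df[target])
    if not np.isnan(corr) and abs(corr) > corr_cut:
        return "engineer indicator"
    if frac > high:
        return "consider dropping column"
    if frac < tiny:
        return "dropping rows may be fine (if MCAR)"
    return "impute"

demo = pd.DataFrame({
    "target": [0] * 5 + [1] * 5,
    "col_a": [1, 2, 3, 4, 5] + [np.nan] * 5,                  # missing exactly when target == 1
    "col_b": [np.nan] * 3 + [4, 5] + [np.nan] * 3 + [9, 10],  # 60% missing, unrelated to target
})
print(triage(demo, "col_a", "target"))  # engineer indicator
print(triage(demo, "col_b", "target"))  # consider dropping column
```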
Basic imputation methods (fast, explainable)
- Mean/Median (numeric): good baseline, median is robust to outliers — use for MCAR or when computational simplicity matters.
- Mode (categorical): common-sense fill for categories.
- Constant fill (e.g., -999, "Unknown"): handy for tree models; beware scaling issues for linear models.
- Forward/backward fill (time-series): fill using previous/next value; only for temporally-ordered data.
Pros: simple, fast, reproducible. Cons: underestimates variance, may bias relationships.
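A minimal sketch of these baselines on a made-up frame (column names are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "city":   ["NY", "LA", None, "NY", "NY"],
    "visits": [1.0, np.nan, 3.0, np.nan, 5.0],  # pretend rows are in time order
})

# Median for numerics — robust to outliers
df["age_median"] = df["age"].fillna(df["age"].median())

# Mode for categoricals
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# Constant fill — fine for tree models, dangerous for linear models
df["age_const"] = df["age"].fillna(-999)

# Forward fill — only sensible for temporally ordered data
df["visits_ffill"] = df["visits"].ffill()
```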
Advanced imputation (for when you care about quality)
- KNN Imputer: fills a missing value using the values of the nearest neighboring rows (good for data with local structure; sensitive to feature scaling, so scale first).
- Multiple Imputation by Chained Equations (MICE / IterativeImputer): fits models for each feature conditional on others, iteratively. Preserves relationships better.
- Matrix factorization / SVD / SoftImpute: good for high-dimensional structured data (e.g., recommender systems).
- Model-based imputation: train a regression/classifier to predict the missing feature from others.
These preserve correlations but are computationally heavier and can leak if not done inside proper CV pipelines.
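A small sketch of the first two on a toy matrix whose second column is exactly twice the first (synthetic data, illustration only):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

# Second column is 2x the first, with one value knocked out
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

# KNN: average the missing feature over the nearest rows
# (scale features first in real use — distances are scale-sensitive)
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style: regress each feature on the others and iterate
X_mice = IterativeImputer(random_state=0).fit_transform(X)
# Both should land near 4.0, because they exploit the 2x relationship —
# a mean fill would have given 16/3 ≈ 5.3 and broken it
```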
Important: avoid leakage — always impute inside the pipeline
This is non-negotiable if you want honest evaluation.
- Fit your imputer (mean, iterative, etc.) on the training folds only.
- Use sklearn Pipelines/ColumnTransformer so transformations occur within cross-validation and during deployment.
Example (scikit-learn):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to use IterativeImputer)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
num_cols = ['age', 'income']
cat_cols = ['gender', 'region']
num_pipe = Pipeline([('impute', IterativeImputer()), ('scale', StandardScaler())])
cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preproc = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
pipe = Pipeline([('preproc', preproc), ('clf', RandomForestClassifier())])
scores = cross_val_score(pipe, X, y, cv=5)
If you run imputation before cross-validation, your model gets to peek at validation data — and that's how you accidentally build a cheat code.
Treat missingness as a feature — a surprisingly sexy move
Often, missingness tells a story:
- A null lab result could mean the doctor didn't order the test because they judged it unnecessary — that’s predictive.
- A blank address might indicate homelessness — also predictive for certain outcomes.
Create indicators:
- Binary flags (is_missing_age)
- Aggregated counters (n_missing_features)
- Time-since-last-observed for time series
These let your model learn that "missing" itself is meaningful instead of being an awkward patch.
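A quick sketch of the first two indicators (column names invented; `MissingIndicator` is scikit-learn's pipeline-friendly version of the same idea):

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

df = pd.DataFrame({
    "lab_result": [4.1, np.nan, 5.0, np.nan],
    "address":    ["12 Elm St", None, None, "9 Oak Ave"],
})

# Binary flag per column
df["lab_result_missing"] = df["lab_result"].isna().astype(int)

# Aggregated counter across the raw columns
df["n_missing"] = df[["lab_result", "address"]].isna().sum(axis=1)

# The sklearn equivalent of the binary flag, usable inside a Pipeline
flags = MissingIndicator().fit_transform(df[["lab_result"]])
```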
Categorical variables: special considerations
- Don't impute categorical variables with mean. Use mode, or a new category like "Missing".
- If you one-hot encode, keep the missing category separate so the model can use it.
- For rare categories, consider grouping into "Other" before imputation.
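A small illustration of the last two points (category names made up; here "rare" means seen fewer than twice):

```python
import pandas as pd

s = pd.Series(["gold", "silver", None, "gold", "platinum", None])

# Explicit "Missing" category — the model can learn from it directly
filled = s.fillna("Missing")

# Group rare categories into "Other" before encoding
counts = filled.value_counts()
rare = counts[counts < 2].index
grouped = filled.where(~filled.isin(rare), "Other")
# "Missing" survives as its own level, so one-hot encoding keeps it separate
```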
Practical heuristics & checklist
- Inspect: amounts, patterns, relation to target.
- Decide: drop variable / drop rows / impute / engineer indicator.
- Implement: use pipelines; fit imputers only on training data.
- Validate: do sensitivity analysis (try multiple strategies & compare). If results swing wildly, investigate why.
- Document: record assumptions (MCAR vs MAR vs MNAR), because future you will need that explanation.
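A sensitivity analysis can be as simple as cross-validating the same model over several imputers — synthetic data below, just to show the shape of the loop:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic classification data with 10% of cells knocked out at random
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

strategies = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "iterative": IterativeImputer(random_state=0),
}
scores = {}
for name, imputer in strategies.items():
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
# If these disagree wildly, investigate the missingness mechanism
# before trusting any single number
```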
Quick comparison table
| Method | Pros | Cons | When to use |
|---|---|---|---|
| Drop rows | Simple, unbiased if MCAR | Wastes data, biased if MAR/MNAR | Very small % missing & MCAR |
| Mean/Median | Fast, interpretable | Shrinks variance, biases relationships | Baseline, numeric MCAR |
| Constant fill | Works with tree models | Can create outliers; breaks linear models | Tree models; when missingness is informative |
| KNN | Preserves local structure | Slow, sensitive to scaling | Small/medium datasets with structure |
| MICE | Preserves multivariate relationships | Complex, iterative, heavy | When relationships matter (regression tasks) |
Closing rant (quick & useful)
Missing values are not just a nuisance — they're a diagnostic tool and potential signal. Treat them like clues in a detective novel, not ants to be stomped with a mean imputer. Use pipelines to avoid leakage, consider missingness indicators, and pick an imputation method that matches your missingness assumptions and computational budget.
Remember: whether you're doing regression predicting house prices or classification predicting churn, sloppy handling of missing data will quietly turn your evaluation metrics into fantasy. Handle carefully, validate robustly, and keep a log of the choices (future you will be thankful).
Go forth and heal those datasets. Your model (and your future self) will thank you.
Version note: This builds on tidy data, correct types, and the basics of supervised learning — now we make your data whole enough to be useful without making it a liar.