Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Detecting and Handling Outliers — pandas Tricks That Actually Work
You're already comfortable loading data with SQLAlchemy, cleaning messy strings with pandas' string methods and regex, and smoothing time-series quirks with window/rolling ops. Great — because outliers love to hide in exactly those places. This guide fast-forwards from "I can get the data" to "I can tame the weird stuff in it."
Why outliers matter (and why you should care)
- Outliers are observations that deviate markedly from the rest of the data.
- They can be data errors (typos, failed sensor reads, bad merges) or legitimate rare events (fraud, spikes, black swan events).
- Left unchecked, outliers can skew means, break models, and ruin cross-validation.
"Outliers are like that one friend who shows up to every party in a dinosaur costume — sometimes hilarious, sometimes a disaster. Know when to laugh and when to ask them to leave."
Quick workflow: Detect → Diagnose → Decide → Document
- Detect candidates using stats and visuals
- Diagnose whether they're errors, rare-but-real, or meaningful signals
- Decide to keep, transform, cap, or impute
- Document every change and create flags/features
This is not just cleanup — it's feature engineering. A flag like is_outlier can be predictive.
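The Detect → Diagnose → Decide → Document loop can be sketched as a small reusable function (a minimal sketch — `detect_and_flag_iqr` is a hypothetical helper name, and the IQR rule here is just one possible detector):

```python
import pandas as pd

def detect_and_flag_iqr(df, col, k=1.5):
    """Detect IQR outliers in `col`, add a flag column, and return the
    rule that was applied, so the pipeline leaves an audit trail."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    df[f"{col}_is_outlier"] = (df[col] < lower) | (df[col] > upper)
    # Document the decision alongside the data change
    rule = {"column": col, "method": "IQR", "k": k,
            "lower": lower, "upper": upper}
    return df, rule

df = pd.DataFrame({"x": [1, 2, 2, 3, 3, 3, 100]})
df, rule = detect_and_flag_iqr(df, "x")
```

Returning the rule dict alongside the modified frame makes the "Document" step automatic instead of an afterthought.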
Tools & techniques (with pandas code snippets)
We'll assume df is your pandas DataFrame and x is the numeric column.
1) Visual inspection
- Boxplots, histograms, scatterplots, and time-series plots (use rolling ops for context)
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(df['x'])
plt.show()
df['x'].hist(bins=50)
plt.show()
# For time series context (you used rolling ops earlier):
df['x'].rolling(window=24).mean().plot()
df['x'].plot(alpha=0.3)
plt.show()
2) IQR method (simple and robust)
- Good for skewed data; common rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
Q1 = df['x'].quantile(0.25)
Q3 = df['x'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers_iqr = df[(df['x'] < lower) | (df['x'] > upper)]
Micro explanation: IQR ignores extreme tails and is less sensitive to a few large anomalies than the mean/std.
3) Z-score and robust z-score
- Z-score = (x - mean)/std. Works for roughly normal distributions, but fragile to extreme values.
- Robust z-score uses median and MAD (median absolute deviation).
import numpy as np
# classic z-score
df['z'] = (df['x'] - df['x'].mean()) / df['x'].std()
# robust z-score using MAD
mad = (np.abs(df['x'] - df['x'].median())).median()
df['robust_z'] = 0.6745 * (df['x'] - df['x'].median()) / mad
outliers_z = df[df['z'].abs() > 3]
outliers_robust = df[df['robust_z'].abs() > 3.5]
(Why 0.6745? It scales MAD to be comparable to standard deviation for normal data.)
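You can verify that scaling empirically: for normally distributed data, MAD divided by 0.6745 (equivalently, multiplied by about 1.4826) recovers the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=2.0, size=100_000)

# MAD: median absolute deviation from the median
mad = np.median(np.abs(x - np.median(x)))

# For normal data, MAD / 0.6745 is approximately the standard deviation
print(mad / 0.6745)  # close to 2.0
print(x.std())       # close to 2.0
```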
4) Time-series specific: rolling z-score
- An observation can be normal globally but abnormal locally. Use rolling mean/std to spot local anomalies.
rolling_mean = df['x'].rolling(window=48, center=True).mean()
rolling_std = df['x'].rolling(window=48, center=True).std()
df['rolling_z'] = (df['x'] - rolling_mean) / rolling_std
local_outliers = df[df['rolling_z'].abs() > 3]
This is where your previous work with window and rolling ops pays off.
5) Categorical & string outliers
- Use pandas string methods and regex to detect weird categories or malformed entries (reference to your string/regex work).
# detect unusual email formats or stray characters
bad_emails = df[~df['email'].str.match(r"^[\w.-]+@[\w.-]+\.\w+$", na=False)]  # na=False also flags missing emails
# detect unseen categories
value_counts = df['category'].value_counts()
rare = value_counts[value_counts < 5].index
df['category_is_rare'] = df['category'].isin(rare)
Ways to handle outliers (with pros/cons)
- Remove: drop rows. Simple but risky — may remove meaningful rare events.
- Cap/Winsorize: set extreme values to percentile limits (e.g., 1st/99th). Preserves rank & sample size.
- Transform: log, Box-Cox, Yeo-Johnson — make distribution less skewed for models relying on normality.
- Impute: replace with median/neighbor values if value is clearly erroneous.
- Flag as feature: add is_outlier or outlier_distance columns to retain the information.
Examples:
# Winsorize via percentiles
lower, upper = df['x'].quantile([0.01, 0.99])
df['x_wins'] = df['x'].clip(lower, upper)
# Log transform (only positive values; requires numpy)
import numpy as np
df['x_log'] = np.log1p(df['x'].clip(lower=0))
# Flagging
df['x_is_outlier'] = ((df['x'] < lower) | (df['x'] > upper)).astype(int)
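The Box-Cox and Yeo-Johnson transforms mentioned above can be applied with scikit-learn's PowerTransformer (a sketch assuming scikit-learn is installed; the toy array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([0.5, 1.0, 2.0, 3.0, 50.0]).reshape(-1, 1)

# Yeo-Johnson handles zero and negative values; Box-Cox
# (method="box-cox") requires strictly positive inputs.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt.fit_transform(x)
```

With standardize=True the output is also zero-mean, unit-variance, which is convenient for scale-sensitive models.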
Decision guide: What should you do?
- If value is a clear data error (typo, sensor glitch) — fix or remove.
- If value is a rare but valid event and relevant to the problem (fraud detection, spikes) — keep + flag.
- If modeling requires normality or is sensitive to scale — transform or use robust models (e.g., tree-based models tolerate outliers).
- For production pipelines, prefer deterministic, documented rules (percentile caps, flag creation) rather than one-off manual edits.
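A deterministic, documented rule for production might look like this fit/transform pattern — bounds are learned once on training data and then applied unchanged to every later batch (a minimal sketch; `PercentileCapper` is a hypothetical class name):

```python
import pandas as pd

class PercentileCapper:
    """Fit winsorizing bounds once (e.g. on training data), then
    apply the same deterministic rule to any later batch."""

    def __init__(self, lower_q=0.01, upper_q=0.99):
        self.lower_q, self.upper_q = lower_q, upper_q

    def fit(self, s: pd.Series):
        # Learn the cap values from the reference data
        self.lower_, self.upper_ = s.quantile([self.lower_q, self.upper_q])
        return self

    def transform(self, s: pd.Series) -> pd.Series:
        # Apply the stored bounds; never recompute on new data
        return s.clip(self.lower_, self.upper_)

train = pd.Series(range(100))
capper = PercentileCapper().fit(train)
capped = capper.transform(pd.Series([-50, 10, 500]))
```

Recomputing percentiles on each incoming batch would make the rule drift silently; freezing the fitted bounds keeps behavior reproducible and auditable.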
Feature engineering ideas from outliers
- is_outlier: boolean flag per feature
- outlier_count: number of features on which a row is extreme
- outlier_distance: how many MADs or IQRs away the value sits
- winsorized_feature: capped copy to stabilize model inputs
These often improve predictive power — the fact an observation is extreme can be predictive on its own.
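Those features can be built in a few lines (a sketch using the MAD-based distance from earlier; `mad_distance` is a hypothetical helper name):

```python
import pandas as pd

def mad_distance(s: pd.Series) -> pd.Series:
    """How many MADs each value sits from the column median."""
    med = s.median()
    mad = (s - med).abs().median()
    # Guard against a zero MAD (constant-ish columns)
    return (s - med).abs() / mad if mad else pd.Series(0.0, index=s.index)

df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 9, 10, -80]})
num_cols = ["a", "b"]
for col in num_cols:
    df[f"{col}_dist"] = mad_distance(df[col])
    df[f"{col}_is_outlier"] = df[f"{col}_dist"] > 3.5

# Number of features on which each row is extreme
df["outlier_count"] = df[[f"{c}_is_outlier" for c in num_cols]].sum(axis=1)
```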
Final checklist before you ship
- Keep a copy of raw data.
- Log all rules used to detect/handle outliers.
- Validate model performance with and without outlier handling.
- Use visual checks (boxplots, residual plots) after transformations.
"If you can explain why you removed an observation in plain English, you probably made the right choice."
Key takeaways
- Outliers can be errors or signals — treat them thoughtfully.
- Use IQR, z-score (or robust z), and rolling statistics for detection.
- Handle with remove/cap/transform/flag — and always document.
- Outlier handling is both cleaning and feature engineering — use it to your advantage.
Now go flex those pandas muscles: combine your SQLAlchemy-powered ingestion, your regex string-cleaning, and your rolling-window intuition to build pipelines that detect, explain, and use outliers instead of letting them sabotage your models.