Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Detecting and Handling Outliers — pandas Tricks That Actually Work
You're already comfortable loading data with SQLAlchemy, cleaning messy strings with pandas' string methods and regex, and smoothing time-series quirks with window/rolling ops. Great — because outliers love to hide in exactly those places. This guide fast-forwards from "I can get the data" to "I can tame the weird stuff in it."
Why outliers matter (and why you should care)
- Outliers are observations that deviate markedly from the rest of the data.
- They can be data errors (typos, failed sensor reads, bad merges) or legitimate rare events (fraud, spikes, black swan events).
- Left unchecked, outliers can skew means, break models, and ruin cross-validation.
"Outliers are like that one friend who shows up to every party in a dinosaur costume — sometimes hilarious, sometimes a disaster. Know when to laugh and when to ask them to leave."
Quick workflow: Detect → Diagnose → Decide → Document
- Detect candidates using stats and visuals
- Diagnose whether they're errors, rare-but-real, or meaningful signals
- Decide to keep, transform, cap, or impute
- Document every change and create flags/features
This is not just cleanup — it's feature engineering. A flag like is_outlier can be predictive.
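The Detect → Diagnose → Decide → Document loop can be sketched as a small reusable function (a minimal sketch — `detect_and_flag_iqr` is a hypothetical helper name, and the IQR rule here is just one possible detector):

```python
import pandas as pd

def detect_and_flag_iqr(df, col, k=1.5):
    """Detect IQR outliers in `col`, add a flag column, and return the
    rule that was applied, so the pipeline leaves an audit trail."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    df[f"{col}_is_outlier"] = (df[col] < lower) | (df[col] > upper)
    # Document the decision alongside the data change
    rule = {"column": col, "method": "IQR", "k": k,
            "lower": lower, "upper": upper}
    return df, rule

df = pd.DataFrame({"x": [1, 2, 2, 3, 3, 3, 100]})
df, rule = detect_and_flag_iqr(df, "x")
```

Returning the rule dict alongside the modified frame makes the "Document" step automatic instead of an afterthought.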
Tools & techniques (with pandas code snippets)
We'll assume df is your pandas DataFrame and x is the numeric column.
1) Visual inspection
- Boxplots, histograms, scatterplots, and time-series plots (use rolling ops for context)
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(df['x'])
plt.show()
df['x'].hist(bins=50)
plt.show()
# For time series context (you used rolling ops earlier):
df['x'].rolling(window=24).mean().plot()
df['x'].plot(alpha=0.3)
plt.show()
2) IQR method (simple and robust)
- Good for skewed data; common rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
Q1 = df['x'].quantile(0.25)
Q3 = df['x'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers_iqr = df[(df['x'] < lower) | (df['x'] > upper)]
Micro explanation: IQR ignores extreme tails and is less sensitive to a few large anomalies than the mean/std.
3) Z-score and robust z-score
- Z-score = (x - mean)/std. Works for roughly normal distributions, but fragile to extreme values.
- Robust z-score uses median and MAD (median absolute deviation).
import numpy as np
# classic z-score
df['z'] = (df['x'] - df['x'].mean()) / df['x'].std()
# robust z-score using MAD
mad = (np.abs(df['x'] - df['x'].median())).median()
df['robust_z'] = 0.6745 * (df['x'] - df['x'].median()) / mad
outliers_z = df[df['z'].abs() > 3]
outliers_robust = df[df['robust_z'].abs() > 3.5]
(Why 0.6745? It scales MAD to be comparable to standard deviation for normal data.)
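You can verify that scaling empirically: for normally distributed data, MAD divided by 0.6745 (equivalently, multiplied by about 1.4826) recovers the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=2.0, size=100_000)

# MAD: median absolute deviation from the median
mad = np.median(np.abs(x - np.median(x)))

# For normal data, MAD / 0.6745 is approximately the standard deviation
print(mad / 0.6745)  # close to 2.0
print(x.std())       # close to 2.0
```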
4) Time-series specific: rolling z-score
- An observation can be normal globally but abnormal locally. Use rolling mean/std to spot local anomalies.
rolling_mean = df['x'].rolling(window=48, center=True).mean()
rolling_std = df['x'].rolling(window=48, center=True).std()
df['rolling_z'] = (df['x'] - rolling_mean) / rolling_std
local_outliers = df[df['rolling_z'].abs() > 3]
This is where your previous work with window and rolling ops pays off.
5) Categorical & string outliers
- Use pandas string methods and regex to detect weird categories or malformed entries (reference to your string/regex work).
# detect unusual email formats or stray characters
bad_emails = df[~df['email'].str.match(r"^[\w.-]+@[\w.-]+\.\w+$", na=False)]  # na=False also flags missing emails
# detect unseen categories
value_counts = df['category'].value_counts()
rare = value_counts[value_counts < 5].index
df['category_is_rare'] = df['category'].isin(rare)
Ways to handle outliers (with pros/cons)
- Remove: drop rows. Simple but risky — may remove meaningful rare events.
- Cap/Winsorize: set extreme values to percentile limits (e.g., 1st/99th). Preserves rank & sample size.
- Transform: log, Box-Cox, Yeo-Johnson — make distribution less skewed for models relying on normality.
- Impute: replace with median/neighbor values if value is clearly erroneous.
- Flag as feature: add is_outlier or outlier_distance columns to retain the information.
Examples:
# Winsorize via percentiles
lower, upper = df['x'].quantile([0.01, 0.99])
df['x_wins'] = df['x'].clip(lower, upper)
# Log transform (only positive values; requires numpy)
import numpy as np
df['x_log'] = np.log1p(df['x'].clip(lower=0))
# Flagging
df['x_is_outlier'] = ((df['x'] < lower) | (df['x'] > upper)).astype(int)
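The Box-Cox and Yeo-Johnson transforms mentioned above can be applied with scikit-learn's PowerTransformer (a sketch assuming scikit-learn is installed; the toy array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([0.5, 1.0, 2.0, 3.0, 50.0]).reshape(-1, 1)

# Yeo-Johnson handles zero and negative values; Box-Cox
# (method="box-cox") requires strictly positive inputs.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt.fit_transform(x)
```

With standardize=True the output is also zero-mean, unit-variance, which is convenient for scale-sensitive models.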
Decision guide: What should you do?
- If value is a clear data error (typo, sensor glitch) — fix or remove.
- If value is a rare but valid event and relevant to the problem (fraud detection, spikes) — keep + flag.
- If modeling requires normality or is sensitive to scale — transform or use robust models (e.g., tree-based models tolerate outliers).
- For production pipelines, prefer deterministic, documented rules (percentile caps, flag creation) rather than one-off manual edits.
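A deterministic, documented rule for production might look like this fit/transform pattern — bounds are learned once on training data and then applied unchanged to every later batch (a minimal sketch; `PercentileCapper` is a hypothetical class name):

```python
import pandas as pd

class PercentileCapper:
    """Fit winsorizing bounds once (e.g. on training data), then
    apply the same deterministic rule to any later batch."""

    def __init__(self, lower_q=0.01, upper_q=0.99):
        self.lower_q, self.upper_q = lower_q, upper_q

    def fit(self, s: pd.Series):
        # Learn the cap values from the reference data
        self.lower_, self.upper_ = s.quantile([self.lower_q, self.upper_q])
        return self

    def transform(self, s: pd.Series) -> pd.Series:
        # Apply the stored bounds; never recompute on new data
        return s.clip(self.lower_, self.upper_)

train = pd.Series(range(100))
capper = PercentileCapper().fit(train)
capped = capper.transform(pd.Series([-50, 10, 500]))
```

Recomputing percentiles on each incoming batch would make the rule drift silently; freezing the fitted bounds keeps behavior reproducible and auditable.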
Feature engineering ideas from outliers
- is_outlier: boolean flag per feature
- outlier_count: number of features on which a row is extreme
- outlier_distance: how many MADs or IQRs away the value sits
- winsorized_feature: capped copy to stabilize model inputs
These often improve predictive power — the fact an observation is extreme can be predictive on its own.
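Those features can be built in a few lines (a sketch using the MAD-based distance from earlier; `mad_distance` is a hypothetical helper name):

```python
import pandas as pd

def mad_distance(s: pd.Series) -> pd.Series:
    """How many MADs each value sits from the column median."""
    med = s.median()
    mad = (s - med).abs().median()
    # Guard against a zero MAD (constant-ish columns)
    return (s - med).abs() / mad if mad else pd.Series(0.0, index=s.index)

df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 9, 10, -80]})
num_cols = ["a", "b"]
for col in num_cols:
    df[f"{col}_dist"] = mad_distance(df[col])
    df[f"{col}_is_outlier"] = df[f"{col}_dist"] > 3.5

# Number of features on which each row is extreme
df["outlier_count"] = df[[f"{c}_is_outlier" for c in num_cols]].sum(axis=1)
```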
Final checklist before you ship
- Keep a copy of raw data.
- Log all rules used to detect/handle outliers.
- Validate model performance with and without outlier handling.
- Use visual checks (boxplots, residual plots) after transformations.
"If you can explain why you removed an observation in plain English, you probably made the right choice."
Key takeaways
- Outliers can be errors or signals — treat them thoughtfully.
- Use IQR, z-score (or robust z), and rolling statistics for detection.
- Handle with remove/cap/transform/flag — and always document.
- Outlier handling is both cleaning and feature engineering — use it to your advantage.
Now go flex those pandas muscles: combine your SQLAlchemy-powered ingestion, your regex string-cleaning, and your rolling-window intuition to build pipelines that detect, explain, and use outliers instead of letting them sabotage your models.