Data Cleaning and Feature Engineering
Prepare high-quality datasets and craft informative features using robust, repeatable pipelines.
Outlier Detection — The Outlaw Roundup (Data Cleaning & Feature Engineering)
"An outlier is just a datapoint that refused to play nice. Your job: decide if it’s a genius, a liar, or a sensor that needs to be benched." — Your friendly neighborhood TA
Why this matters (fast, not boring)
You already learned how to wrangle arrays and tables with NumPy and Pandas, and made some plots with Seaborn (remember that glorious boxplot from the Seaborn Quickstart?). You also assessed data quality earlier — completeness, consistency, validity. Now we ask: what about values that are technically valid but wildly unrepresentative? Those are outliers. They’ll skew means, blow up standard deviations, mislead models, and turn your metrics into drama.
This lesson builds on those skills: use your Pandas chops + Seaborn visuals + a couple of ML tricks to detect, understand, and handle outliers — not like a blunt axe, but like a selective bouncer.
The concept in one snappy paragraph
An outlier is an observation that differs markedly from other observations. Outliers can be:
- Errors (typos, sensor failures),
- Rare but real events (fraud, anomalies), or
- Legitimate extreme values that are meaningful for modeling.
Detecting them is not just about removing weird rows; it's about deciding what they are, why they exist, and how to treat them for downstream tasks.
Quick taxonomy (so you can speak like an informed barista)
| Type | Univariate or Multivariate | Typical methods | Good for |
|---|---|---|---|
| Simple extremes | Univariate | IQR, z-score, boxplots | Quick checks on one feature |
| Skewed distributions | Univariate | Transformations (log, Box-Cox), robust stats | Features with long tails |
| Multivariate anomalies | Multivariate | Isolation Forest, LocalOutlierFactor, DBSCAN | Interactions between features |
| Influential points | Regression context | Leverage, Cook's distance | Points that disproportionately shift model parameters |
Tell-tale signs & visual diagnostics
- Boxplot (Seaborn): instant party for univariate outliers.
- Scatterplot / pairplot: shows multivariate weirdos.
- Mahalanobis distance: flags rows that sit far from the multivariate center, taking feature covariance into account.
Code snippet (Pandas + Seaborn quick reminder; assumes `df` is your DataFrame):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Univariate view: the box whiskers make extremes pop out
sns.boxplot(x='feature', data=df)
plt.show()

# Bivariate view: multivariate outliers hide in the joint distribution
sns.scatterplot(x='feature1', y='feature2', data=df)
plt.show()
```
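The Mahalanobis distance mentioned above can be computed directly with NumPy. A minimal sketch on synthetic data (the column names and the planted outlier are illustrative, not from the course dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["feature1", "feature2"])
df.loc[0] = [8.0, -8.0]  # plant an obvious multivariate outlier

X = df.to_numpy()
center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
# Squared Mahalanobis distance of every row from the center, vectorized
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
df["mahalanobis"] = np.sqrt(d2)
```

A common cutoff is the square root of a chi-squared quantile with degrees of freedom equal to the number of features.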
Ask yourself: does that lone point look like a measurement error, or is it the butterfly causing a storm?
Classical univariate methods (fast rules)
IQR method (robust):
- Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 - Q1
- Typical rule: flag values < Q1 - 1.5·IQR or > Q3 + 1.5·IQR
- Great because it doesn’t assume your distribution is normal.
z-score (mean/std):
- z = (x - mean)/std, typical cutoff |z| > 3
- Sensitive to the very outliers you’re trying to detect (not robust).
Winsorization & trimming:
- Winsorize: clamp extremes to a percentile (e.g., 1st and 99th)
- Trim: remove top/bottom x% (dangerous if you don’t inspect first)
Code (IQR detection in pandas):

```python
Q1 = df['feature'].quantile(0.25)
Q3 = df['feature'].quantile(0.75)
IQR = Q3 - Q1

# Flag anything outside the 1.5 * IQR fences
mask_outlier = (df['feature'] < (Q1 - 1.5 * IQR)) | (df['feature'] > (Q3 + 1.5 * IQR))
outliers = df[mask_outlier]
```
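For comparison, here is a sketch of the z-score rule and winsorization on synthetic data (the planted extreme value is purely illustrative). Notice that the outlier inflates the very mean and std used to flag it, which is exactly why the z-score isn’t robust:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(50, 5, size=1000))
s.iloc[0] = 500  # one wild value

# z-score flagging: the outlier itself inflates mean and std
z = (s - s.mean()) / s.std()
z_flagged = s[z.abs() > 3]

# Winsorize: clamp to the 1st and 99th percentiles instead of dropping
lo, hi = s.quantile(0.01), s.quantile(0.99)
s_winsorized = s.clip(lower=lo, upper=hi)
```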
Multivariate outliers — because features conspire
Sometimes every single feature looks fine on its own, but the combination is weird. Think: a height and a weight that are each plausible alone but impossible together.
- Isolation Forest (tree-based anomaly score): good general-purpose detector; works on tabular numeric data.
- Local Outlier Factor (LOF): finds points with low local density.
- DBSCAN: density-based clustering that also yields noise points.
Example using sklearn's IsolationForest:

```python
from sklearn.ensemble import IsolationForest

# contamination = expected fraction of outliers; tune it to your data
iso = IsolationForest(contamination=0.01, random_state=42)
df_numeric = df.select_dtypes(include='number').fillna(0)
labels = iso.fit_predict(df_numeric)  # -1 = outlier, 1 = inlier
outliers = df[labels == -1]
```
Tip: scale features (RobustScaler or StandardScaler) before these algorithms.
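As a sketch of that tip, here is RobustScaler feeding Local Outlier Factor on synthetic data (the isolated point is planted for illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
X[0] = [10, 10]  # one point far from the dense cluster

# RobustScaler centers on the median and scales by the IQR,
# so the outlier doesn't distort the scaling itself
X_scaled = RobustScaler().fit_transform(X)

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_scaled)  # -1 = outlier, 1 = inlier
```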
Influence vs Outlier — subtle but critical
- An outlier is extreme in the feature space.
- An influential point drastically changes a model’s parameters — not necessarily super extreme in raw value.
In regression, use leverage and Cook’s distance to find points that disproportionately affect the fitted line. If your linear model jumps when you drop a point, that point is influential.
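A minimal sketch of leverage and Cook’s distance computed by hand with NumPy (statsmodels provides the same diagnostics via its influence tools; the planted point here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=50)
y = 2 * x + rng.normal(scale=1.0, size=50)
x[0], y[0] = 9.5, -20.0  # high-leverage point fighting the overall trend

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix; diagonal = leverage
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

p = X.shape[1]
mse = resid @ resid / (len(y) - p)
leverage = np.diag(H)
# Cook's distance: big residual AND big leverage => big influence
cooks_d = (resid**2 / (p * mse)) * leverage / (1 - leverage) ** 2
influential = np.where(cooks_d > 4 / len(y))[0]  # common rule of thumb
```

Refit without row 0 and watch the slope jump back toward 2 — that jump is what "influential" means.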
Practical workflow (a reproducible checklist)
- Visualize: boxplots, scatterplots, pairplots (Seaborn). Ask questions.
- Detect univariate extremes (IQR for robustness). Flag candidates.
- Use multivariate detectors for interaction anomalies (IsolationForest, LOF).
- For modeling, check influence (Cook’s distance) if doing regression.
- Decide action: keep, transform, winsorize, or remove — document it.
Pseudocode:

```text
for feature in numeric_features:
    visualize(feature)
    flag_univariate_outliers(feature)

run_multivariate_detector(numeric_features)

for candidate in flagged_points:
    inspect_raw_data(candidate)
    if error:       fix or drop
    elif rare_event: keep or label
    else:            transform or winsorize
```
Ask: "If I remove this point, does my model still generalize?" That’s the real test.
Short examples of handling strategies
- Transform skewed money amounts: log(x + 1)
- If sensor error -> impute or drop
- Fraud detection -> keep and label as positive cases
- For tree-based models, outliers often matter less; for linear/regression, they matter a lot
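The log(x + 1) transform in one line, on made-up amounts with a long right tail:

```python
import numpy as np
import pandas as pd

amounts = pd.Series([1, 5, 10, 50, 100, 10_000])  # illustrative values
logged = np.log1p(amounts)  # log(x + 1); compresses the extreme value
```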
Final mic drop — practical rules of thumb
- Always visualize first. Numbers without pictures are suspicious.
- Use robust detectors (IQR, RobustScaler) when in doubt.
- Don’t auto-delete. Document every change.
- Separate anomaly detection tasks (you want outliers) from cleaning for modeling tasks (you may want to remove them).
If you treat outliers like weeds, you might weed out a rare flower. Inspect before pulling.
Key takeaways
- Outliers can be errors, rare events, or meaningful extremes.
- Start with visualization (Seaborn), then use robust statistics (IQR) and multivariate tools (IsolationForest, LOF).
- For models, check influence; treat features appropriately (transform, winsorize, label, or remove).
- Keep experiments reproducible: record which rows were flagged and your rationale.
Go forth and tame your data — but do it like a thoughtful scientist, not a tempestuous janitor.
Version notes: this lesson assumes you’re comfortable selecting numeric columns in Pandas, plotting with Seaborn, and fitting basic sklearn models as covered earlier in the course.