Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Outlier Detection and Treatment — The Part of Cleaning That Separates "Oops" from "Aha"
"Outliers are the data points that walk into the party wearing a cape and yelling: ‘I am important!’ — sometimes they’re right, sometimes they’re just very drunk."
You're coming in hot from: Data Types & Tidy Structure (so your columns are sane) and Handling Missing Values (so there aren't mysterious NaNs hiding in the bushes). You also know the core goals of supervised learning (bias/variance, generalization). Great — that means we can skip the slow-mo basics and get to the fun: triaging misbehaving records before your model learns the wrong things.
Why outliers matter (especially in regression & classification)
- Regression: A few extreme y-values or x-values can drag OLS estimates like a lead anchor — inflated coefficients, busted residual assumptions, and crazy prediction intervals. Leverage + influence = disasters on test data.
- Classification: Rare but extreme samples can skew decision boundaries, confuse distance metrics, and ruin metrics if those extremes are actually label noise or attack points.
Big-picture: Outliers affect model assumptions, training stability, metric interpretation, and sometimes they are the signal you actually want (e.g., fraud detection). So treat them with context, not with ideology.
Types of outliers — know thy enemy
- Global (point) outliers: A single record far from the rest in feature space.
- Contextual (conditional) outliers: Normal in one context, anomalous in another (e.g., temp=30°C is normal in summer but weird in winter).
- Collective outliers: A group of points that is anomalous together (e.g., a sudden sensor drift).
Also: Univariate vs Multivariate — a value might be normal on one axis but bizarre in combination with other features.
Quick detection toolbox (from simple to fancy)
Univariate (one column at a time)
- Visual: Boxplots, histograms, violin plots
- Rules: IQR method (Tukey), z-score or robust z-score (MAD)
Multivariate / model-based
- Distance-based: Mahalanobis distance
- Density / neighborhood: Local Outlier Factor (LOF)
- Tree / ensemble: Isolation Forest
- Clustering: DBSCAN (finds points not in dense clusters)
- Influence diagnostics (for regression): Cook's distance, leverage (hat matrix)
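For intuition, the Mahalanobis distance from the list above can be computed directly with NumPy; a minimal sketch on synthetic data (the planted point comes back as the most extreme):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [8.0, 8.0, 8.0]  # plant an obvious multivariate outlier

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
# squared Mahalanobis distance of each row from the sample mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
```

Note the caveat from the table: the mean and covariance here are estimated from data that includes the outlier, so for heavily contaminated data a robust covariance estimator is preferable.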
Quick reference table
| Method | Use case | Pros | Cons |
|---|---|---|---|
| IQR / Tukey | Univariate numeric | Simple, interpretable | Misses multivariate anomalies |
| Z-score / MAD | Univariate | Simple; MAD variant robust to skew | Plain z-score's mean/std are themselves pulled by outliers |
| Mahalanobis | Multivariate | Accounts for covariance | Requires invertible covariance, sensitive to outliers |
| LOF | Multivariate | Detects local density anomalies | Needs tuning k; O(n log n) or worse |
| Isolation Forest | Multivariate | Fast, scalable, few assumptions | Randomness; needs tuning |
| Cook's distance | Regression influence | Targets influential points on fit | Only for regression, needs model fit |
Rules of thumb + concrete methods
1) Visual first, then quantify
- Make a boxplot for every numeric column and a scatter matrix for suspicious pairs.
- Ask: Does the point look like an error, a rare-but-important event, or a legitimate extreme? If you can answer this, you’re halfway there.
2) Univariate detection (pandas + classic statistics)
- IQR rule: flag x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR
- Z-score: |(x - mean)/std| > 3
- Robust z-score (using MAD) when distributions are skewed
Code snippet (pandas):

```python
# IQR outliers
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['col'] < Q1 - 1.5 * IQR) | (df['col'] > Q3 + 1.5 * IQR)]
```
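For skewed columns, the robust z-score mentioned above swaps mean/std for median/MAD; a minimal sketch on a toy series (3.5 is a commonly used cutoff, not a law of nature):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 120])  # 120 is the suspect

median = s.median()
mad = (s - median).abs().median()
# 0.6745 scales MAD so the score is comparable to a standard z-score
# under normality
robust_z = 0.6745 * (s - median) / mad
outliers = s[robust_z.abs() > 3.5]
```

Because median and MAD ignore the extremes, the 120 can't hide itself by inflating the spread the way it would with mean/std.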
3) Multivariate detection (sklearn)
- IsolationForest and LocalOutlierFactor are your friends for mixed-feature anomalies.
Python example:

```python
from sklearn.ensemble import IsolationForest

clf = IsolationForest(contamination=0.01, random_state=0)
outlier_labels = clf.fit_predict(X)  # -1 = outlier, 1 = normal
```
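LocalOutlierFactor has the same fit_predict interface but scores points by local density; a sketch on synthetic 2-D data with one planted outlier (n_neighbors is the tuning knob, and 20 here is just an illustrative choice):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# a dense Gaussian cluster plus one far-away point
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
```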
4) Influence in regression
- Fit your regression, compute Cook's distance. Points with high Cook's distance can unduly change coefficients. Consider inspecting and possibly removing or reweighting.
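Cook's distance can be computed from scratch via the hat matrix; a sketch on synthetic data with one corrupted observation (in practice, statsmodels' influence diagnostics do this for you):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(scale=1.0, size=30)
y[0] += 25  # corrupt one observation

X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H)                                  # leverage of each point
p = X.shape[1]
mse = resid @ resid / (len(y) - p)
# Cook's D: residual size scaled by leverage
cooks_d = resid**2 / (p * mse) * h / (1 - h)**2
```

A common rule of thumb flags points with D > 4/n for inspection; here the corrupted row dwarfs everything else.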
5) When to keep vs change vs remove
- Keep: If the outlier is a truthful, rare example you want the model to learn (e.g., fraud).
- Treat (transform/robustify): If the outliers distort your model but are legitimate (e.g., heavy right skew). Try log/sqrt transforms, winsorizing, or robust models (RANSAC, HuberRegressor).
- Remove: If the point is clearly erroneous (measurement error, data-entry error) and you can't correct it. Document every removal.
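The "treat with robust models" option can be sketched by comparing scikit-learn's HuberRegressor against plain OLS on synthetic data with a few corrupted high-leverage targets (data and numbers are illustrative only):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(50, 1)), axis=0)
y = 3 * X.ravel() + rng.normal(scale=0.5, size=50)
y[-3:] += 40  # corrupt a few high-x targets, dragging the OLS slope up

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)  # down-weights large residuals
```

Huber loss treats small residuals quadratically and large ones linearly, so the corrupted points lose their grip on the slope without being deleted.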
Treatment options (with consequences)
- Remove / drop: Simple, but risks throwing away real signal. Always log row IDs removed.
- Cap / Winsorize: Replace extreme values with a percentile (e.g., 1st/99th). Less destructive than deletion.
- Transform: Log, Box-Cox, Yeo-Johnson — reduces skew and impact of extremes.
- Flag and keep: Add an "is_outlier" boolean feature so models can learn special handling.
- Use robust algorithms: Tree-based models, robust regressors, or nonparametric learners less sensitive to outliers.
- Impute / correct: If the outlier is a typo (e.g., salary 10,000,000 instead of 100,000), fix it from source or domain rules.
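Winsorizing and flag-and-keep from the list above combine naturally in pandas; a minimal sketch on a synthetic salary column with one planted data-entry error:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
salaries = rng.normal(60_000, 8_000, size=200)
salaries[0] = 10_000_000  # likely a data-entry error
df = pd.DataFrame({'salary': salaries})

# cap extremes at the 1st/99th percentiles (winsorizing)
lo, hi = df['salary'].quantile([0.01, 0.99])
df['salary_wins'] = df['salary'].clip(lower=lo, upper=hi)
# keep a flag so the model can still see that something was extreme
df['is_outlier'] = df['salary'].ne(df['salary_wins'])
```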
Workflow checklist (practical playbook)
- Ensure data types are correct (recall: from Data Types & Tidy Structure). Strings masquerading as numbers break everything.
- Handle missing values before outlier detection? Usually yes — but be careful: imputing with mean can hide real outliers.
- Visualize distributions and relationships (boxplots, scatter, pairplots).
- Run univariate checks (IQR, MAD) and multivariate methods (IsolationForest/LOF).
- Investigate flagged points with domain knowledge — talk to an SME if you can.
- Decide: keep, transform, impute, cap, or remove. Document reasons.
- Re-run model diagnostics (residuals, Cook's distance, validation metrics). Compare performance with and without treatments.
A couple of illustrative examples
- Housing prices: A $10M mansion among $200k homes is likely real (keep), but a price of $1 might be a data error (fix/drop). For regression: robust regression or log(price) can help.
- Sensor data: If a temperature sensor suddenly outputs 9999, that’s a sensor fault — correct/drop. If it gradually drifts, that’s a collective outlier (needs time-series specific handling).
- Fraud detection: Outliers are the target, not the enemy. You will treat them as positive examples with specialized models rather than removing them.
Closing — the meta-rule
Outlier treatment is less about picking the perfect algorithm and more about contextual triage. Ask: Is this point a mistake, or is it the story I'm trying to hear? When in doubt, flag it and model both ways: one pipeline that keeps outliers as-is, another that tames them. Compare validation performance and be accountable: keep a log of every transformation.
Key takeaways:
- Visualize first; quantify second.
- Use simple rules for quick wins, model-based methods for complex patterns.
- Never blindly delete — document and justify.
- Sometimes outliers are your gold (fraud) — treat accordingly.
Now: go run a boxplot and find the drama in your dataset. If you bring me a scatter plot with a lone point at the edge, I will not only sigh; I will demand a story.
"Outliers are the data's way of throwing confetti at you — celebrate them when they matter, sweep them up when they’re just garbage."