Handling Real-World Data Issues
Tackle noise, drift, imbalance, and other practical dataset challenges in production-like settings.
High-Cardinality Categorical Features — The Party With 10,000 Guests and No Name Tags
“Categorical variables are like people at a networking event. High-cardinality categories are the guest list that never ends — and nobody is wearing a name tag.”
You’ve already wrestled with trees and ensembles (they’re great at hiding complexity in plain sight) and learned to hunt down rare events and detect drift. High-cardinality categories are the bridge between those topics: they can cause overfitting like rare events do, and they can shift over time like drifting features. Let’s tame this unruly party.
Why this matters (quick recap + consequence)
- High-cardinality categorical feature = a categorical variable with many distinct values (think: user_id, product_sku, zip code, email domain, ad_id).
- Trees and ensembles tolerate categorical splits better than linear models, but when categories explode, models either memorize (overfit) or explode in memory/compute.
- This interacts with previous topics:
- Like rare events, many categories are rare (singletons). Handling them poorly lets the model learn nonsense.
- Like drift, categories evolve — new IDs appear in production. You need strategies robust to unseen values.
The toolbox — options, pros/cons, and when to use them
| Method | Pros | Cons | When to use |
|---|---|---|---|
| One-hot encoding | Interpretable, simple | Extremely large dimensionality, memory blowup | Only when cardinality is small (< 30) |
| Frequency / Count encoding | Compact, captures prevalence | Loses identity information, may leak signal temporally | Good baseline; cheap and robust |
| Target encoding (mean/impact encoding) w/ smoothing & CV | Powerful signal compression | Can leak (target leakage) if not done carefully | Tabular data with enough records per category |
| K-fold / Leave-one-out target encoding | Reduces leakage vs naive target encoding | Complex; still leakage risk in small data | When target encoding is desired but leakage must be controlled |
| Feature hashing | Fixed-size representation, robust to unseen | Collisions; less interpretable | High-card features with streaming/unseen values |
| Clustering / grouping categories | Reduces cardinality; may reveal structure | Requires good clustering features/heuristics | When meta-info exists (e.g., category metadata) |
| Entity embeddings (NN/learned) | Captures similarity patterns; compact | Requires NN pipeline; harder to interpret | Large datasets; deep learning workflows |
| Native categorical handling (CatBoost, LightGBM techniques) | Model-aware encodings, less manual work | Model-specific; may still overfit | When using those boosted-tree libraries |
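As a concrete baseline from the table above, count encoding takes one `value_counts` and a `map`. A minimal pandas sketch — the `domain` column and its values are hypothetical:

```python
import pandas as pd

# Toy data; `domain` stands in for any high-cardinality column.
df = pd.DataFrame({"domain": ["gmail", "gmail", "yahoo", "proton", "gmail", "yahoo"]})

# Count encoding: replace each category with its training-set frequency.
counts = df["domain"].value_counts()
df["domain_count"] = df["domain"].map(counts)

# Unseen categories at inference time fall back to 0 (never observed in training).
new = pd.Series(["gmail", "unknown-host"])
encoded = new.map(counts).fillna(0).astype(int)
print(encoded.tolist())  # gmail was seen 3 times; unknown-host never
```

Identity information is lost (every category seen twice looks identical), which is exactly the trade-off the table notes.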
Deep dive: Target encoding — the unicorn many fear (do it right)
Target encoding replaces a category with a statistic of the target (e.g., the mean target for that category). The naive approach leaks: if you compute the mean across the full training set, the model sees the label indirectly.
How to do it safely:
- K-fold scheme: For each fold, compute category means on the other folds and apply to the held-out fold.
- Smoothing / Bayesian shrinkage: Blend the category mean with the global mean based on category frequency.
- Add noise (for regression) or regularization (for classification) to reduce overfitting.
- Maintain mapping and fallback for unseen categories at test time (global mean or prior-adjusted mean).
A runnable sketch of K-fold target encoding (assumes a pandas DataFrame `df` with columns named by `cat_col` and `target`, and a smoothing parameter `m`):

```python
from sklearn.model_selection import KFold

prior = df[target].mean()  # global mean: shrinkage prior and fallback for unseen categories
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    tr = df.iloc[train_idx]
    means = tr.groupby(cat_col)[target].mean()  # per-category means from the other folds only
    counts = tr.groupby(cat_col).size()
    smooth = (counts * means + m * prior) / (counts + m)  # m is the smoothing param
    df.loc[df.index[val_idx], f'{cat_col}_te'] = df.iloc[val_idx][cat_col].map(smooth).fillna(prior)
```
Notes: choose m to control shrinkage; larger m means stronger pull to the prior (global mean).
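To see what that shrinkage does in practice, here is the same formula as a tiny function with made-up numbers (counts, means, and prior below are purely illustrative):

```python
def smooth(count, cat_mean, prior, m):
    """Bayesian shrinkage: blend the category mean with the global prior by count."""
    return (count * cat_mean + m * prior) / (count + m)

# A singleton category with target mean 1.0 vs a global prior of 0.1:
print(smooth(1, 1.0, 0.1, m=10))    # heavily shrunk toward the prior (~0.18)
print(smooth(500, 1.0, 0.1, m=10))  # a large category keeps nearly its own mean
```

With `m=10`, one observation barely moves the estimate off the prior; five hundred observations dominate it. That is the knob the note above describes.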
Feature hashing — “random hashing” for speed & simplicity
Feature hashing hashes category values into a fixed number of buckets (say 2^16). It’s memory-efficient and naturally handles unseen values. Collisions happen — sometimes that's fine, sometimes it’s not.
When to hash:
- Streaming data or massive cardinalities (millions).
- Fast baselines where interpretability is secondary.
Be careful: tuning the number of buckets is crucial. Too few → harmful collisions; too many → memory back to square one.
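One way to sketch the bucketing step, using an MD5-based hash for determinism across processes (Python's built-in `hash` is salted per run, so it is unsuitable for feature hashing):

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 2**16) -> int:
    """Map a category string to a fixed bucket; unseen values need no special case."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

# Every value, seen or unseen, lands in [0, n_buckets); collisions are possible.
bucket = hash_bucket("user_12345")
assert bucket == hash_bucket("user_12345")  # deterministic across runs and machines
```

`n_buckets` is the tuning knob from the warning above: shrink it and collisions climb; grow it and you are back to a wide representation.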
Entity embeddings — the neural aristocrats
Trainable embeddings map categories to dense vectors learned to minimize your loss. Great when categories have latent relationships (e.g., products similar by behavior). Use when:
- You have large corpora of examples per category.
- You’re already comfortable with neural nets.
Embeddings can be fed into tree models too (learn embedding in NN, export vector, then use as numeric features in XGBoost/LightGBM).
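A minimal sketch of that export-then-reuse pattern; the embedding matrix below is hypothetical, standing in for vectors actually learned by a network:

```python
import numpy as np

# Hypothetical 4-dim embeddings for 3 products, as if exported from a trained NN.
embedding = np.array([
    [0.1, -0.2, 0.3,  0.0],   # product 0
    [0.0,  0.5, -0.1, 0.2],   # product 1
    [0.4,  0.1, 0.0, -0.3],   # product 2
])
default = embedding.mean(axis=0)  # fallback vector for unseen products
index = {"sku_a": 0, "sku_b": 1, "sku_c": 2}

def embed(sku: str) -> np.ndarray:
    """Look up the learned vector; unseen SKUs get the mean vector."""
    i = index.get(sku)
    return embedding[i] if i is not None else default

# These dense columns can then feed XGBoost/LightGBM as ordinary numeric features.
features = np.stack([embed(s) for s in ["sku_a", "sku_new"]])
```

The mean-vector fallback is one reasonable default; a dedicated "unknown" row trained with dropout-style category masking is another.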
Tree-based models: don’t assume they fix everything
Trees do splits on categories, and ensembles help, but:
- A tree can still overfit to rare categories (pure leaves with one category).
- CatBoost and LightGBM have clever native treatments (CatBoost uses ordered target statistics to reduce leakage). If using these libraries, prefer their built-in approaches.
Remember: ensembles + naive target encoding = express overfitting superpower. Use CV-based encoding and regularization.
Practical checklist — what to try, in order
- Baseline: frequency/count encoding + model that handles numerics.
- If signal is weak, try grouping rare categories into 'OTHER' or by domain rules.
- Target encoding with K-fold + smoothing (careful with leakage). Validate with a holdout and time-based splits if data is temporal.
- Feature hashing for scale/streaming needs.
- Entity embeddings if you have deep-learning capacity and lots of data.
- Try model-native strategies (CatBoost/LightGBM) and compare.
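The "group rare categories into 'OTHER'" step from the checklist takes a few lines of pandas (the `min_count` threshold and data are illustrative):

```python
import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "c", "d"])

# Group categories seen fewer than min_count times into a shared 'OTHER' bucket.
min_count = 2
counts = s.value_counts()
keep = counts[counts >= min_count].index
grouped = s.where(s.isin(keep), "OTHER")
print(grouped.tolist())  # ['a', 'a', 'a', 'b', 'b', 'OTHER', 'OTHER']
```

Fit `keep` on training data only and reuse it at inference time, so new categories in production also fall into 'OTHER' rather than crashing the pipeline.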
Special notes: temporal data & drift (you saw this coming)
- If categories evolve (new users/products), evaluate on future data. Do not compute encodings using later data — that’s classic leakage.
- For drift detection, monitor category distributions and encoding statistics (means/counts). If they shift, retrain or adapt encodings (online updates, adaptive smoothing).
- Rare categories and positive-unlabeled settings: treat singletons carefully—maybe group them or use strong shrinkage. Rare categories are like rare events: they carry noise and potentially spurious signal.
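One simple way to monitor category-distribution shift, as suggested above, is the total variation distance between training and production frequency distributions (a sketch; the alert threshold is your call):

```python
from collections import Counter

def total_variation(train_cats, prod_cats):
    """Half the L1 distance between two category frequency distributions.
    0 means identical; 1 means completely disjoint support."""
    p, q = Counter(train_cats), Counter(prod_cats)
    n_p, n_q = sum(p.values()), sum(q.values())
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)

train = ["a"] * 8 + ["b"] * 2
prod = ["a"] * 5 + ["b"] * 2 + ["c"] * 3  # a new category 'c' appears in production
print(total_variation(train, prod))
```

New categories contribute their full production mass to the distance, so a burst of fresh IDs shows up immediately.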
Quick guide: handling unseen categories in production
- Always provide a fallback encoding: global mean, frequency bin, or hashed bucket.
- For embeddings/hashing, unseen values map to a default vector or deterministic hash.
- Log and monitor frequency of "unknown"s — ramping unknowns = drift.
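A minimal sketch of the fallback rule: apply the stored encoding, defaulting to the global mean for unknowns (the mapping and values below are hypothetical):

```python
# Hypothetical target-encoding map stored at training time.
te_map = {"gmail": 0.42, "yahoo": 0.31}
global_mean = 0.35  # the prior, reused as the unseen-category fallback

def encode(category: str) -> float:
    """Apply a stored target encoding with a global-mean fallback."""
    if category not in te_map:
        # In production you would also increment an 'unknown' counter here,
        # feeding the drift monitoring described above.
        return global_mean
    return te_map[category]

print([encode(c) for c in ["gmail", "brand-new.example"]])  # [0.42, 0.35]
```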
Closing — TL;DR + a tiny rant
- High-cardinality categorical features are powerful but toxic if untreated. Treat them like gossip: verify before you repeat it.
- Start simple (counts), graduate to target encoding with K-fold & smoothing, and scale with hashing or embeddings as needed.
- Remember the lessons from rare events and drift: respect scarcity, avoid leakage, and watch for change.
Final thought: a model that memorizes category IDs at training may be very proud — until it meets the wild, brutal reality of production data. Build features that generalize, not trophies for your train set.
Version note: this builds on the previous module about tree-based models (use model-native categorical tools when available) and the sections on rare events and drift (apply shrinkage and monitor for category shift).