© 2026 jypi. All rights reserved.

Supervised Machine Learning: Regression and Classification

Handling Real-World Data Issues


Tackle noise, drift, imbalance, and other practical dataset challenges in production-like settings.


High Cardinality Categorical Features

High-Cardinality, No Chill

High-Cardinality Categorical Features — The Party With 10,000 Guests and No Name Tags

“Categorical variables are like people at a networking event. High-cardinality categories are the angry guest list that never ends.”

You’ve already wrestled with trees and ensembles (they’re great at hiding complexity in plain sight) and learned to hunt down rare events and detect drift. High-cardinality categories are the bridge between those topics: they can cause overfitting like rare events do, and they can shift over time like drifting features. Let’s tame this unruly party.


Why this matters (quick recap + consequence)

  • High-cardinality categorical feature = a categorical variable with many distinct values (think: user_id, product_sku, zip code, email domain, ad_id).
  • Trees and ensembles tolerate categorical splits better than linear models, but when categories explode, models either memorize (overfit) or blow up in memory and compute.
  • This interacts with previous topics:
    • Like rare events, many categories are rare (singletons). Handling them poorly lets the model learn nonsense.
    • Like drift, categories evolve — new IDs appear in production. You need strategies robust to unseen values.

The toolbox — options, pros/cons, and when to use them

| Method | Pros | Cons | When to use |
| --- | --- | --- | --- |
| One-hot encoding | Interpretable, simple | Extremely large dimensionality; memory blowup | Only when cardinality is small (< ~30) |
| Frequency / count encoding | Compact; captures prevalence | Loses identity information; may leak signal temporally | Good baseline; cheap and robust |
| Target encoding (mean/impact) with smoothing and CV | Powerful signal compression | Can leak the target if not done carefully | Tabular data with enough records per category |
| K-fold / leave-one-out target encoding | Reduces leakage vs. naive target encoding | More complex; residual leakage risk in small data | When target encoding is desired but leakage must be controlled |
| Feature hashing | Fixed-size representation; robust to unseen values | Collisions; less interpretable | High-cardinality features with streaming/unseen values |
| Clustering / grouping categories | Reduces cardinality; may reveal structure | Requires good clustering features/heuristics | When meta-information exists (e.g., category metadata) |
| Entity embeddings (learned) | Capture similarity patterns; compact | Require an NN pipeline; harder to interpret | Large datasets; deep-learning workflows |
| Native categorical handling (CatBoost, LightGBM) | Model-aware encodings; less manual work | Model-specific; may still overfit | When using those boosted-tree libraries |
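As a concrete starting point, here is a minimal pandas sketch of the count/frequency baseline from the table. The `domain` column and its values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"domain": ["gmail", "gmail", "yahoo", "corp", "gmail"]})

# Count encoding: replace each category with how often it appears in train.
counts = df["domain"].value_counts()
df["domain_count"] = df["domain"].map(counts)

# Frequency encoding: the same idea, normalized to a proportion.
df["domain_freq"] = df["domain"].map(counts / len(df))
```

Note that the counts must come from the training split only; at serving time, reuse the stored `counts` mapping rather than recomputing it on new data.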

Deep dive: Target encoding — the unicorn many fear (do it right)

Target encoding replaces a category with a statistic of the target (e.g., the mean target for that category). The naive approach leaks: if you compute the mean across the full training set, the model sees the label indirectly.

How to do it safely:

  1. K-fold scheme: For each fold, compute category means on the other folds and apply to the held-out fold.
  2. Smoothing / Bayesian shrinkage: Blend the category mean with the global mean based on category frequency.
  3. Add noise (for regression) or regularization (for classification) to reduce overfitting.
  4. Maintain mapping and fallback for unseen categories at test time (global mean or prior-adjusted mean).

Example (K-fold target encoding with smoothing):

from sklearn.model_selection import KFold

prior = df[target].mean()  # global mean: shrinkage prior and fallback for unseen categories
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    # KFold yields positional indices; .loc works here assuming a default RangeIndex
    means = df.loc[train_idx].groupby(cat_col)[target].mean()
    counts = df.loc[train_idx].groupby(cat_col).size()
    smooth = (counts * means + m * prior) / (counts + m)  # m is the smoothing strength
    df.loc[val_idx, f'{cat_col}_te'] = df.loc[val_idx, cat_col].map(smooth).fillna(prior)

Notes: choose m to control shrinkage; larger m means stronger pull to the prior (global mean).


Feature hashing — “random hashing” for speed & simplicity

Feature hashing hashes category values into a fixed number of buckets (say 2^16). It’s memory-efficient and naturally handles unseen values. Collisions happen — sometimes that's fine, sometimes it’s not.

When to hash:

  • Streaming data or massive cardinalities (millions).
  • Fast baselines where interpretability is secondary.

Be careful: tuning the number of buckets is crucial. Too few → harmful collisions; too many → memory back to square one.
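The bucketing idea can be sketched from scratch with only Python's hashlib (scikit-learn's FeatureHasher implements the production version, with signed buckets to mitigate collision bias). The bucket count and example values below are illustrative:

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 2**16) -> int:
    # Deterministic hash -> fixed bucket id; unseen values need no fitting step.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Seen and unseen categories both map to a bucket in [0, n_buckets).
print(hash_bucket("user_12345"))
print(hash_bucket("never_seen_before_id"))
```

Because the mapping is a pure function of the value, it is stable across training and serving, which is exactly what makes hashing robust to new IDs.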


Entity embeddings — the neural aristocrats

Trainable embeddings map categories to dense vectors learned to minimize your loss. Great when categories have latent relationships (e.g., products similar by behavior). Use when:

  • You have large corpora of examples per category.
  • You’re already comfortable with neural nets.

Embeddings can be fed into tree models too (learn embedding in NN, export vector, then use as numeric features in XGBoost/LightGBM).
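To make "learned embeddings" concrete without a full NN framework, here is a toy NumPy sketch of one SGD step that updates an embedding table and a small linear head on a squared-error loss. All sizes, ids, and values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, dim = 1000, 8

# Trainable embedding table (one dense row per category id) plus a linear head.
emb = rng.normal(scale=0.1, size=(n_categories, dim))
w = rng.normal(scale=0.1, size=dim)

def predict(cat_ids):
    return emb[cat_ids] @ w

# One SGD step on squared error for a toy mini-batch of (category_id, target) pairs.
cat_ids = np.array([3, 7, 3])
y = np.array([1.0, 0.0, 1.0])
lr = 0.1

loss_before = ((predict(cat_ids) - y) ** 2).sum()
err = predict(cat_ids) - y
grad_w = (err[:, None] * emb[cat_ids]).sum(axis=0)  # d(loss)/d(w), up to a factor of 2
grad_emb = err[:, None] * w[None, :]                # d(loss)/d(embedding rows)
np.add.at(emb, cat_ids, -lr * grad_emb)             # scatter-add handles repeated ids
w -= lr * grad_w
loss_after = ((predict(cat_ids) - y) ** 2).sum()
```

In practice you would let a framework do the backprop; the point is that each category's row is just another parameter vector, trained end-to-end, and can afterwards be exported as numeric features for a tree model.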


Tree-based models: don’t assume they fix everything

Trees do splits on categories, and ensembles help, but:

  • A tree can still overfit to rare categories (pure leaves with one category).
  • CatBoost and LightGBM have clever native treatments (CatBoost uses ordered target statistics to reduce leakage). If using these libraries, prefer their built-in approaches.

Remember: ensembles + naive target encoding = an express lane to overfitting. Use CV-based encoding and regularization.
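To see why the combination is dangerous, consider the worst case of naive full-data target encoding: when every category is a singleton, the encoded feature is literally the label. A tiny pandas illustration with toy data:

```python
import pandas as pd

# Toy data: every category appears exactly once (singletons).
df = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u4"],
                   "y": [1, 0, 1, 0]})

# Naive target encoding over the full data: each singleton's mean IS its own label.
naive = df.groupby("user_id")["y"].mean()
df["user_id_te"] = df["user_id"].map(naive)

print((df["user_id_te"] == df["y"]).all())  # the feature leaks the target exactly
```

Any model handed this feature will look perfect in training and collapse on new IDs, which is precisely the failure the K-fold scheme above prevents.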


Practical checklist — what to try, in order

  1. Baseline: frequency/count encoding + model that handles numerics.
  2. If signal is weak, try grouping rare categories into 'OTHER' or by domain rules.
  3. Target encoding with K-fold + smoothing (careful with leakage). Validate with a holdout and time-based splits if data is temporal.
  4. Feature hashing for scale/streaming needs.
  5. Entity embeddings if you have deep-learning capacity and lots of data.
  6. Try model-native strategies (CatBoost/LightGBM) and compare.
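Step 2 of the checklist, grouping rare categories, takes only a few lines of pandas; the threshold, column name, and values here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"sku": ["a", "a", "a", "b", "b", "c", "d"]})

# Collapse categories seen fewer than min_count times into a single OTHER bucket.
min_count = 2
counts = df["sku"].value_counts()
keep = counts[counts >= min_count].index
df["sku_grouped"] = df["sku"].where(df["sku"].isin(keep), "OTHER")
```

As with count encoding, compute `keep` on the training split and reuse it at serving time so that new rare categories also fall into OTHER.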

Special notes: temporal data & drift (you saw this coming)

  • If categories evolve (new users/products), evaluate on future data. Do not compute encodings using later data — that’s classic leakage.
  • For drift detection, monitor category distributions and encoding statistics (means/counts). If they shift, retrain or adapt encodings (online updates, adaptive smoothing).
  • Rare categories and positive-unlabeled settings: treat singletons carefully—maybe group them or use strong shrinkage. Rare categories are like rare events: they carry noise and potentially spurious signal.

Quick guide: handling unseen categories in production

  • Always provide a fallback encoding: global mean, frequency bin, or hashed bucket.
  • For embeddings/hashing, unseen values map to a default vector or deterministic hash.
  • Log and monitor frequency of "unknown"s — ramping unknowns = drift.
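The fallback idea reduces to a lookup with a default; a minimal sketch with invented encoding values:

```python
# Hypothetical statistics learned on training data.
train_encoding = {"gmail": 0.62, "yahoo": 0.41}
global_mean = 0.55  # prior used as the fallback

def encode(value: str) -> float:
    # dict.get with a default gives every unseen category the global prior.
    return train_encoding.get(value, global_mean)

print(encode("gmail"))       # known category -> its learned statistic
print(encode("protonmail"))  # unseen category -> fallback to the prior
```

Wrapping the fallback in one function also gives you a single place to count "unknown" hits for the drift monitoring mentioned above.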

Closing — TL;DR + a tiny rant

  • High-card categorical features are powerful but toxic if untreated. Treat them like gossip: verify before you pass it on.
  • Start simple (counts), graduate to target encoding with K-fold & smoothing, and scale with hashing or embeddings as needed.
  • Remember the lessons from rare events and drift: respect scarcity, avoid leakage, and watch for change.

Final thought: a model that memorizes category IDs at training may be very proud — until it meets the wild, brutal reality of production data. Build features that generalize, not trophies for your train set.

Version note: this builds on the previous module about tree-based models (use model-native categorical tools when available) and the sections on rare events and drift (apply shrinkage and monitor for category shift).
