Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Categorical Encoding Schemes — Turn Names into Numbers Without Losing Your Mind
You already cleaned missing values and hunted down outliers like a pest-control technician for data. Now meet the thing that decides whether your model sees categorical variables as elegant signals or chaotic noise: categorical encoding.
We build on the earlier modules (Handling Missing Values; Outlier Detection and Treatment) and the Foundations of Supervised Learning. You've seen why features matter; now we decide how to represent non-numeric features so models can learn from them.
Why encoding matters (and why you should care)
Models don't speak human. They only speak numbers. A category like 'blue', 'red', 'green' must become numeric without injecting fake order or leaking the target. Encoding choices can make or break linear models, nudge tree models differently, and wreck or boost performance when categories are high-cardinality (think: zip codes, user IDs).
Quick questions to keep you honest:
- Does the encoding impose an order where none exists? (Bad for unordered categories.)
- Does it create thousands of dummy columns and explode memory? (Bad for production.)
- Does it leak target information into features during preprocessing? (Very, very bad.)
Common encoding schemes (the toolbox)
1) One-Hot Encoding
- What: Create a binary column per category.
- Good for: Small cardinality, models that benefit from orthogonal features (logistic regression).
- Bad for: High-cardinality — leads to huge sparse matrices.
```python
import pandas as pd

# pandas-style: one binary column per category (assumes df has a 'color' column)
pd.get_dummies(df['color'], prefix='color', drop_first=False)
```
2) Label (Ordinal) Encoding
- What: Map categories to integers (red->1, blue->2).
- Good for: Ordinal categories (small, true order: 'low','medium','high').
- Bad for: Nominal categories — introduces spurious order.
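A minimal sketch of ordinal encoding with an explicit mapping, so the integers reflect the true order rather than whatever order the encoder happens to see first (the `size` column and level names here are illustrative):

```python
import pandas as pd

# Toy data with a genuinely ordered category
df = pd.DataFrame({'size': ['low', 'high', 'medium', 'low']})

# Explicit mapping preserves the real order: low < medium < high
order = {'low': 0, 'medium': 1, 'high': 2}
df['size_ord'] = df['size'].map(order)
```

An explicit dict beats automatic label encoding here because automatic encoders assign integers alphabetically, which would put 'high' < 'low' < 'medium'.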
3) Frequency / Count Encoding
- What: Replace category with its frequency or count in data.
- Good for: High-cardinality compactness; often surprisingly powerful for tree models.
- Caveat: counts computed on the full dataset can drift over time; compute them on training data only and watch for temporal shift between train and serving distributions.
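A quick frequency-encoding sketch with toy data (column name is illustrative): each category is replaced by its share of the rows.

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'SF', 'NY', 'LA']})

# Relative frequency of each category: NY -> 3/6, LA -> 2/6, SF -> 1/6
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)
```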
4) Target / Mean Encoding (a.k.a. Impact Encoding)
- What: Replace category with mean(target) for that category, perhaps smoothed.
- Good for: High-cardinality, predictive signal concentrated in category target rates.
- Danger: Target leakage if calculated on full train set. Use cross-validation, K-fold schemes, or leave-one-out properly.
Pseudo-smoothing formula (helps avoid noisy estimates):
mean_enc = (count * category_mean + prior * global_mean) / (count + prior)
Where 'prior' is a hyperparameter controlling shrinkage toward the global mean.
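The shrinkage formula above can be sketched in a few lines of pandas. This is computed on training data only; the toy data and `prior` value are illustrative.

```python
import pandas as pd

train = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'],
                      'target': [1, 1, 0, 1]})
prior = 2.0
global_mean = train['target'].mean()  # 0.75

# Per-category mean and count, then shrink toward the global mean
stats = train.groupby('cat')['target'].agg(['mean', 'count'])
enc = (stats['count'] * stats['mean'] + prior * global_mean) / (stats['count'] + prior)
train['cat_te'] = train['cat'].map(enc)
```

Rare categories (small count) are pulled strongly toward the global mean; frequent ones keep their own mean. Here 'a' encodes to (3 * 2/3 + 2 * 0.75) / 5 = 0.7 instead of its raw mean of 0.667.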
5) Binary / Hashing / BaseN Encoding
- What: Convert label-encoded integer into binary digits or hash into fixed number of buckets.
- Good for: Keeping dimension under control; hashing helps when you need a fixed-size vector.
- Bad for: Hash collisions can mix categories unpredictably.
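A hashing sketch using only the standard library: any category string maps deterministically into one of `n_buckets` slots, and collisions are possible by design (bucket count trades memory against collision risk).

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 16) -> int:
    """Deterministically map a category string to a bucket index."""
    digest = hashlib.md5(category.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

# Same string always lands in the same bucket; different strings may collide
buckets = [hash_bucket(c) for c in ['red', 'blue', 'green', 'blue']]
```

In practice a library hasher (e.g. scikit-learn's `FeatureHasher`) does this at scale, but the principle is exactly this: fixed output dimension, no vocabulary to store, no problem with unseen categories.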
6) Learned Embeddings (Deep Models)
- What: Let the model (e.g., a neural network) learn a dense vector per category.
- Good for: Very high-cardinality with relational structure (users, words). Efficient and expressive.
- Bad for: Needs lots of data; less interpretable.
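Mechanically, an embedding is just a lookup table: a (n_categories, dim) matrix whose rows are indexed by label-encoded category and updated by gradient descent during training. A hypothetical NumPy sketch of the lookup step (sizes and ids are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, dim = 1000, 8

# Randomly initialized table; in a real network these rows are learned
embedding_table = rng.normal(size=(n_categories, dim))

user_ids = np.array([3, 42, 3])        # label-encoded categories
vectors = embedding_table[user_ids]    # shape (3, 8), fed into the network
```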
7) Leave-One-Out / Cross-Fold Target Encoding
- What: Variant of target encoding that computes the mean target excluding the current row or via out-of-fold estimates.
- Why: Prevents direct leakage and overfitting when using target statistics.
Practical comparisons (cheat-sheet)
| Encoder | Memory | Interpretability | High-cardinality? | Leakage Risk | Best for |
|---|---|---|---|---|---|
| One-hot | High | High | No | Low | Linear models, small vocab |
| Ordinal | Low | Medium | No | Low | Ordered categories |
| Frequency | Low | Medium | Yes | Low | Trees, compactness |
| Target/Mean | Low | Low | Yes | High (if naive) | High-cardinality predictive signal |
| Hashing | Low | Low | Yes | Low (but collisions) | Streaming, fixed-dim needs |
| Embedding | Very Low (dense) | Low | Yes | Low (needs training) | Deep models, complex interactions |
Real-world analogies (because metaphors stick)
- One-hot encoding is like giving every friend their own throw pillow — comfortable, but your couch runs out of space.
- Target encoding is like asking the group what each friend’s favorite cocktail is and using that score to predict party success — informative, but rude unless done carefully (no peeking at results beforehand).
- Hashing is getting a fixed-size group photo; sometimes faces overlap and you can't tell who’s who later.
Pitfalls & how to avoid them
Target leakage with mean encoding
- Always compute target stats using out-of-fold or train-only splits. Never use the full dataset means.
Overfitting to rare categories
- Smooth mean encodings toward global mean (use the prior hyperparameter).
- Group rare categories into an 'OTHER' bucket or use frequency encoding.
Unseen categories at inference
- Use 'unknown' bin, hash into bucket, or fallback to global mean/frequency.
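A sketch of the fallback-to-global-mean option: `map()` yields NaN for categories absent from the training-time encoding, which we then fill (the encoding values and names here are illustrative).

```python
import pandas as pd

# Encoding learned on training data; 'purple' was never seen there
train_enc = pd.Series({'red': 0.4, 'blue': 0.7})
global_mean = 0.55

new_data = pd.Series(['blue', 'purple'])
encoded = new_data.map(train_enc).fillna(global_mean)
```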
Interactions & model choice
- Linear models need careful one-hot or target-coded features to capture effects. Trees can handle label codes but sometimes benefit more from frequency or target encodings.
Short recipes (practical pipelines)
- Small cardinality (<= 10): One-hot (or ordinal if ordered).
- Medium cardinality (10–100): Frequency + one-hot for top-k categories; group others.
- High cardinality (>100): Target/mean encoding (with CV smoothing) or hashing/embeddings.
- Trees vs Linear: Trees tolerate label- and frequency-encoded features; linear models typically need one-hot or carefully regularized target encoding.
Example: K-fold target encoding (sketch)
```python
import pandas as pd
from sklearn.model_selection import KFold

df['cat_te'] = float('nan')
global_mean = df['target'].mean()

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    # Fit the encoding map on the training fold only
    enc_map = df.iloc[train_idx].groupby('cat')['target'].mean()
    # Apply to the held-out fold; categories unseen in the fold fall back to the global mean
    vals = df.iloc[val_idx]['cat'].map(enc_map).fillna(global_mean).to_numpy()
    df.iloc[val_idx, df.columns.get_loc('cat_te')] = vals

# For the test set, map using an enc_map fit on the full training data (ideally smoothed)
```
Mini checklist before training
- Are categorical missing values encoded consistently with numeric missing value strategy from previous lesson?
- Did you check cardinality and group rare categories?
- Did you protect target-encoding with CV or smoothing to avoid leakage?
- Do you have a plan for unseen categories at inference time?
Closing: Key takeaways
- Encoding choice matters more than you think — it affects model bias, variance, interpretability, and performance.
- For small vocabularies, be explicit and interpretable (one-hot). For huge vocabularies, compress: frequency, target (carefully), hashing, or embeddings.
- Never leak target info — use out-of-fold schemes or smoothing.
Final thought: treating categories well is like social etiquette for data. If you respect them (correct encoding, smoothing, handling rare/unseen), they’ll reveal patterns politely. If you shove them into naive numeric boxes, they’ll gossip to the model and you’ll get nothing but noise and bad vibes.
Go forth, encode wisely, and may your features be informative and your models robust.