Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Encoding Categorical Variables — Practical Guide for Python Data Scientists
"Categorical data: the part of your dataset that looks like English but secretly hates your model."
You've already learned to wrangle data with pandas, impute missing values, and scale numeric features. Encoding categorical variables is the next logical step — it's where words become numbers and models stop complaining. This guide builds on Imputation Strategies and Scaling & Normalization and shows how to convert categories into model-ready features without creating a disaster for memory, performance, or statistical validity.
Why encoding matters (and when to care)
- Machine learning algorithms expect numbers. Some gradient-boosting libraries (e.g. LightGBM, CatBoost) can handle categories natively, but most scikit-learn models, linear models, and neural nets need numeric inputs.
- Wrong encoding = bias, leakage, or explosion of dimensions. One-hot a 10,000-category column and watch your RAM cry.
Types of categorical variables:
- Nominal: unordered labels (e.g., color = {red, blue, green})
- Ordinal: ordered categories (e.g., size = {small < medium < large})
- High-cardinality: many unique levels (e.g., user_id, product_sku)
Quick decision map (which encoding to pick)
- Ordinal? → Ordinal / explicit mapping
- Nominal & small cardinality? → One-hot / get_dummies / OneHotEncoder
- Nominal & high cardinality? → Target encoding / frequency / hashing / embeddings
- Want pipeline + safe test-time behavior? → Use ColumnTransformer + OneHotEncoder(handle_unknown='ignore')
Encoding methods (with examples)
1) Ordinal encoding — when order matters
Use when categories have a clear order.
# pandas mapping
df['size_ord'] = df['size'].map({'small': 0, 'medium': 1, 'large': 2})
# sklearn
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder(categories=[['small','medium','large']])  # avoid shadowing the builtin ord
df[['size_ord']] = ord_enc.fit_transform(df[['size']])
Micro explanation: Mapping preserves order but imposes distances (small→large = 2). Only use when that numeric interpretation is meaningful.
2) One-hot encoding — go-to for small nominal features
Produces a binary column per level.
# pandas
pd.get_dummies(df, columns=['color'], drop_first=False)
# sklearn pipeline (robust)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
cat_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # sparse_output replaces sparse in sklearn >= 1.2
preprocessor = ColumnTransformer(transformers=[('cat', cat_transformer, ['color','city'])])
Tips:
- Use handle_unknown='ignore' to avoid errors on unseen categories in test data.
- Consider drop='first', or use a linear model with regularization, to avoid the dummy variable trap (multicollinearity).
- After one-hot encoding, no scaling is usually required for tree models. For linear models, you may optionally scale.
3) Target / mean encoding — compact but dangerous if misused
Replace category with average target (or smoothed mean). Powerful for high-cardinality categorical predictors.
Pitfall: data leakage. If you compute encodings using the whole dataset, you leak target information into training. Use K-fold encoding or compute encodings on training folds and apply to validation/test.
# pseudo-code idea
# for each fold: compute mean(target) per category on train_fold, map to val_fold
# use regularization: posterior = (count*category_mean + prior*global_mean) / (count + prior_weight)
Use libraries: category_encoders.TargetEncoder or custom CV-based implementations.
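The pseudo-code above can be turned into a small out-of-fold implementation of the smoothed formula. This is a sketch, not a standard API: the function name kfold_target_encode and the prior_weight=10 default are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, cat_col, target_col, n_splits=5, prior_weight=10.0):
    """Out-of-fold smoothed target encoding.

    Each row is encoded using statistics from the *other* folds only,
    so its own target value never leaks into its encoding.
    """
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(df):
        stats = df.iloc[train_idx].groupby(cat_col)[target_col].agg(['mean', 'count'])
        # smoothing: blend each category's mean with the global mean
        smoothed = ((stats['count'] * stats['mean'] + prior_weight * global_mean)
                    / (stats['count'] + prior_weight))
        encoded.iloc[val_idx] = df[cat_col].iloc[val_idx].map(smoothed).to_numpy()
    # categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)
```

For production scoring, you would additionally fit one final encoding on the full training set and apply that to new data.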
4) Frequency / count encoding — simple & robust
Replace each category by its frequency or count. Fast, deterministic, and avoids leakage when computed on training only.
freq = df['cat'].value_counts(normalize=True)
df['cat_freq'] = df['cat'].map(freq)
Good baseline for high-cardinality features.
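To keep frequency encoding leakage-free in practice, compute the frequencies on training data only and map them onto new data; unseen categories then get an explicit fallback. A minimal sketch with toy train/test frames:

```python
import pandas as pd

train = pd.DataFrame({'cat': ['a', 'a', 'b', 'c']})
test = pd.DataFrame({'cat': ['a', 'd']})           # 'd' never seen in training

# fit on training data only, then map onto both sets
freq = train['cat'].value_counts(normalize=True)   # a: 0.5, b: 0.25, c: 0.25
train['cat_freq'] = train['cat'].map(freq)
test['cat_freq'] = test['cat'].map(freq).fillna(0.0)  # unseen category -> 0
```

Persisting `freq` alongside the model gives deterministic test-time behavior.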
5) Hashing trick — streaming & memory-efficient
sklearn.feature_extraction.FeatureHasher maps categories to fixed-size vectors using a hash. No need to store a vocabulary; unseen categories handled naturally.
Pros: memory-safe, constant-size output. Cons: possible collisions and less interpretability.
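A minimal FeatureHasher sketch for a single column; the n_features=16 size and the 'city=' token prefix are arbitrary choices here (prefixing with the column name keeps tokens from different columns distinct):

```python
from sklearn.feature_extraction import FeatureHasher

# input_type='string': each sample is an iterable of string tokens
hasher = FeatureHasher(n_features=16, input_type='string')
rows = [[f'city={c}'] for c in ['London', 'Paris', 'Tokyo', 'London']]
X = hasher.transform(rows)   # scipy.sparse matrix of shape (4, 16)
```

Because hashing is stateless, the same call works on unseen categories at inference time with no fitted vocabulary to persist.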
6) Binary / BaseN & other encodings
Binary or ordinal-compact encodings (e.g., binary encoding, base-n) reduce dimensionality compared to one-hot and keep some information.
Libraries: category_encoders provides binary, hash, count, target, and more.
Practical pipelines and tips (pandas + sklearn)
- Impute categorical missing values before encoding. From your Imputation lesson: use SimpleImputer(strategy='constant', fill_value='missing') or treat NaN as its own category.
- Keep consistent categories across train/test: fit encoders on training data only and persist the encoder.
- Avoid leakage with target-based encodings: use K-fold target encoding or compute on training set and apply to test with smoothing.
- Combine rare categories: group rare levels into 'Other' to reduce cardinality.
- Memory check: if one-hot leads to too many columns, switch to frequency/target/hash/embedding.
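Rare-level grouping from the tips above takes a few lines of pandas; the threshold of 5 occurrences is an arbitrary example value:

```python
import pandas as pd

s = pd.Series(['a'] * 50 + ['b'] * 45 + ['c'] * 3 + ['d'] * 2)
counts = s.value_counts()
rare = counts[counts < 5].index           # levels seen fewer than 5 times
grouped = s.where(~s.isin(rare), 'Other')
```

As with the other encoders, decide the rare set on training data only and reuse it at test time, so a level is grouped consistently.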
Example sklearn pipeline combining imputation and one-hot:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
cat_cols = ['city','color']
cat_pipeline = Pipeline([('impute', SimpleImputer(strategy='constant', fill_value='missing')),
                         ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
preprocessor = ColumnTransformer([('cat', cat_pipeline, cat_cols)])
model_pipeline = Pipeline([('pre', preprocessor), ('clf', LogisticRegression())])
Edge cases & gotchas
- Unseen categories in production: always use encoders with handle_unknown='ignore', or use hashing.
- Sparse output: keep sparse_output=True (the default; named sparse before sklearn 1.2) for OneHotEncoder when dealing with many columns, to save memory.
- Interaction with scaling: one-hot features don't need scaling for tree models; for linear models and neural nets, scale numeric features after encoding as appropriate.
- Interpreting coefficients: With one-hot, coefficients correspond to categories vs baseline. With target-encoding, coefficients reflect smoothed target rates — interpret carefully.
Quick cookbook — what to do, fast
- Small nominal (<20 levels): One-hot
- Ordinal: Map to integers with meaningful order
- High-cardinality: Frequency / Target (with CV) / Hash / Embedding
- Missing values: Fill with "missing" OR use SimpleImputer
- Pipelines: Always use ColumnTransformer + Pipeline to avoid leakage and guarantee reproducibility
Key takeaways
- Encoding is not one-size-fits-all. Choose method by variable type, cardinality, model, and memory constraints.
- Prevent leakage. Target-based methods are powerful but require CV-based implementations to be safe.
- Use pipelines. Fit encoders only on training data and persist them for production.
"Encoding categorical variables well is like teaching a shy translator how to introduce your data to the model — do it clumsily and the party ends early; do it thoughtfully and the model actually listens."
Further reading / next steps: implement target encoding with cross-validation, explore category_encoders, and try embeddings for categorical variables when using neural networks.