Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Encoding Categorical Variables — Practical Guide for Python Data Scientists
"Categorical data: the part of your dataset that looks like English but secretly hates your model."
You've already learned to wrangle data with pandas, impute missing values, and scale numeric features. Encoding categorical variables is the next logical step — it's where words become numbers and models stop complaining. This guide builds on Imputation Strategies and Scaling & Normalization and shows how to convert categories into model-ready features without creating a disaster for memory, performance, or statistical validity.
Why encoding matters (and when to care)
- Machine learning algorithms expect numbers. Some gradient-boosting libraries (e.g. LightGBM, CatBoost) can handle categories natively, but most scikit-learn models, linear models, and neural nets need numeric inputs.
- Wrong encoding = bias, leakage, or explosion of dimensions. One-hot a 10,000-category column and watch your RAM cry.
Types of categorical variables:
- Nominal: unordered labels (e.g., color = {red, blue, green})
- Ordinal: ordered categories (e.g., size = {small < medium < large})
- High-cardinality: many unique levels (e.g., user_id, product_sku)
Quick decision map (which encoding to pick)
- Ordinal? → Ordinal / explicit mapping
- Nominal & small cardinality? → One-hot / get_dummies / OneHotEncoder
- Nominal & high cardinality? → Target encoding / frequency / hashing / embeddings
- Want pipeline + safe test-time behavior? → Use ColumnTransformer + OneHotEncoder(handle_unknown='ignore')
Encoding methods (with examples)
1) Ordinal encoding — when order matters
Use when categories have a clear order.
# pandas mapping
df['size_ord'] = df['size'].map({'small': 0, 'medium': 1, 'large': 2})
# sklearn
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder(categories=[['small','medium','large']])  # avoid shadowing the builtin ord
df[['size_ord']] = ord_enc.fit_transform(df[['size']])
Micro explanation: Mapping preserves order but imposes distances (small→large = 2). Only use when that numeric interpretation is meaningful.
2) One-hot encoding — go-to for small nominal features
Produces a binary column per level.
# pandas
pd.get_dummies(df, columns=['color'], drop_first=False)
# sklearn pipeline (robust)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
cat_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # sparse_output replaces sparse in sklearn >= 1.2
preprocessor = ColumnTransformer(transformers=[('cat', cat_transformer, ['color','city'])])
Tips:
- Use handle_unknown='ignore' to avoid errors on unseen categories in test data.
- Consider drop='first', or use a linear model with regularization, to avoid the dummy variable trap (multicollinearity).
- After one-hot encoding, no scaling is usually required for tree models. For linear models, you may optionally scale.
3) Target / mean encoding — compact but dangerous if misused
Replace category with average target (or smoothed mean). Powerful for high-cardinality categorical predictors.
Pitfall: data leakage. If you compute encodings using the whole dataset, you leak target information into training. Use K-fold encoding or compute encodings on training folds and apply to validation/test.
# pseudo-code idea
# for each fold: compute mean(target) per category on train_fold, map to val_fold
# use regularization: posterior = (count*category_mean + prior*global_mean) / (count + prior_weight)
Use libraries: category_encoders.TargetEncoder or custom CV-based implementations.
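The pseudo-code above can be turned into a small out-of-fold implementation of the smoothed formula. This is a sketch, not a standard API: the function name kfold_target_encode and the prior_weight=10 default are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, cat_col, target_col, n_splits=5, prior_weight=10.0):
    """Out-of-fold smoothed target encoding.

    Each row is encoded using statistics from the *other* folds only,
    so its own target value never leaks into its encoding.
    """
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(df):
        stats = df.iloc[train_idx].groupby(cat_col)[target_col].agg(['mean', 'count'])
        # smoothing: blend each category's mean with the global mean
        smoothed = ((stats['count'] * stats['mean'] + prior_weight * global_mean)
                    / (stats['count'] + prior_weight))
        encoded.iloc[val_idx] = df[cat_col].iloc[val_idx].map(smoothed).to_numpy()
    # categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)
```

For production scoring, you would additionally fit one final encoding on the full training set and apply that to new data.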
4) Frequency / count encoding — simple & robust
Replace each category by its frequency or count. Fast, deterministic, and avoids leakage when computed on training only.
freq = df['cat'].value_counts(normalize=True)
df['cat_freq'] = df['cat'].map(freq)
Good baseline for high-cardinality features.
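To keep frequency encoding leakage-free in practice, compute the frequencies on training data only and map them onto new data; unseen categories then get an explicit fallback. A minimal sketch with toy train/test frames:

```python
import pandas as pd

train = pd.DataFrame({'cat': ['a', 'a', 'b', 'c']})
test = pd.DataFrame({'cat': ['a', 'd']})           # 'd' never seen in training

# fit on training data only, then map onto both sets
freq = train['cat'].value_counts(normalize=True)   # a: 0.5, b: 0.25, c: 0.25
train['cat_freq'] = train['cat'].map(freq)
test['cat_freq'] = test['cat'].map(freq).fillna(0.0)  # unseen category -> 0
```

Persisting `freq` alongside the model gives deterministic test-time behavior.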
5) Hashing trick — streaming & memory-efficient
sklearn.feature_extraction.FeatureHasher maps categories to fixed-size vectors using a hash. No need to store a vocabulary; unseen categories handled naturally.
Pros: memory-safe, constant-size output. Cons: possible collisions and less interpretability.
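A minimal FeatureHasher sketch for a single column; the n_features=16 size and the 'city=' token prefix are arbitrary choices here (prefixing with the column name keeps tokens from different columns distinct):

```python
from sklearn.feature_extraction import FeatureHasher

# input_type='string': each sample is an iterable of string tokens
hasher = FeatureHasher(n_features=16, input_type='string')
rows = [[f'city={c}'] for c in ['London', 'Paris', 'Tokyo', 'London']]
X = hasher.transform(rows)   # scipy.sparse matrix of shape (4, 16)
```

Because hashing is stateless, the same call works on unseen categories at inference time with no fitted vocabulary to persist.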
6) Binary / BaseN & other encodings
Binary or ordinal-compact encodings (e.g., binary encoding, base-n) reduce dimensionality compared to one-hot and keep some information.
Libraries: category_encoders provides binary, hash, count, target, and more.
Practical pipelines and tips (pandas + sklearn)
- Impute categorical missing values before encoding. From your Imputation lesson: use SimpleImputer(strategy='constant', fill_value='missing') or treat NaN as its own category.
- Keep consistent categories across train/test: fit encoders on training data only and persist the encoder.
- Avoid leakage with target-based encodings: use K-fold target encoding or compute on training set and apply to test with smoothing.
- Combine rare categories: group rare levels into 'Other' to reduce cardinality.
- Memory check: if one-hot leads to too many columns, switch to frequency/target/hash/embedding.
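Rare-level grouping from the tips above takes a few lines of pandas; the threshold of 5 occurrences is an arbitrary example value:

```python
import pandas as pd

s = pd.Series(['a'] * 50 + ['b'] * 45 + ['c'] * 3 + ['d'] * 2)
counts = s.value_counts()
rare = counts[counts < 5].index           # levels seen fewer than 5 times
grouped = s.where(~s.isin(rare), 'Other')
```

As with the other encoders, decide the rare set on training data only and reuse it at test time, so a level is grouped consistently.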
Example sklearn pipeline combining imputation and one-hot:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
cat_cols = ['city','color']
cat_pipeline = Pipeline([('impute', SimpleImputer(strategy='constant', fill_value='missing')),
                         ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
preprocessor = ColumnTransformer([('cat', cat_pipeline, cat_cols)])
model_pipeline = Pipeline([('pre', preprocessor), ('clf', LogisticRegression())])
Edge cases & gotchas
- Unseen categories in production: always use encoders with handle_unknown='ignore', or use hashing.
- Sparse output: keep sparse_output=True (the default; named sparse before sklearn 1.2) for OneHotEncoder when dealing with many columns, to save memory.
- Interaction with scaling: one-hot features don't need scaling for tree models; for linear models and neural nets, scale numeric features after encoding as appropriate.
- Interpreting coefficients: With one-hot, coefficients correspond to categories vs baseline. With target-encoding, coefficients reflect smoothed target rates — interpret carefully.
Quick cookbook — what to do, fast
- Small nominal (<20 levels): One-hot
- Ordinal: Map to integers with meaningful order
- High-cardinality: Frequency / Target (with CV) / Hash / Embedding
- Missing values: Fill with "missing" OR use SimpleImputer
- Pipelines: Always use ColumnTransformer + Pipeline to avoid leakage and guarantee reproducibility
Key takeaways
- Encoding is not one-size-fits-all. Choose method by variable type, cardinality, model, and memory constraints.
- Prevent leakage. Target-based methods are powerful but require CV-based implementations to be safe.
- Use pipelines. Fit encoders only on training data and persist them for production.
"Encoding categorical variables well is like teaching a shy translator how to introduce your data to the model — do it clumsily and the party ends early; do it thoughtfully and the model actually listens."
Further reading / next steps: implement target encoding with cross-validation, explore category_encoders, and try embeddings for categorical variables when using neural networks.