

Python for Data Science, AI & Development
Data Cleaning and Feature Engineering


Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.


Encoding Categorical Variables — Practical Guide for Python Data Scientists

"Categorical data: the part of your dataset that looks like English but secretly hates your model."

You've already learned to wrangle data with pandas, impute missing values, and scale numeric features. Encoding categorical variables is the next logical step — it's where words become numbers and models stop complaining. This guide builds on Imputation Strategies and Scaling & Normalization and shows how to convert categories into model-ready features without creating a disaster for memory, performance, or statistical validity.


Why encoding matters (and when to care)

  • Machine learning algorithms expect numbers. A few tree implementations (e.g., LightGBM, CatBoost) can handle categoricals natively, but most scikit-learn models, linear models, and neural nets need numeric inputs.
  • Wrong encoding = bias, leakage, or explosion of dimensions. One-hot a 10,000-category column and watch your RAM cry.

Types of categorical variables:

  • Nominal: unordered labels (e.g., color = {red, blue, green})
  • Ordinal: ordered categories (e.g., size = {small < medium < large})
  • High-cardinality: many unique levels (e.g., user_id, product_sku)

Quick decision map (which encoding to pick)

  1. Ordinal? → Ordinal / explicit mapping
  2. Nominal & small cardinality? → One-hot / get_dummies / OneHotEncoder
  3. Nominal & high cardinality? → Target encoding / frequency / hashing / embeddings
  4. Want pipeline + safe test-time behavior? → Use ColumnTransformer + OneHotEncoder(handle_unknown='ignore')

Encoding methods (with examples)

1) Ordinal encoding — when order matters

Use when categories have a clear order.

# pandas mapping
df['size_ord'] = df['size'].map({'small': 0, 'medium': 1, 'large': 2})

# sklearn (avoid naming the encoder `ord` — that shadows the built-in)
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df[['size_ord']] = ord_enc.fit_transform(df[['size']])

Micro explanation: Mapping preserves order but imposes distances (small→large = 2). Only use when that numeric interpretation is meaningful.


2) One-hot encoding — go-to for small nominal features

Produces a binary column per level.

# pandas
pd.get_dummies(df, columns=['color'], drop_first=False)

# sklearn pipeline (robust)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# note: the keyword is sparse_output (not sparse) in scikit-learn >= 1.2
cat_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
preprocessor = ColumnTransformer(transformers=[('cat', cat_transformer, ['color', 'city'])])

Tips:

  • Use handle_unknown='ignore' so unseen categories in test data encode as all zeros instead of raising an error.
  • Consider drop='first', or use a linear model with regularization, to avoid the dummy variable trap (multicollinearity).
  • After one-hot, scaling is usually unnecessary for tree models. For linear models, you may optionally scale.

3) Target / mean encoding — compact but dangerous if misused

Replace category with average target (or smoothed mean). Powerful for high-cardinality categorical predictors.

Pitfall: data leakage. If you compute encodings using the whole dataset, you leak target information into training. Use K-fold encoding or compute encodings on training folds and apply to validation/test.

# pseudo-code idea
# for each fold: compute mean(target) per category on train_fold, map to val_fold
# smoothing: posterior = (count * category_mean + prior_weight * global_mean) / (count + prior_weight)

Use libraries: category_encoders.TargetEncoder or custom CV-based implementations.
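The fold-based scheme above can be sketched end to end. This is a minimal illustration under stated assumptions, not the category_encoders implementation; the 'city' and 'churn' column names, the toy data, and the prior_weight value are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, cat_col, target_col, n_splits=5, prior_weight=10, seed=0):
    """Leakage-safe target encoding: each row's encoding comes only from
    the *other* folds, smoothed toward the global target mean."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        train, val = df.iloc[train_idx], df.iloc[val_idx]
        stats = train.groupby(cat_col)[target_col].agg(['mean', 'count'])
        # posterior = (count * category_mean + prior_weight * global_mean) / (count + prior_weight)
        smoothed = (stats['count'] * stats['mean'] + prior_weight * global_mean) \
                   / (stats['count'] + prior_weight)
        # categories absent from the training fold fall back to the global mean
        encoded.iloc[val_idx] = val[cat_col].map(smoothed).fillna(global_mean).to_numpy()
    return encoded

# toy example with hypothetical columns
df = pd.DataFrame({'city':  ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'a'],
                   'churn': [1, 0, 1, 1, 0, 1, 0, 1]})
df['city_te'] = kfold_target_encode(df, 'city', 'churn', n_splits=4)
```

Note that rare categories (like 'c' here) naturally shrink toward the global mean — that is the smoothing doing its job.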


4) Frequency / count encoding — simple & robust

Replace each category by its frequency or count. Fast, deterministic, and avoids leakage when computed on training only.

# compute frequencies on the training split only, then map onto any split
freq = df['cat'].value_counts(normalize=True)
df['cat_freq'] = df['cat'].map(freq)

Good baseline for high-cardinality features.


5) Hashing trick — streaming & memory-efficient

sklearn.feature_extraction.FeatureHasher maps categories to fixed-size vectors using a hash function. There is no vocabulary to store, and unseen categories are handled naturally.

Pros: memory-safe, constant-size output. Cons: possible collisions and less interpretability.
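A minimal sketch of the hashing trick on a single column (the 'city' values are hypothetical). With dict input, FeatureHasher hashes the string 'city=&lt;value&gt;', so every category lands in one of a fixed number of columns:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'city': ['london', 'paris', 'tokyo', 'paris']})

# output width is fixed at n_features, no matter how many cities ever appear
hasher = FeatureHasher(n_features=16, input_type='dict')
X = hasher.transform({'city': v} for v in df['city'])  # sparse matrix, shape (4, 16)
```

Identical categories always hash to identical rows (the two 'paris' rows match exactly); the flip side is that two distinct categories may collide into the same column.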


6) Binary / BaseN & other encodings

Binary or ordinal-compact encodings (e.g., binary encoding, base-n) reduce dimensionality compared to one-hot and keep some information.

Libraries: category_encoders provides binary, hash, count, target, and more.
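To make the idea concrete without pulling in category_encoders, here is a hand-rolled binary-encoding sketch (an approximation of what BinaryEncoder does, minus its handling of unseen levels): each category gets an integer code, and the code's bits become columns — 3 categories need only 2 bit columns instead of 3 one-hot columns.

```python
import numpy as np
import pandas as pd

def binary_encode(s, n_bits=None):
    """Minimal binary encoding sketch: map each category to an integer
    code, then spread that code across ceil(log2(n_categories)) bit columns."""
    codes = s.astype('category').cat.codes.to_numpy()
    n_bits = n_bits or max(1, int(np.ceil(np.log2(max(codes.max() + 1, 2)))))
    bits = (codes[:, None] >> np.arange(n_bits)) & 1  # one column per bit
    return pd.DataFrame(bits, index=s.index,
                        columns=[f'{s.name}_bit{i}' for i in range(n_bits)])

colors = pd.Series(['red', 'blue', 'green', 'red', 'blue'], name='color')
encoded = binary_encode(colors)  # 3 categories -> 2 bit columns
```

The trade-off versus one-hot: far fewer columns, but a model must now combine bits to recover a single category, which costs some interpretability.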


Practical pipelines and tips (pandas + sklearn)

  1. Impute categorical missing values before encoding. From your Imputation lesson: use SimpleImputer(strategy='constant', fill_value='missing') or treat NaN as its own category.
  2. Keep consistent categories across train/test: fit encoders on training data only and persist the encoder.
  3. Avoid leakage with target-based encodings: use K-fold target encoding or compute on training set and apply to test with smoothing.
  4. Combine rare categories: group rare levels into 'Other' to reduce cardinality.
  5. Memory check: if one-hot leads to too many columns, switch to frequency/target/hash/embedding.
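Tip 4 (grouping rare levels) can be sketched as follows; the threshold and city names are arbitrary, and in practice you would compute the rare set on the training split only:

```python
import pandas as pd

def group_rare(s, min_count=2, other_label='Other'):
    """Replace levels appearing fewer than min_count times with one label."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), other_label)

cities = pd.Series(['nyc', 'nyc', 'la', 'sf', 'nyc', 'la', 'boise'])
grouped = group_rare(cities, min_count=2)  # 'sf' and 'boise' -> 'Other'
```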

Example sklearn pipeline combining imputation and one-hot:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

cat_cols = ['city', 'color']
cat_pipeline = Pipeline([('impute', SimpleImputer(strategy='constant', fill_value='missing')),
                         ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
preprocessor = ColumnTransformer([('cat', cat_pipeline, cat_cols)])

model_pipeline = Pipeline([('pre', preprocessor), ('clf', LogisticRegression())])

Edge cases & gotchas

  • Unseen categories in production: Always use encoders with handle_unknown='ignore' or hashing.
  • Sparse output: leave OneHotEncoder's output sparse (sparse_output=True, the default) when dealing with many columns to save memory.
  • Interaction with scaling: One-hot features don't need scaling for tree models; for linear/NNs, consider scaling numeric features after encoding as appropriate.
  • Interpreting coefficients: With one-hot, coefficients correspond to categories vs baseline. With target-encoding, coefficients reflect smoothed target rates — interpret carefully.

Quick cookbook — what to do, fast

  • Small nominal (<20 levels): One-hot
  • Ordinal: Map to integers with meaningful order
  • High-cardinality: Frequency / Target (with CV) / Hash / Embedding
  • Missing values: Fill with "missing" OR use SimpleImputer
  • Pipelines: Always use ColumnTransformer + Pipeline to avoid leakage and guarantee reproducibility

Key takeaways

  • Encoding is not one-size-fits-all. Choose method by variable type, cardinality, model, and memory constraints.
  • Prevent leakage. Target-based methods are powerful but require CV-based implementations to be safe.
  • Use pipelines. Fit encoders only on training data and persist them for production.

"Encoding categorical variables well is like teaching a shy translator how to introduce your data to the model — do it clumsily and the party ends early; do it thoughtfully and the model actually listens."


Further reading / next steps: implement target encoding with cross-validation, explore category_encoders, and try embeddings for categorical variables when using neural networks.
