Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Ordinal vs Nominal Encodings — The Friendly Roast of Categorical Data
"Encoding is where your categorical data either becomes a hero or a secret saboteur." — Someone who has debugged a mysterious model at 2 AM
You're already past the basics (Foundations of Supervised Learning) and you sat through Categorical Encoding Schemes (Position 4) — so you know there are many ways to translate words into numbers. Here we zoom in on the clash-of-the-titans pair: Ordinal vs Nominal encodings. This is the place where semantics meet math and bad assumptions become model bias.
Why this matters (quick reminder)
- Choosing the wrong encoding can create fake order, hurt linear models, confuse distance metrics, or wreck scaling and outlier detection (see our Outlier Detection and Treatment notes, Position 3).
- Some learners (linear regression, logistic regression, k-NN, SVM) are sensitive to numeric relationships implied by your encoding. Others (tree-based models) are more forgiving — but "more forgiving" isn't license to be sloppy.
The basic definitions — stop pretending you didn’t know this
- Nominal: categories with no intrinsic order. Examples: color = {red, blue, green}, city = {NYC, LA, SF}.
- Ordinal: categories with a natural order, but distances between levels are not necessarily equal. Examples: education = {high-school < bachelor < master < phd}, pain_level = {none < mild < moderate < severe}.
Important nuance: ordinal implies order, not equal spacing. A jump from bachelor to master might not be the same 'distance' as master to phd.
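To make that nuance concrete, here's a minimal sketch (using pandas, values illustrative) of how an ordered categorical records order without committing to any numeric spacing:

```python
import pandas as pd

# Ordered categorical: pandas stores the ordering of the levels without
# forcing any numeric spacing between them.
education = pd.Categorical(
    ["bachelor", "phd", "high-school", "master"],
    categories=["high-school", "bachelor", "master", "phd"],
    ordered=True,
)
print(education.min())                  # comparisons respect the declared order
print((education < "master").tolist())  # element-wise order comparison
```

Order-aware operations (min, max, <, sorting) work, but no distance between levels is implied — that decision is deferred until you actually encode.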
Encoding options and when they make sense
1) Nominal variables — do NOT encode as integers
Why not: giving unique integers like red=1, blue=2, green=3 falsely suggests blue is 'twice' red or closer to green than red is.
Good options:
- One-Hot Encoding (OHE) — creates binary columns per category. Great for most linear models and distance-based methods.
- Binary / Hashing / Target / Embeddings — advanced options if cardinality is high.
When to use OHE: small to medium cardinality; interpretable models.
Code snippet (sketch):
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False, drop='first')  # drop first to avoid collinearity; sparse_output replaces the old sparse arg (sklearn >= 1.2)
X_nominal = enc.fit_transform(df[['color']])
Caveat: OHE increases dimensionality. If categories are many (high cardinality), consider target encoding, embeddings, or hashing.
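As a hedge against exploding dimensionality, the hashing trick maps any number of categories into a fixed number of columns. A minimal sketch (the `n_features` value and city strings are illustrative assumptions):

```python
from sklearn.feature_extraction import FeatureHasher

# Hashing trick: output width is fixed regardless of cardinality.
hasher = FeatureHasher(n_features=8, input_type="string")
cities = [["city=NYC"], ["city=LA"], ["city=SF"], ["city=NYC"]]
X_hashed = hasher.transform(cities)  # no fit needed — hashing is stateless
print(X_hashed.shape)                # stays (n_samples, 8) no matter how many cities appear
```

The trade-off: hash collisions can merge unrelated categories, and the columns lose interpretability — acceptable for high-cardinality features, overkill for a handful of colors.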
2) Ordinal variables — preserve order, but be careful about spacing
Options:
- Explicit integer mapping (e.g., high-school=0, bachelor=1, master=2, phd=3). Use when order matters and you believe increasing order correlates (monotonically) with the target.
- OrdinalEncoder (sklearn) — similar mapping but for multiple columns.
- Custom mapping with domain knowledge — e.g., map pain_level to {0, 1, 3, 6} if you believe jumps are not uniform.
- Alternative: monotonic target encoding or embeddings if relations are complex.
Code snippet (sketch):
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[['high-school','bachelor','master','phd']])
X_ord = enc.fit_transform(df[['education']])
When a simple integer mapping is fine: when order plausibly correlates with the output and the model can exploit monotonicity (e.g., higher education -> higher salary).
When it's not fine: when those integers feed k-NN or other distance-based algorithms that treat the spacing as literal, or when a linear model will assume equal spacing between levels that isn't really there.
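A quick numeric sketch of why spacing bites distance-based models — with arbitrary integers red=0, blue=1, green=2, the red–green distance is twice the red–blue distance, while one-hot makes every pair of distinct categories equidistant:

```python
import numpy as np

# Integer-encoded colors: red=0, blue=1, green=2 (arbitrary assignment)
ints = np.array([[0.0], [1.0], [2.0]])
d_int = np.abs(ints - ints.T)  # pairwise 1-D distances
print(d_int[0, 2])             # red-green distance is 2x red-blue — pure artifact

# One-hot: all distinct categories are equidistant
onehot = np.eye(3)
d_oh = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=-1)
print(d_oh[0, 1], d_oh[0, 2])  # both sqrt(2) — no fake geometry
```

Any k-NN or k-means model consuming the integer version inherits that fake geometry for free.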
Model sensitivity cheat-sheet
| Model type | Effect of ordinal-as-integers | Effect of one-hot on nominal |
|---|---|---|
| Linear models (OLS, logistic) | Interprets numeric spacing — risky if spacing isn't meaningful | Works well; interpretable coefficients per category |
| Tree-based (RandomForest, XGBoost) | Often robust — trees split on thresholds, so arbitrary ints are less harmful | Also works fine; OHE is sometimes unnecessary and just adds complexity |
| Distance-based (k-NN, KMeans) | Bad: distances are distorted by arbitrary numeric assignments | Good: OHE avoids false distances, but increases dimensionality |
| Neural nets | Can learn relationships but need careful embedding if cardinality high | Embeddings often best for high-cardinality categories |
Practical pitfalls (the stuff that makes models lie)
- LabelEncoder misuse: sklearn's LabelEncoder is designed for target labels, yet people apply it to nominal features and assume it's fine. It introduces an arbitrary order.
- Assuming equal spacing: Encoding ordinal as 0,1,2 assumes equal spacing. If not true, your linear model will misattribute effects.
- Dummy variable trap: Never forget collinearity when using OHE with intercepts — drop one column or use regularization.
- Leakage in target encoding: If you target-encode categories, do it within CV folds — otherwise you leak target info.
- Outlier/scale interactions: Turning categories into numbers can create artificial 'outliers' that influence scaling and outlier detection (see Outlier Detection and Treatment). For example, mapping rare category -> 100 can be flagged as an outlier.
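For the target-encoding leakage point above, here is a minimal fold-wise sketch (column names and data are illustrative): each row is encoded using only statistics computed from the other folds, with a global-mean fallback for categories unseen in a training fold:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data — 'city' is the categorical feature, 'y' the binary target.
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF", "LA", "SF", "NYC", "LA"],
    "y":    [1, 0, 1, 0, 1, 0, 0, 1],
})
df["city_te"] = np.nan
global_mean = df["y"].mean()  # fallback for categories missing from a training fold

for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Category means computed ONLY on the training fold — no target leakage
    means = df.iloc[train_idx].groupby("city")["y"].mean()
    df.loc[df.index[val_idx], "city_te"] = (
        df.iloc[val_idx]["city"].map(means).fillna(global_mean).values
    )
```

In production you'd reach for a maintained implementation with smoothing, but the invariant is the same: a row never contributes to its own encoding.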
Decision flow (quick checklist)
- Is the categorical variable ordered by nature? If no, treat as nominal. If yes, treat as ordinal.
- If ordinal: can you reasonably assign numeric scores that reflect the underlying distance? If yes, use ordinal mapping; if no, consider monotonic target encoding or embeddings.
- Check model type: for linear models, prefer OHE for nominal; consider ordinal mapping carefully for ordinal features. For trees, ordinal-as-ints often OK but still verify.
- For high cardinality nominal variables: avoid OHE; use hashing, target, or embeddings.
- Validate: run experiments with both encodings in CV and inspect performance and coefficients/feature importances.
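The last checklist item — trying both encodings under CV — can be sketched like this (toy data with a deterministic target, purely illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data: target is a clean function of education, just to exercise the loop.
levels = ["high-school", "bachelor", "master", "phd"]
df = pd.DataFrame({"education": levels * 10})
y = df["education"].isin(["master", "phd"]).astype(int)

results = {}
for name, enc in [
    ("ordinal", OrdinalEncoder(categories=[levels])),
    ("one-hot", OneHotEncoder()),
]:
    # Encoder lives inside the pipeline, so it is re-fitted per CV fold
    pipe = Pipeline([("encode", enc), ("model", LogisticRegression())])
    results[name] = cross_val_score(pipe, df[["education"]], y, cv=5).mean()
print(results)
```

On real data the two scores diverge in informative ways; inspecting coefficients or feature importances alongside the scores tells you which encoding the model actually exploited.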
Tiny worked example (mental model)
Imagine "education" vs "favorite_color":
- education = ordinal — mapping high-school=0, bachelor=1, master=2, phd=3 could be fine (but mind the spacing!).
- favorite_color = nominal — don't map blue=1, green=2; one-hot it instead.
Result: If you accidentally integer-encode color, a linear model might learn a slope where none exists. If you one-hot encode education (instead of using order), you lose the monotonic signal but gain flexibility.
Quick engineering recipes (battle-tested)
- For ordinal with clear levels and monotonic expectation: map to integers, but test with one-hot to be safe.
- For nominal small-cardinality: one-hot with drop='first' (or regularize heavily).
- For nominal high-cardinality: use frequency/target/hash/embeddings — never naive ints.
- Always pipeline: encoding -> scaling (if needed) -> model inside sklearn Pipeline to prevent leakage.
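The pipeline recipe above might look like this sketch — a ColumnTransformer routing a hypothetical nominal, ordinal, and numeric column through their own transformers, all fitted inside the pipeline so CV folds stay leak-free (column names and data are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical schema: 'color' (nominal), 'education' (ordinal), 'age' (numeric).
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green"],
    "education": ["bachelor", "phd", "master", "high-school", "master", "phd"],
    "age": [25, 40, 33, 29, 51, 38],
})
y = [0, 1, 1, 0, 1, 1]

preprocess = ColumnTransformer([
    ("nominal", OneHotEncoder(drop="first"), ["color"]),
    ("ordinal", OrdinalEncoder(categories=[["high-school", "bachelor", "master", "phd"]]), ["education"]),
    ("numeric", StandardScaler(), ["age"]),
])
# Because encoders and the scaler are fitted inside the pipeline, cross-validation
# never leaks validation rows into any preprocessing step.
pipe = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
pipe.fit(df, y)
```

Swap the final estimator freely; the preprocessing contract stays the same.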
Closing — the takeaways (so you don’t forget at 2 AM)
- Order matters. If a feature is ordinal, keep the order. If it's nominal, never invent one.
- Spacing also matters. Ordinal integers imply spacing — be honest about what spacing means for your data and model.
- Model-aware encoding. Match encoding to model type and cardinality; trees forgive, linear models punish sloppy numeric semantics.
- Always validate. Try alternative encodings, check CV, inspect feature effects, and cross-check with domain knowledge.
Final mic drop: Encoding is not just syntax — it’s semantics dressed up as numbers. If you encode the world wrong, your model will confidently be wrong.
Next up (recommended): revisit Categorical Encoding Schemes notes to compare target encoding and embeddings for tricky high-cardinality cases, and re-check your outlier pipeline to make sure encodings haven't introduced ghost outliers.