Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Ordinal vs Nominal Encodings — The Friendly Roast of Categorical Data
"Encoding is where your categorical data either becomes a hero or a secret saboteur." — Someone who has debugged a mysterious model at 2 AM
You're already past the basics (Foundations of Supervised Learning) and you sat through Categorical Encoding Schemes (Position 4) — so you know there are many ways to translate words into numbers. Here we zoom in on the clash-of-the-titans pair: Ordinal vs Nominal encodings. This is the place where semantics meet math and bad assumptions become model bias.
Why this matters (quick reminder)
- Choosing the wrong encoding can create fake order, hurt linear models, confuse distance metrics, or wreck scaling and outlier detection (see our Outlier Detection and Treatment notes, Position 3).
- Some learners (linear regression, logistic regression, k-NN, SVM) are sensitive to numeric relationships implied by your encoding. Others (tree-based models) are more forgiving — but "more forgiving" isn't license to be sloppy.
The basic definitions — stop pretending you didn’t know this
- Nominal: categories with no intrinsic order. Examples: color = {red, blue, green}, city = {NYC, LA, SF}.
- Ordinal: categories with a natural order, but distances between levels are not necessarily equal. Examples: education = {high-school < bachelor < master < phd}, pain_level = {none < mild < moderate < severe}.
Important nuance: ordinal implies order, not equal spacing. A jump from bachelor to master might not be the same 'distance' as master to phd.
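To make that nuance concrete, here's a minimal sketch (using pandas, values illustrative) of how an ordered categorical records order without committing to any numeric spacing:

```python
import pandas as pd

# Ordered categorical: pandas stores the ordering of the levels without
# forcing any numeric spacing between them.
education = pd.Categorical(
    ["bachelor", "phd", "high-school", "master"],
    categories=["high-school", "bachelor", "master", "phd"],
    ordered=True,
)
print(education.min())                  # comparisons respect the declared order
print((education < "master").tolist())  # element-wise order comparison
```

Order-aware operations (min, max, <, sorting) work, but no distance between levels is implied — that decision is deferred until you actually encode.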
Encoding options and when they make sense
1) Nominal variables — do NOT encode as integers
Why not: giving unique integers like red=1, blue=2, green=3 falsely suggests blue is 'twice' red or closer to green than red is.
Good options:
- One-Hot Encoding (OHE) — creates binary columns per category. Great for most linear models and distance-based methods.
- Binary / Hashing / Target / Embeddings — advanced options if cardinality is high.
When to use OHE: small to medium cardinality; interpretable models.
Code snippet (sketch):
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False, drop='first')  # drop first to avoid collinearity; sparse_output replaces the old sparse arg (sklearn >= 1.2)
X_nominal = enc.fit_transform(df[['color']])
Caveat: OHE increases dimensionality. If categories are many (high cardinality), consider target encoding, embeddings, or hashing.
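As a hedge against exploding dimensionality, the hashing trick maps any number of categories into a fixed number of columns. A minimal sketch (the `n_features` value and city strings are illustrative assumptions):

```python
from sklearn.feature_extraction import FeatureHasher

# Hashing trick: output width is fixed regardless of cardinality.
hasher = FeatureHasher(n_features=8, input_type="string")
cities = [["city=NYC"], ["city=LA"], ["city=SF"], ["city=NYC"]]
X_hashed = hasher.transform(cities)  # no fit needed — hashing is stateless
print(X_hashed.shape)                # stays (n_samples, 8) no matter how many cities appear
```

The trade-off: hash collisions can merge unrelated categories, and the columns lose interpretability — acceptable for high-cardinality features, overkill for a handful of colors.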
2) Ordinal variables — preserve order, but be careful about spacing
Options:
- Explicit integer mapping (e.g., high-school=0, bachelor=1, master=2, phd=3). Use when order matters and you believe increasing order correlates (monotonically) with the target.
- OrdinalEncoder (sklearn) — similar mapping but for multiple columns.
- Custom mapping with domain knowledge — e.g., map pain_level to {0, 1, 3, 6} if you believe jumps are not uniform.
- Alternative: monotonic target encoding or embeddings if relations are complex.
Code snippet (sketch):
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[['high-school','bachelor','master','phd']])
X_ord = enc.fit_transform(df[['education']])
When a simple integer mapping is fine: when order plausibly correlates with the output and the model can exploit monotonicity (e.g., higher education -> higher salary).
When it's not fine: when those integers feed k-NN or other distance-based algorithms that treat the spacing as literal, or when a linear model will assume equal spacing between levels that isn't really there.
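A quick numeric sketch of why spacing bites distance-based models — with arbitrary integers red=0, blue=1, green=2, the red–green distance is twice the red–blue distance, while one-hot makes every pair of distinct categories equidistant:

```python
import numpy as np

# Integer-encoded colors: red=0, blue=1, green=2 (arbitrary assignment)
ints = np.array([[0.0], [1.0], [2.0]])
d_int = np.abs(ints - ints.T)  # pairwise 1-D distances
print(d_int[0, 2])             # red-green distance is 2x red-blue — pure artifact

# One-hot: all distinct categories are equidistant
onehot = np.eye(3)
d_oh = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=-1)
print(d_oh[0, 1], d_oh[0, 2])  # both sqrt(2) — no fake geometry
```

Any k-NN or k-means model consuming the integer version inherits that fake geometry for free.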
Model sensitivity cheat-sheet
| Model type | Effect of ordinal-as-integers | Effect of one-hot on nominal |
|---|---|---|
| Linear models (OLS, logistic) | Interprets numeric spacing — risky if spacing isn't meaningful | Works well; interpretable coefficients per category |
| Tree-based (RandomForest, XGBoost) | Often robust — trees split on thresholds, so arbitrary ints are less harmful | Also works fine; OHE is sometimes unnecessary and just adds complexity |
| Distance-based (k-NN, KMeans) | Bad: distances are distorted by arbitrary numeric assignments | Good: OHE avoids false distances, but increases dimensionality |
| Neural nets | Can learn relationships but need careful embedding if cardinality high | Embeddings often best for high-cardinality categories |
Practical pitfalls (the stuff that makes models lie)
- LabelEncoder misuse: sklearn's LabelEncoder is designed for target labels, yet people apply it to nominal features and assume it's fine. It introduces an arbitrary order.
- Assuming equal spacing: Encoding ordinal as 0,1,2 assumes equal spacing. If not true, your linear model will misattribute effects.
- Dummy variable trap: Never forget collinearity when using OHE with intercepts — drop one column or use regularization.
- Leakage in target encoding: If you target-encode categories, do it within CV folds — otherwise you leak target info.
- Outlier/scale interactions: Turning categories into numbers can create artificial 'outliers' that influence scaling and outlier detection (see Outlier Detection and Treatment). For example, mapping rare category -> 100 can be flagged as an outlier.
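For the target-encoding leakage point above, here is a minimal fold-wise sketch (column names and data are illustrative): each row is encoded using only statistics computed from the other folds, with a global-mean fallback for categories unseen in a training fold:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data — 'city' is the categorical feature, 'y' the binary target.
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF", "LA", "SF", "NYC", "LA"],
    "y":    [1, 0, 1, 0, 1, 0, 0, 1],
})
df["city_te"] = np.nan
global_mean = df["y"].mean()  # fallback for categories missing from a training fold

for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Category means computed ONLY on the training fold — no target leakage
    means = df.iloc[train_idx].groupby("city")["y"].mean()
    df.loc[df.index[val_idx], "city_te"] = (
        df.iloc[val_idx]["city"].map(means).fillna(global_mean).values
    )
```

In production you'd reach for a maintained implementation with smoothing, but the invariant is the same: a row never contributes to its own encoding.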
Decision flow (quick checklist)
- Is the categorical variable ordered by nature? If no, treat as nominal. If yes, treat as ordinal.
- If ordinal: can you reasonably assign numeric scores that reflect the underlying distance? If yes, use ordinal mapping; if no, consider monotonic target encoding or embeddings.
- Check model type: for linear models, prefer OHE for nominal; consider ordinal mapping carefully for ordinal features. For trees, ordinal-as-ints often OK but still verify.
- For high cardinality nominal variables: avoid OHE; use hashing, target, or embeddings.
- Validate: run experiments with both encodings in CV and inspect performance and coefficients/feature importances.
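The last checklist item — trying both encodings under CV — can be sketched like this (toy data with a deterministic target, purely illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data: target is a clean function of education, just to exercise the loop.
levels = ["high-school", "bachelor", "master", "phd"]
df = pd.DataFrame({"education": levels * 10})
y = df["education"].isin(["master", "phd"]).astype(int)

results = {}
for name, enc in [
    ("ordinal", OrdinalEncoder(categories=[levels])),
    ("one-hot", OneHotEncoder()),
]:
    # Encoder lives inside the pipeline, so it is re-fitted per CV fold
    pipe = Pipeline([("encode", enc), ("model", LogisticRegression())])
    results[name] = cross_val_score(pipe, df[["education"]], y, cv=5).mean()
print(results)
```

On real data the two scores diverge in informative ways; inspecting coefficients or feature importances alongside the scores tells you which encoding the model actually exploited.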
Tiny worked example (mental model)
Imagine "education" vs "favorite_color":
- education = ordinal — mapping high-school=0, bachelor=1, master=2, phd=3 could be fine (but mind the spacing!).
- favorite_color = nominal — don't map blue=1, green=2; one-hot it instead.
Result: If you accidentally integer-encode color, a linear model might learn a slope where none exists. If you one-hot encode education (instead of using order), you lose the monotonic signal but gain flexibility.
Quick engineering recipes (battle-tested)
- For ordinal with clear levels and monotonic expectation: map to integers, but test with one-hot to be safe.
- For nominal small-cardinality: one-hot with drop='first' (or regularize heavily).
- For nominal high-cardinality: use frequency/target/hash/embeddings — never naive ints.
- Always pipeline: encoding -> scaling (if needed) -> model inside sklearn Pipeline to prevent leakage.
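The pipeline recipe above might look like this sketch — a ColumnTransformer routing a hypothetical nominal, ordinal, and numeric column through their own transformers, all fitted inside the pipeline so CV folds stay leak-free (column names and data are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical schema: 'color' (nominal), 'education' (ordinal), 'age' (numeric).
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green"],
    "education": ["bachelor", "phd", "master", "high-school", "master", "phd"],
    "age": [25, 40, 33, 29, 51, 38],
})
y = [0, 1, 1, 0, 1, 1]

preprocess = ColumnTransformer([
    ("nominal", OneHotEncoder(drop="first"), ["color"]),
    ("ordinal", OrdinalEncoder(categories=[["high-school", "bachelor", "master", "phd"]]), ["education"]),
    ("numeric", StandardScaler(), ["age"]),
])
# Because encoders and the scaler are fitted inside the pipeline, cross-validation
# never leaks validation rows into any preprocessing step.
pipe = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
pipe.fit(df, y)
```

Swap the final estimator freely; the preprocessing contract stays the same.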
Closing — the takeaways (so you don’t forget at 2 AM)
- Order matters. If a feature is ordinal, keep the order. If it's nominal, never invent one.
- Spacing also matters. Ordinal integers imply spacing — be honest about what spacing means for your data and model.
- Model-aware encoding. Match encoding to model type and cardinality; trees forgive, linear models punish sloppy numeric semantics.
- Always validate. Try alternative encodings, check CV, inspect feature effects, and cross-check with domain knowledge.
Final mic drop: Encoding is not just syntax — it’s semantics dressed up as numbers. If you encode the world wrong, your model will confidently be wrong.
Next up (recommended): revisit Categorical Encoding Schemes notes to compare target encoding and embeddings for tricky high-cardinality cases, and re-check your outlier pipeline to make sure encodings haven't introduced ghost outliers.