Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Data Types and Tidy Structure — Marie Kondo for Your Features
"If your dataset doesn't spark clarity, it sparks bugs." — Probably me, 3am, debugging a model
You're already fluent in the language of labels and supervision (how we frame regression vs classification) and you know to lock down randomness with seeds for reproducibility. Now we get to the boring-but-heroic work: making your data behave. This is where models get their diet and your results stop looking like interpretive dance.
Why this matters (without lecturing you again on bias-variance)
A messy dataset will: slow training, sabotage cross-validation, give you subtly wrong feature importances, and make reproducibility a lie. Tidy structure + correct types = predictable preprocessing pipelines, easier feature engineering, and models that actually generalize instead of memorizing your spreadsheet's eccentricities.
Think of data types and tidy structure as the plumbing of ML. If the plumbing is good, you won't drown in bugs when you add fancy fixtures (regularization, ensembles, neural nets). If the pipes are trash, nothing else matters.
Core concepts, quick and caffeinated
1) Tidy data principle (Hadley Wickham's gospel)
- Each variable is a column. (Features and target are columns.)
- Each observation is a row. (One row = one example.)
- Each value is a single cell. (No lists, no comma-separated horrors.)
Imagine a party: tidy data is everyone standing in single-file, name tags on, ready for a headcount. Messy data is people piled on couches, one person holding three name tags.
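To make the single-file-party picture concrete, here's a minimal sketch (with made-up city/temperature values) of untidying a "wide" frame where one variable, the year, is smeared across column names:

```python
import pandas as pd

# Messy: one variable (year) is spread across the column names
wide = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "temp_2022": [5.1, 19.3],
    "temp_2023": [5.4, 19.8],
})

# Tidy: each variable gets its own column, each observation its own row
tidy = wide.melt(id_vars="city", var_name="year", value_name="temp")
tidy["year"] = tidy["year"].str.replace("temp_", "", regex=False).astype(int)
```

After the melt, `year` is a real column you can filter, group by, or feed to a model, instead of a naming convention.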
2) Fundamental data types (and what they imply)
Here's the cheat-sheet you will actually read during a 2am preprocessing session.
| Type | Machine meaning | Common ML treatment | Example |
|---|---|---|---|
| Numeric — Integer | Discrete numbers | use as-is or scale; think counts | number_of_visits = 3 |
| Numeric — Float | Continuous values | scale/normalize; can have many decimals | price = 19.99 |
| Categorical (Nominal) | Categories without order | one-hot or target encoding | color = {red, blue} |
| Ordinal | Categories with order | map to integers or monotonic encoding | size = {small<med<large} |
| Boolean | True/False | convert to 0/1 | is_subscribed = True |
| Datetime | Timestamp | extract features (year, month, weekday, elapsed) | event_time = 2021-07-01 12:00 |
| Text | Free-form string | NLP pipelines, embeddings, feature extraction | review_text = "meh" |
| Mixed / Object | Mixed types inside a column | clean and split into proper types | misc = "12kg" or "N/A" |
Practical checklist: Make your dataset behave (step-by-step)
- Inspect dtypes immediately
import pandas as pd
df.dtypes
- Enforce tidy structure
- Pivot wide → long for repeated measures
- Split combined columns ("age_height")
- Explode list-like cells into rows
- Convert intended categories to categorical dtype
- In pandas: df['region'] = df['region'].astype('category')
- Benefits: memory, speed, meaningful factor levels
- Handle datetimes properly
- Parse to datetime and then extract features:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
- Deal with missingness deliberately
- Missingness is information. Flag it, don't bury it.
- Create is_missing indicators for important features.
- Normalize/scale numerical features when needed
- Regression often benefits from scaling (especially regularized models).
- Lock preprocessing for reproducibility
- Fit encoders/scalers on training only, save them. Reuse on test/production.
- Document and persist dtype expectations.
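The checklist above can be sketched end-to-end on a toy frame (column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame exhibiting the checklist's problems (hypothetical columns)
df = pd.DataFrame({
    "region": ["north", "south", "north", None],
    "signup": ["2024-01-05", "2024-02-11", "2024-03-02", "2024-03-20"],
    "income": [52000.0, np.nan, 61000.0, 48000.0],
})

# Convert intended categories to categorical dtype
df["region"] = df["region"].astype("category")

# Parse datetimes, then extract features
df["signup"] = pd.to_datetime(df["signup"])
df["signup_month"] = df["signup"].dt.month

# Flag missingness before imputing -- the flag itself can be predictive
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

# Fit scaling statistics on training rows only, then reuse everywhere
train = df.iloc[:3]
mu, sigma = train["income"].mean(), train["income"].std()
df["income_scaled"] = (df["income"] - mu) / sigma
```

Note the order: flag missingness before imputing, and compute `mu`/`sigma` from the training slice only, never the full frame.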
Encoding: one-hot vs ordinal vs target — the soap opera
- One-hot: safe for nominal categories; beware dimensionality explosion.
- Ordinal encoding: only for ordered categories (e.g., education level).
- Target/Mean encoding: powerful for high-cardinality features, but leaks if you don't use proper CV/fitting strategy.
Question time: why do people keep getting this wrong? Because they treat encoders as interchangeable switches. Encoding is a modeling decision with bias/variance implications.
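A quick sketch of the first two encoders on an invented two-column frame, matching the semantic types above:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],       # nominal: no order
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
})

# One-hot for nominal categories: one indicator column per level
encoded = pd.get_dummies(df, columns=["color"], prefix="color")

# Explicit integer mapping for ordinal categories preserves the order
size_order = {"small": 0, "medium": 1, "large": 2}
encoded["size"] = encoded["size"].map(size_order)
```

Writing the ordinal mapping out by hand looks tedious, but it is exactly what keeps "small < medium < large" from being shuffled by an alphabetical default.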
Examples & mini case study
Imagine a dataset for predicting house prices (regression). Columns:
- id (drop at model time)
- sale_price (target)
- postcode (categorical, high-cardinality)
- num_bedrooms (int)
- built_year (int → derive age)
- sale_date (datetime)
- notes (text)
Tidy transformation plan:
- Convert postcode → category, consider target encoding with CV because many unique postcodes
- Create house_age = sale_date.year - built_year
- Extract season or month from sale_date
- Clean notes → sentiment score or drop if noisy
- Ensure sale_price stays as float for regression
Mini code snippet:
# enforce dtypes
df['postcode'] = df['postcode'].astype('category')
df['sale_date'] = pd.to_datetime(df['sale_date'])
df['house_age'] = df['sale_date'].dt.year - df['built_year']
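For the high-cardinality postcode, the plan calls for target encoding with CV. One leakage-safe way to do that is out-of-fold mean encoding; here's a sketch on synthetic data (the three-way split and the column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "postcode": rng.choice(["A", "B", "C"], size=12),
    "sale_price": rng.normal(300_000, 50_000, size=12),
})

# Out-of-fold target encoding: each row's encoded value is computed from
# the OTHER folds, so the row's own target never leaks into its feature
df["postcode_te"] = np.nan
folds = np.array_split(df.index.to_numpy(), 3)
for fold in folds:
    rest = df.drop(index=fold)
    means = rest.groupby("postcode")["sale_price"].mean()
    global_mean = rest["sale_price"].mean()
    df.loc[fold, "postcode_te"] = (
        df.loc[fold, "postcode"].map(means).fillna(global_mean)
    )
```

The `fillna(global_mean)` fallback handles postcodes that never appear outside the current fold, which is exactly the situation rare postcodes put you in.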
Memory and performance hacks (because your laptop is not a cloud instance)
- Use categorical dtype for repeated text fields.
- Downcast integers/floats when appropriate: df[col] = pd.to_numeric(df[col], downcast='integer')
- Only keep necessary columns when iterating or joining.
Small wins add up: smaller memory = faster cross-validation = more experiments before you cry.
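The two memory hacks above, measured on a synthetic frame (sizes chosen arbitrarily):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "visits": np.arange(100_000, dtype="int64"),
    "city": ["Oslo", "Lima"] * 50_000,
})

before = df.memory_usage(deep=True).sum()

# Downcast the integer column and make the repeated strings categorical
df["visits"] = pd.to_numeric(df["visits"], downcast="integer")
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()
```

Here `visits` drops from int64 to int32 (its max fits), and `city` shrinks from 100,000 stored strings to 100,000 small integer codes plus two labels.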
Reproducibility tie-in (a friendly reminder)
You learned to set seeds for stochastic processes. Apply the same discipline to preprocessing:
- Save fitted scalers/encoders as artifacts.
- Document dtype expectations in a schema (e.g., Great Expectations, pandera).
- If your train/test split depends on time or groups, make type-aware decisions — e.g., date parsing must be deterministic.
This is how you ensure your 85% accuracy isn't a hallucination caused by inconsistent preprocessing.
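A minimal sketch of "save fitted artifacts, reuse on test" using only the standard library's `pickle` (the region column and file name are invented). The artifact here is the categorical dtype fitted on train:

```python
import os
import pickle
import tempfile

import pandas as pd

train = pd.DataFrame({"region": ["north", "south", "north"]})

# "Fit" on training data only: record the category levels seen in train
fitted_dtype = pd.CategoricalDtype(train["region"].unique())

# Persist the fitted artifact so test/production reuse the exact levels
path = os.path.join(tempfile.mkdtemp(), "region_dtype.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted_dtype, f)

# Later, possibly in another process: load and apply.
# Levels never seen in training become NaN instead of silently new codes.
with open(path, "rb") as f:
    loaded = pickle.load(f)
test = pd.DataFrame({"region": ["south", "east"]})
test["region"] = test["region"].astype(loaded)
```

That NaN for the unseen `"east"` is a feature, not a bug: it surfaces a train/production mismatch instead of hiding it.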
Common pitfalls (and how to dodge them)
- Treating IDs as features — they leak everything and nothing simultaneously. Drop or transform.
- Encoding before splitting — target leakage central casting.
- Leaving mixed types in an object column — models will throw tantrums.
- Ignoring datetime timezone issues — results shift subtly across regions.
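The "encoding before splitting" pitfall in four lines, with made-up numbers where the test split contains an outlier:

```python
import pandas as pd

df = pd.DataFrame({"income": [40.0, 50.0, 60.0, 1000.0]})

# Split FIRST (a simple positional split for illustration)
train, test = df.iloc[:3], df.iloc[3:]

# Wrong: statistics computed on the full frame leak the test outlier
leaky_mean = df["income"].mean()  # 287.5, dominated by the test row

# Right: fit statistics on train only, then apply to test
mu, sigma = train["income"].mean(), train["income"].std()
test_scaled = (test["income"] - mu) / sigma
```

Same logic applies to encoders, imputers, and feature selectors: anything "fitted" must see only the training rows.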
Closing: The tiny rituals that keep models honest
- Always visualize dtypes and sample rows after major transformations.
- Automate dtype enforcement in your pipeline (use schemas).
- Treat tidy structure as a contract: features in columns, observations in rows, target separate.
Key takeaways:
- Tidy data + correct types = faster experiments + fewer silent errors.
- Encoding is a modeling choice — match the encoder to the semantic type.
- Persist preprocessing for reproducibility — don't invent new pipelines for train/test.
Final thought (dramatic): If modeling is rock climbing, data wrangling is setting the rope anchors. Do it sloppily and the climb becomes an expensive lecture on gravity.
Version note: builds on your understanding of labels (where the target lives) and reproducibility (where deterministic preprocessing matters). Next stop: feature generation — turning tidy columns into model-winning insights.