Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Data Types and Tidy Structure — Marie Kondo for Your Features
"If your dataset doesn't spark clarity, it sparks bugs." — Probably me, 3am, debugging a model
You're already fluent in the language of labels and supervision (how we frame regression vs classification) and you know to lock down randomness with seeds for reproducibility. Now we get to the boring-but-heroic work: making your data behave. This is where models get their diet and your results stop looking like interpretive dance.
Why this matters (without lecturing you again on bias-variance)
A messy dataset will: slow training, sabotage cross-validation, give you subtly wrong feature importances, and make reproducibility a lie. Tidy structure + correct types = predictable preprocessing pipelines, easier feature engineering, and models that actually generalize instead of memorizing your spreadsheet's eccentricities.
Think of data types and tidy structure as the plumbing of ML. If the plumbing is good, you won't drown in bugs when you add fancy fixtures (regularization, ensembles, neural nets). If the pipes are trash, nothing else matters.
Core concepts, quick and caffeinated
1) Tidy data principle (Hadley Wickham's gospel)
- Each variable is a column. (Features and target are columns.)
- Each observation is a row. (One row = one example.)
- Each value is a single cell. (No lists, no comma-separated horrors.)
Imagine a party: tidy data is everyone standing in single-file, name tags on, ready for a headcount. Messy data is people piled on couches, one person holding three name tags.
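To make the single-file-party picture concrete, here's a minimal sketch (with made-up city/temperature values) of untidying a "wide" frame where one variable, the year, is smeared across column names:

```python
import pandas as pd

# Messy: one variable (year) is spread across the column names
wide = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "temp_2022": [5.1, 19.3],
    "temp_2023": [5.4, 19.8],
})

# Tidy: each variable gets its own column, each observation its own row
tidy = wide.melt(id_vars="city", var_name="year", value_name="temp")
tidy["year"] = tidy["year"].str.replace("temp_", "", regex=False).astype(int)
```

After the melt, `year` is a real column you can filter, group by, or feed to a model, instead of a naming convention.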
2) Fundamental data types (and what they imply)
Here's the cheat-sheet you will actually read during a 2am preprocessing session.
| Type | Machine meaning | Common ML treatment | Example |
|---|---|---|---|
| Numeric — Integer | Discrete numbers | use as-is or scale; think counts | number_of_visits = 3 |
| Numeric — Float | Continuous values | scale/normalize; can have many decimals | price = 19.99 |
| Categorical (Nominal) | Categories without order | one-hot or target encoding | color = {red, blue} |
| Ordinal | Categories with order | map to integers or monotonic encoding | size = {small<med<large} |
| Boolean | True/False | convert to 0/1 | is_subscribed = True |
| Datetime | Timestamp | extract features (year, month, weekday, elapsed) | event_time = 2021-07-01 12:00 |
| Text | Free-form string | NLP pipelines, embeddings, feature extraction | review_text = "meh" |
| Mixed / Object | Mixed types inside a column | clean and split into proper types | misc = "12kg" or "N/A" |
Practical checklist: Make your dataset behave (step-by-step)
- Inspect dtypes immediately
import pandas as pd
df.dtypes
- Enforce tidy structure
- Pivot wide → long for repeated measures
- Split combined columns ("age_height")
- Explode list-like cells into rows
- Convert intended categories to categorical dtype
- In pandas: df['region'] = df['region'].astype('category')
- Benefits: memory, speed, meaningful factor levels
- Handle datetimes properly
- Parse to datetime and then extract features:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
- Deal with missingness deliberately
- Missingness is information. Flag it, don't bury it.
- Create is_missing indicators for important features.
- Normalize/scale numerical features when needed
- Regression often benefits from scaling (especially regularized models).
- Lock preprocessing for reproducibility
- Fit encoders/scalers on training only, save them. Reuse on test/production.
- Document and persist dtype expectations.
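The checklist above can be sketched end-to-end on a toy frame (column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame exhibiting the checklist's problems (hypothetical columns)
df = pd.DataFrame({
    "region": ["north", "south", "north", None],
    "signup": ["2024-01-05", "2024-02-11", "2024-03-02", "2024-03-20"],
    "income": [52000.0, np.nan, 61000.0, 48000.0],
})

# Convert intended categories to categorical dtype
df["region"] = df["region"].astype("category")

# Parse datetimes, then extract features
df["signup"] = pd.to_datetime(df["signup"])
df["signup_month"] = df["signup"].dt.month

# Flag missingness before imputing -- the flag itself can be predictive
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

# Fit scaling statistics on training rows only, then reuse everywhere
train = df.iloc[:3]
mu, sigma = train["income"].mean(), train["income"].std()
df["income_scaled"] = (df["income"] - mu) / sigma
```

Note the order: flag missingness before imputing, and compute `mu`/`sigma` from the training slice only, never the full frame.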
Encoding: one-hot vs ordinal vs target — the soap opera
- One-hot: safe for nominal categories; beware dimensionality explosion.
- Ordinal encoding: only for ordered categories (e.g., education level).
- Target/Mean encoding: powerful for high-cardinality features, but leaks if you don't use proper CV/fitting strategy.
Question time: why do people keep getting this wrong? Because they treat encoders as interchangeable switches. Encoding is a modeling decision with bias/variance implications.
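A quick sketch of the first two encoders on an invented two-column frame, matching the semantic types above:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],       # nominal: no order
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
})

# One-hot for nominal categories: one indicator column per level
encoded = pd.get_dummies(df, columns=["color"], prefix="color")

# Explicit integer mapping for ordinal categories preserves the order
size_order = {"small": 0, "medium": 1, "large": 2}
encoded["size"] = encoded["size"].map(size_order)
```

Writing the ordinal mapping out by hand looks tedious, but it is exactly what keeps "small < medium < large" from being shuffled by an alphabetical default.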
Examples & mini case study
Imagine a dataset for predicting house prices (regression). Columns:
- id (drop at model time)
- sale_price (target)
- postcode (categorical, high-cardinality)
- num_bedrooms (int)
- built_year (int → derive age)
- sale_date (datetime)
- notes (text)
Tidy transformation plan:
- Convert postcode → category, consider target encoding with CV because many unique postcodes
- Create house_age = sale_date.year - built_year
- Extract season or month from sale_date
- Clean notes → sentiment score or drop if noisy
- Ensure sale_price stays as float for regression
Mini code snippet:
# enforce dtypes
df['postcode'] = df['postcode'].astype('category')
df['sale_date'] = pd.to_datetime(df['sale_date'])
df['house_age'] = df['sale_date'].dt.year - df['built_year']
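For the high-cardinality postcode, the plan calls for target encoding with CV. One leakage-safe way to do that is out-of-fold mean encoding; here's a sketch on synthetic data (the three-way split and the column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "postcode": rng.choice(["A", "B", "C"], size=12),
    "sale_price": rng.normal(300_000, 50_000, size=12),
})

# Out-of-fold target encoding: each row's encoded value is computed from
# the OTHER folds, so the row's own target never leaks into its feature
df["postcode_te"] = np.nan
folds = np.array_split(df.index.to_numpy(), 3)
for fold in folds:
    rest = df.drop(index=fold)
    means = rest.groupby("postcode")["sale_price"].mean()
    global_mean = rest["sale_price"].mean()
    df.loc[fold, "postcode_te"] = (
        df.loc[fold, "postcode"].map(means).fillna(global_mean)
    )
```

The `fillna(global_mean)` fallback handles postcodes that never appear outside the current fold, which is exactly the situation rare postcodes put you in.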
Memory and performance hacks (because your laptop is not a cloud instance)
- Use categorical dtype for repeated text fields.
- Downcast integers/floats when appropriate: df[col] = pd.to_numeric(df[col], downcast='integer')
- Only keep necessary columns when iterating or joining.
Small wins add up: smaller memory = faster cross-validation = more experiments before you cry.
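The two memory hacks above, measured on a synthetic frame (sizes chosen arbitrarily):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "visits": np.arange(100_000, dtype="int64"),
    "city": ["Oslo", "Lima"] * 50_000,
})

before = df.memory_usage(deep=True).sum()

# Downcast the integer column and make the repeated strings categorical
df["visits"] = pd.to_numeric(df["visits"], downcast="integer")
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()
```

Here `visits` drops from int64 to int32 (its max fits), and `city` shrinks from 100,000 stored strings to 100,000 small integer codes plus two labels.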
Reproducibility tie-in (a friendly reminder)
You learned to set seeds for stochastic processes. Apply the same discipline to preprocessing:
- Save fitted scalers/encoders as artifacts.
- Document dtype expectations in a schema (e.g., Great Expectations, pandera).
- If your train/test split depends on time or groups, make type-aware decisions — e.g., date parsing must be deterministic.
This is how you ensure your 85% accuracy isn't a hallucination caused by inconsistent preprocessing.
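A minimal sketch of "save fitted artifacts, reuse on test" using only the standard library's `pickle` (the region column and file name are invented). The artifact here is the categorical dtype fitted on train:

```python
import os
import pickle
import tempfile

import pandas as pd

train = pd.DataFrame({"region": ["north", "south", "north"]})

# "Fit" on training data only: record the category levels seen in train
fitted_dtype = pd.CategoricalDtype(train["region"].unique())

# Persist the fitted artifact so test/production reuse the exact levels
path = os.path.join(tempfile.mkdtemp(), "region_dtype.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted_dtype, f)

# Later, possibly in another process: load and apply.
# Levels never seen in training become NaN instead of silently new codes.
with open(path, "rb") as f:
    loaded = pickle.load(f)
test = pd.DataFrame({"region": ["south", "east"]})
test["region"] = test["region"].astype(loaded)
```

That NaN for the unseen `"east"` is a feature, not a bug: it surfaces a train/production mismatch instead of hiding it.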
Common pitfalls (and how to dodge them)
- Treating IDs as features — they leak everything and nothing simultaneously. Drop or transform.
- Encoding before splitting — target leakage central casting.
- Leaving mixed types in an object column — models will throw tantrums.
- Ignoring datetime timezone issues — results shift subtly across regions.
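The "encoding before splitting" pitfall in four lines, with made-up numbers where the test split contains an outlier:

```python
import pandas as pd

df = pd.DataFrame({"income": [40.0, 50.0, 60.0, 1000.0]})

# Split FIRST (a simple positional split for illustration)
train, test = df.iloc[:3], df.iloc[3:]

# Wrong: statistics computed on the full frame leak the test outlier
leaky_mean = df["income"].mean()  # 287.5, dominated by the test row

# Right: fit statistics on train only, then apply to test
mu, sigma = train["income"].mean(), train["income"].std()
test_scaled = (test["income"] - mu) / sigma
```

Same logic applies to encoders, imputers, and feature selectors: anything "fitted" must see only the training rows.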
Closing: The tiny rituals that keep models honest
- Always visualize dtypes and sample rows after major transformations.
- Automate dtype enforcement in your pipeline (use schemas).
- Treat tidy structure as a contract: features in columns, observations in rows, target separate.
Key takeaways:
- Tidy data + correct types = faster experiments + fewer silent errors.
- Encoding is a modeling choice — match the encoder to the semantic type.
- Persist preprocessing for reproducibility — don't invent new pipelines for train/test.
Final thought (dramatic): If modeling is rock climbing, data wrangling is setting the rope anchors. Do it sloppily and the climb becomes an expensive lecture on gravity.
Version note: builds on your understanding of labels (where the target lives) and reproducibility (where deterministic preprocessing matters). Next stop: feature generation — turning tidy columns into model-winning insights.