Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Inputs, Targets, and Hypothesis Space
You already know the difference between supervised, unsupervised, and reinforcement learning. Good. Now let’s get our hands dirty with the nuts and bolts that actually make supervised learning work: the inputs (features), the targets (labels), and the hypothesis space (the universe of functions our algorithm is allowed to consider).
Why this matters (quick reminder)
You learned earlier that supervised learning is about learning a mapping from observations to outcomes using labeled examples. That statement hides three huge questions we now unpack:
- What are we observing? (Inputs)
- What are we predicting? (Targets)
- What kind of mappings are we allowed to consider? (Hypothesis space)
Get these three wrong (or sloppy) and you’ll get models that are confused, overconfident, or quietly useless.
1) Inputs (aka features, covariates, X)
Definition: The input space is the set of all possible observations we feed into our model. Usually denoted X (uppercase) for the space, and x (lowercase) for a single example.
- Typical forms: numeric vectors, images, text, categorical variables, time series.
- Practical issues: missing values, scaling, encoding, correlated features, and feature engineering.
Analogy: Inputs are the ingredients. If you give the chef rotten avocados, you can’t expect a Michelin-level guacamole no matter how skilled the chef is.
Questions to ask about inputs:
- Is each feature meaningful for the task? (garbage in → garbage out)
- Are features on wildly different scales? (standardize/normalize)
- Do I need to create new features? (polynomials, interactions)
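These checks translate directly into code. Below is a minimal sketch of standardizing two features that live on wildly different scales, plus engineering an interaction feature; the income/age numbers are invented for illustration:

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale a feature to zero mean and unit (sample) variance — z-scores."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def add_interaction(x1, x2):
    """Engineer a new feature as the elementwise product of two features."""
    return [a * b for a, b in zip(x1, x2)]

# Two features on very different scales: income (dollars) and age (years).
income = [30_000, 60_000, 90_000]
age = [25, 35, 45]

z_income = standardize(income)  # [-1.0, 0.0, 1.0]
z_age = standardize(age)        # [-1.0, 0.0, 1.0] — now comparable in scale
interaction = add_interaction(z_income, z_age)
```

After standardizing, both features contribute on the same scale, which matters for distance-based and gradient-based learners alike.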
2) Targets (aka labels, y)
Definition: The target is the quantity we want to predict. It lives in the output space Y. Usually denoted y.
Types of targets:
- Regression: Continuous y (house price, temperature)
- Classification: Discrete y (spam/not spam, dog breed)
- Structured outputs: Sequences, images, graphs (harder, but still supervised)
Important subtleties:
- Label noise: humans make mistakes. Your model might just learn human inconsistency.
- Imbalanced classes: if 99% of examples are class A, accuracy becomes a liar. Use precision/recall, AUC, or resampling.
Analogy: Targets are the recipe you aim to cook. If the recipe says cake but you actually want cookies, your chef will comply but you’ll be miserable.
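The imbalanced-classes point above is easy to demonstrate. In this toy sketch (the 99-to-1 split is made up for illustration), a model that always predicts the majority class scores 99% accuracy while being useless on the class you care about:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# 99 negatives, 1 positive; the "model" always predicts negative.
y_true = [0] * 99 + [1]
y_pred = [0] * 100

print(accuracy(y_true, y_pred))          # 0.99 — looks great
print(precision_recall(y_true, y_pred))  # (0.0, 0.0) — useless on the minority class
```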
3) Hypothesis Space (aka hypothesis class)
Definition: The hypothesis space H is the set of functions f : X → Y that our learning algorithm can pick from. When we say we’re “training a model,” we’re searching H for the best f according to some loss on the data.
Notation example:
H = { f_theta(x) : theta in Theta }
This means: our hypotheses are parameterized by some theta values in parameter space Theta.
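As a concrete (illustrative) instance of this notation, here is the linear hypothesis class f_theta(x) = w*x + b in one dimension, where theta = (w, b) indexes a member of H:

```python
def make_hypothesis(w, b):
    """One member of H = { f_theta(x) = w*x + b : theta = (w, b) }."""
    return lambda x: w * x + b

f1 = make_hypothesis(2.0, 1.0)   # theta = (2, 1)
f2 = make_hypothesis(-1.0, 0.0)  # theta = (-1, 0)

print(f1(3.0))  # 7.0
print(f2(3.0))  # -3.0
```

"Training" then means searching over (w, b) for the pair whose function best fits the data.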
Why it’s the real star (and villain)
- If H is too small (low capacity), no function in H fits the true relationship → underfitting.
- If H is huge (high capacity), you're flexible enough to fit noise → overfitting.
This tradeoff is the backbone of model selection.
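One way to see the tradeoff concretely: compare a constant predictor (a tiny H) against 1-nearest-neighbour (an effectively huge H that can memorize anything) on toy data containing one noisy point. The data below is invented for illustration:

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy data: y = x, except one noisy training point at x = 3.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 9.0)]  # last point is noise
test  = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]

# Low capacity: H = constant functions -> underfits everything.
c = sum(y for _, y in train) / len(train)

# High capacity: 1-nearest-neighbour -> memorizes the training set, noise included.
def nn_pred(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

train_mse = mse([y for _, y in train], [nn_pred(x) for x, _ in train])
test_mse = mse([y for _, y in test], [nn_pred(x) for x, _ in test])
print(train_mse)  # 0.0 — perfect on training data
print(test_mse)   # much larger — it faithfully reproduced the noise
```

Zero training error with large test error is the classic overfitting signature: the flexible hypothesis fit the noisy point as if it were signal.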
Common hypothesis spaces
| Hypothesis class | Typical representation | Capacity | When it's useful |
|---|---|---|---|
| Linear models | f(x)=w^T x + b | Low | When relationships are roughly linear; interpretable |
| Decision trees | Tree of splits | Medium | Nonlinear interactions; tabular data |
| k-NN | Instance memory + distance | Variable (grows with data) | Simple, nonparametric, sensitive to noise |
| Neural networks | Layered nonlinearities | High | Complex patterns (images, audio), big data |
A rule of thumb: pick the simplest H that can express the patterns you need.
Hypothesis space, loss, and learning — the holy trinity
Learning = searching H to (approximately) minimize the expected loss E[L(y, f(x))]. Since we only ever see a finite sample, in practice we minimize empirical loss plus regularization:
f_hat = argmin_{f in H} (1/n) sum_i L(y_i, f(x_i)) + lambda * R(f)
- The loss (e.g., MSE, cross-entropy) ties hypotheses to what we care about.
- Regularization (R) restricts effective hypothesis complexity (penalize big weights, tree depth, etc.).
Think of regularization as a leash: your hypothesis class might be a caffeinated greyhound, and regularization is the sensible owner holding the leash so the dog doesn’t sprint after every squirrel (noise) it sees.
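The leash can be shown in closed form for a one-parameter class. Take f(x) = w*x with squared loss and R(f) = w^2; setting the derivative of the objective to zero gives w = sum(x*y) / (sum(x^2) + n*lambda). This is a sketch on toy data, not a general-purpose implementation:

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of (1/n) * sum (y - w*x)^2 + lam * w^2."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]             # true slope is 2

print(ridge_slope(xs, ys, 0.0))  # 2.0 — no leash, exact fit
print(ridge_slope(xs, ys, 1.0))  # < 2.0 — the penalty shrinks the weight toward 0
```

Larger lambda pulls the learned weight further toward zero: the effective hypothesis space shrinks even though the nominal one is unchanged.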
Inductive bias and why we need it
Every learning algorithm carries assumptions — inductive bias. Without bias, learning is impossible (no free lunch theorem says so). Examples:
- Linear models assume linearity.
- k-NN assumes similar inputs → similar outputs (locality).
- Neural nets assume compositional hierarchical features.
Bias vs. variance: the joke that’s also a theorem. High-bias models underfit; high-variance models overfit. Good learning finds the sweet spot.
Practical checklist — before you start training
- Define input space X clearly (raw features, transformations).
- Define target space Y and proper evaluation metric (accuracy, RMSE, F1...).
- Choose an initial H that matches problem complexity.
- Think about regularization and validation (cross-validation).
- Ask: is the data representative of the world you’ll use the model in? If not, no hypothesis in H will fix that.
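The validation item in the checklist can be made concrete with a minimal k-fold split helper (a sketch, not a production implementation — real code should also shuffle):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

for train_idx, val_idx in k_fold_indices(6, 3):
    print(train_idx, val_idx)
```

Every example lands in exactly one validation fold, so each data point is held out exactly once across the k rounds.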
Tiny code sketch: what learning looks like

```python
def fit(data, H, L, R, lam):
    """Given data [(x, y), ...], hypothesis class H, loss L, regularizer R:
    return the f in H minimizing empirical loss + lam * R(f)."""
    def objective(f):  # in practice we optimize parameters rather than enumerate H
        empirical_loss = sum(L(y, f(x)) for x, y in data) / len(data)
        return empirical_loss + lam * R(f)
    return min(H, key=objective)  # f_hat
```
Closing — big picture and takeaways
- Inputs are your ingredients. Clean them. Engineer them. Respect them.
- Targets are what you’re trying to bake. Make sure the recipe is correct and measurable.
- Hypothesis space is the kitchen rules: what tools and recipes your algorithm can use. Too small and you starve; too big and you eat the entire pantry and regret everything.
Final mic-drop insight:
The model you get is not just a product of the data — it's the interaction of data, the hypothesis space you pick, the loss you care about, and the inductive biases you accept. Tweak any one of these and the learned function changes. Treat them all like they’re alive.
Next up: we’ll see how specific choices of hypothesis class (linear vs tree vs neural net) behave on real data — and why sometimes a humble linear model beats a flashy deep net. Spoiler: it’s not about glamour; it’s about fit.