Foundations of Supervised Learning
Core concepts, goals, trade-offs, and terminology that underpin regression and classification.
Inputs, Targets, and Hypothesis Space
You already know the difference between supervised, unsupervised, and reinforcement learning. Good. Now let’s get our hands dirty with the nuts and bolts that actually make supervised learning work: the inputs (features), the targets (labels), and the hypothesis space (the universe of functions our algorithm is allowed to consider).
Why this matters (quick reminder)
You learned earlier that supervised learning is about learning a mapping from observations to outcomes using labeled examples. That statement hides three huge questions we now unpack:
- What are we observing? (Inputs)
- What are we predicting? (Targets)
- What kind of mappings are we allowed to consider? (Hypothesis space)
Get these three wrong (or sloppy) and you’ll get models that are confused, overconfident, or quietly useless.
1) Inputs (aka features, covariates, X)
Definition: The input space is the set of all possible observations we feed into our model. Usually denoted X (uppercase) for the space, and x (lowercase) for a single example.
- Typical forms: numeric vectors, images, text, categorical variables, time series.
- Practical issues: missing values, scaling, encoding, correlated features, and feature engineering.
Analogy: Inputs are the ingredients. If you give the chef rotten avocados, you can’t expect a Michelin-level guacamole no matter how skilled the chef is.
Questions to ask about inputs:
- Is each feature meaningful for the task? (garbage in → garbage out)
- Are features on wildly different scales? (standardize/normalize)
- Do I need to create new features? (polynomials, interactions)
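These checks translate directly into code. Below is a minimal sketch of standardizing two features that live on wildly different scales, plus engineering an interaction feature; the income/age numbers are invented for illustration:

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale a feature to zero mean and unit (sample) variance — z-scores."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def add_interaction(x1, x2):
    """Engineer a new feature as the elementwise product of two features."""
    return [a * b for a, b in zip(x1, x2)]

# Two features on very different scales: income (dollars) and age (years).
income = [30_000, 60_000, 90_000]
age = [25, 35, 45]

z_income = standardize(income)  # [-1.0, 0.0, 1.0]
z_age = standardize(age)        # [-1.0, 0.0, 1.0] — now comparable in scale
interaction = add_interaction(z_income, z_age)
```

After standardizing, both features contribute on the same scale, which matters for distance-based and gradient-based learners alike.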
2) Targets (aka labels, y)
Definition: The target is the quantity we want to predict. It lives in the output space Y. Usually denoted y.
Types of targets:
- Regression: Continuous y (house price, temperature)
- Classification: Discrete y (spam/not spam, dog breed)
- Structured outputs: Sequences, images, graphs (harder, but still supervised)
Important subtleties:
- Label noise: humans make mistakes. Your model might just learn human inconsistency.
- Imbalanced classes: if 99% of examples are class A, accuracy becomes a liar. Use precision/recall, AUC, or resampling.
Analogy: Targets are the recipe you aim to cook. If the recipe says cake but you actually want cookies, your chef will comply but you’ll be miserable.
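The imbalanced-classes point above is easy to demonstrate. In this toy sketch (the 99-to-1 split is made up for illustration), a model that always predicts the majority class scores 99% accuracy while being useless on the class you care about:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# 99 negatives, 1 positive; the "model" always predicts negative.
y_true = [0] * 99 + [1]
y_pred = [0] * 100

print(accuracy(y_true, y_pred))          # 0.99 — looks great
print(precision_recall(y_true, y_pred))  # (0.0, 0.0) — useless on the minority class
```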
3) Hypothesis Space (aka hypothesis class)
Definition: The hypothesis space H is the set of functions f : X → Y that our learning algorithm can pick from. When we say we’re “training a model,” we’re searching H for the best f according to some loss on the data.
Notation example:
H = { f_theta(x) : theta in Theta }
This means: our hypotheses are parameterized by some theta values in parameter space Theta.
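As a concrete (illustrative) instance of this notation, here is the linear hypothesis class f_theta(x) = w*x + b in one dimension, where theta = (w, b) indexes a member of H:

```python
def make_hypothesis(w, b):
    """One member of H = { f_theta(x) = w*x + b : theta = (w, b) }."""
    return lambda x: w * x + b

f1 = make_hypothesis(2.0, 1.0)   # theta = (2, 1)
f2 = make_hypothesis(-1.0, 0.0)  # theta = (-1, 0)

print(f1(3.0))  # 7.0
print(f2(3.0))  # -3.0
```

"Training" then means searching over (w, b) for the pair whose function best fits the data.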
Why it’s the real star (and villain)
- If H is too small (low capacity), no function in H fits the true relationship → underfitting.
- If H is huge (high capacity), you're flexible enough to fit noise → overfitting.
This tradeoff is the backbone of model selection.
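One way to see the tradeoff concretely: compare a constant predictor (a tiny H) against 1-nearest-neighbour (an effectively huge H that can memorize anything) on toy data containing one noisy point. The data below is invented for illustration:

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy data: y = x, except one noisy training point at x = 3.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 9.0)]  # last point is noise
test  = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]

# Low capacity: H = constant functions -> underfits everything.
c = sum(y for _, y in train) / len(train)

# High capacity: 1-nearest-neighbour -> memorizes the training set, noise included.
def nn_pred(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

train_mse = mse([y for _, y in train], [nn_pred(x) for x, _ in train])
test_mse = mse([y for _, y in test], [nn_pred(x) for x, _ in test])
print(train_mse)  # 0.0 — perfect on training data
print(test_mse)   # much larger — it faithfully reproduced the noise
```

Zero training error with large test error is the classic overfitting signature: the flexible hypothesis fit the noisy point as if it were signal.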
Common hypothesis spaces
| Hypothesis class | Typical representation | Capacity | When it's useful |
|---|---|---|---|
| Linear models | f(x)=w^T x + b | Low | When relationships are roughly linear; interpretable |
| Decision trees | Tree of splits | Medium | Nonlinear interactions; tabular data |
| k-NN | Instance memory + distance | Variable (grows with data) | Simple, nonparametric, sensitive to noise |
| Neural networks | Layered nonlinearities | High | Complex patterns (images, audio), big data |
A rule of thumb: pick the simplest H that can express the patterns you need.
Hypothesis space, loss, and learning — the holy trinity
Learning = searching H to (approximately) minimize the expected loss E[L(y, f(x))]. Since we only ever see a finite sample, in practice we minimize empirical loss plus regularization:
f_hat = argmin_{f in H} (1/n) sum_i L(y_i, f(x_i)) + lambda * R(f)
- The loss (e.g., MSE, cross-entropy) ties hypotheses to what we care about.
- Regularization (R) restricts effective hypothesis complexity (penalize big weights, tree depth, etc.).
Think of regularization as a leash: your hypothesis class might be a caffeinated greyhound, and regularization is the sensible owner holding the leash so the dog doesn’t sprint after every squirrel (noise) it sees.
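The leash can be shown in closed form for a one-parameter class. Take f(x) = w*x with squared loss and R(f) = w^2; setting the derivative of the objective to zero gives w = sum(x*y) / (sum(x^2) + n*lambda). This is a sketch on toy data, not a general-purpose implementation:

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of (1/n) * sum (y - w*x)^2 + lam * w^2."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]             # true slope is 2

print(ridge_slope(xs, ys, 0.0))  # 2.0 — no leash, exact fit
print(ridge_slope(xs, ys, 1.0))  # < 2.0 — the penalty shrinks the weight toward 0
```

Larger lambda pulls the learned weight further toward zero: the effective hypothesis space shrinks even though the nominal one is unchanged.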
Inductive bias and why we need it
Every learning algorithm carries assumptions — inductive bias. Without bias, learning is impossible (no free lunch theorem says so). Examples:
- Linear models assume linearity.
- k-NN assumes similar inputs → similar outputs (locality).
- Neural nets assume compositional hierarchical features.
Bias vs. variance: the joke that’s also a theorem. High-bias models underfit; high-variance models overfit. Good learning finds the sweet spot.
Practical checklist — before you start training
- Define input space X clearly (raw features, transformations).
- Define target space Y and proper evaluation metric (accuracy, RMSE, F1...).
- Choose an initial H that matches problem complexity.
- Think about regularization and validation (cross-validation).
- Ask: is the data representative of the world you’ll use the model in? If not, no hypothesis in H will fix that.
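The validation item in the checklist can be made concrete with a minimal k-fold split helper (a sketch, not a production implementation — real code should also shuffle):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

for train_idx, val_idx in k_fold_indices(6, 3):
    print(train_idx, val_idx)
```

Every example lands in exactly one validation fold, so each data point is held out exactly once across the k rounds.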
Tiny code sketch: what learning looks like

```python
def fit(data, H, L, R, lam):
    """Given data [(x, y), ...], hypothesis class H, loss L, regularizer R:
    return the f in H minimizing empirical loss + lam * R(f)."""
    def objective(f):  # in practice we optimize parameters rather than enumerate H
        empirical_loss = sum(L(y, f(x)) for x, y in data) / len(data)
        return empirical_loss + lam * R(f)
    return min(H, key=objective)  # f_hat
```
Closing — big picture and takeaways
- Inputs are your ingredients. Clean them. Engineer them. Respect them.
- Targets are what you’re trying to bake. Make sure the recipe is correct and measurable.
- Hypothesis space is the kitchen rules: what tools and recipes your algorithm can use. Too small and you starve; too big and you eat the entire pantry and regret everything.
Final mic-drop insight:
The model you get is not just a product of the data — it's the interaction of data, the hypothesis space you pick, the loss you care about, and the inductive biases you accept. Tweak any one of these and the learned function changes. Treat them all like they’re alive.
Next up: we’ll see how specific choices of hypothesis class (linear vs tree vs neural net) behave on real data — and why sometimes a humble linear model beats a flashy deep net. Spoiler: it’s not about glamour; it’s about fit.