Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Simple Linear Regression Geometry — The Line, The Shadow, The Drama
"If you can see the geometry, you won't forget the algebra." — Someone nerdy and dramatic
Alright, we've already been the boring-but-brilliant caretakers of Train/Validation/Test splits and wrestled with cross-validation shenanigans (yes, especially for imbalanced data and data snooping). Now we're zooming in: what actually happens when you fit a simple linear regression? Spoiler: it's projection, drama, and a little bit of linear-algebra theater.
What this is (fast) — and why it matters
Rather than reheating the resampling lecture, let's use that lens: understanding the geometry of simple linear regression helps you see why some points ruin cross-validation folds, why the model variance behaves the way it does, and why diagnostics like leverage and residual plots aren't optional accessories — they're survival tools.
Simple linear regression = predicting y from one x (and usually an intercept). Algebra says slope = cov(x,y)/var(x). Geometry says: you're projecting the vector y onto the subspace spanned by the predictors and calling that projection your prediction y_hat. That's not metaphor — it's literal.
The core geometric picture
- Consider your data as vectors in R^n (n = number of observations).
- y is an n-dimensional vector of responses.
- The design matrix X (with an intercept) spans a 2-dimensional subspace: the column of ones and the column of x.
- The fitted values y_hat are the orthogonal projection of y onto span(X).
- The residuals e = y - y_hat are orthogonal to that subspace.
In formulas (vector form):
beta_hat = (X^T X)^{-1} X^T y
y_hat = X beta_hat = H y # where H = X (X^T X)^{-1} X^T is the "hat" matrix
e = y - y_hat = (I - H) y
Properties you should memorize (because they're like little factoids that save you from embarrassment at meetings):
- e is orthogonal to every column of X: X^T e = 0.
- Sum of residuals = 0 (when you include an intercept).
- The hat matrix H is symmetric and idempotent (H = H^T, H^2 = H).
- The diagonal elements h_i of H are leverages.
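All four properties are easy to check numerically. A minimal sketch on synthetic data (the data-generating line, noise level, and seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)     # toy "true" line plus noise

X = np.column_stack([np.ones(n), x])        # intercept column + x
H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y
e = y - y_hat

print(np.allclose(H, H.T))       # H is symmetric
print(np.allclose(H @ H, H))     # H is idempotent
print(np.allclose(X.T @ e, 0))   # residuals orthogonal to every column of X
print(np.isclose(e.sum(), 0))    # residuals sum to zero (intercept included)
```

All four checks print True; drop the intercept column and the last one generally stops holding, which is exactly why the "sum of residuals = 0" property is conditional on including it.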
Visual metaphors (because your brain likes pictures)
- Imagine a flashlight (y) shining onto a wall defined by span(X). The shadow on the wall is y_hat. The part of y not in the shadow is e.
- If the wall is a line (no intercept, just x), you're projecting onto a 1D line. If you include an intercept, you're projecting onto a 2D plane in n-space.
Why orthogonality? Because least squares chooses the projection that minimizes squared vertical distances, which — in vector language — is the orthogonal projection minimizing ||y - y_hat||^2.
Sum-of-squares decomposition: the Pythagorean of regression
Because of orthogonality:
- SST = SSR + SSE
Where:
- SST (total sum of squares) = ||y - mean(y)||^2
- SSR (regression sum of squares) = ||y_hat - mean(y)||^2
- SSE (error sum of squares) = ||y - y_hat||^2
And R^2 = SSR / SST. Geometrically, R^2 = cos^2(theta), where theta is the angle between the centered y and the centered y_hat (with an intercept, y_hat has the same mean as y, so centering both is the natural comparison).
So when someone says "R^2 is the squared correlation between y and y_hat" — they're not being poetic; it's literally the cosine-squared of an angle in n-space.
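Both identities — the Pythagorean decomposition and R^2 as a squared cosine — can be verified directly. A sketch on simulated data (the model and seed are illustrative, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)           # total sum of squares
ssr = np.sum((y_hat - y_hat.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)              # error sum of squares

print(np.isclose(sst, ssr + sse))           # Pythagoras: SST = SSR + SSE

r2 = ssr / sst
yc, yhc = y - y.mean(), y_hat - y_hat.mean()
cos_theta = (yc @ yhc) / (np.linalg.norm(yc) * np.linalg.norm(yhc))
print(np.isclose(r2, cos_theta ** 2))                    # R^2 = cos^2(theta)
print(np.isclose(r2, np.corrcoef(y, y_hat)[0, 1] ** 2))  # = corr(y, y_hat)^2
```

All three checks print True, confirming the "literally an angle in n-space" claim.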
Simple regression special-case: correlation and slope
If both x and y are centered (mean zero), then
beta_hat = (x^T y) / (x^T x) = cov(x,y)/var(x)
R^2 = (corr(x,y))^2
So simple linear regression = correlation in action. If corr(x,y) = 0.8, then R^2 = 0.64. Geometry: the projection of the centered y onto the direction of x has length 0.8 times the length of the centered y (cos theta = 0.8), so its squared-length share — R^2 — is 0.64.
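A quick numerical sanity check of both identities on centered data (the toy model is an arbitrary assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.8 * x + 0.6 * rng.normal(size=200)
xc, yc = x - x.mean(), y - y.mean()          # center both variables

beta_hat = (xc @ yc) / (xc @ xc)             # = cov(x, y) / var(x)
print(np.isclose(beta_hat, np.cov(x, y)[0, 1] / np.var(x, ddof=1)))

r = np.corrcoef(x, y)[0, 1]
y_hat = beta_hat * xc
r2 = np.sum(y_hat ** 2) / np.sum(yc ** 2)    # SSR / SST on centered data
print(np.isclose(r2, r ** 2))                # R^2 = corr(x, y)^2
```

Note the (n-1) factors cancel in cov/var, which is why the raw inner-product ratio x^T y / x^T x works without any normalization.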
Leverage and influence — the VIP seats at the regression table
Leverage h_i = diagonal(H)_i. For simple regression with intercept:
h_i = 1/n + (x_i - x_bar)^2 / Sxx
Sxx = sum (x_i - x_bar)^2
- Points with x far from x_bar have high leverage (they stand far out on the x-axis) and can pull the line.
- Influence ~ leverage * (large residual). A high-leverage point with small residual: meh. High leverage plus large residual: catastrophic. That's where Cook's distance lives.
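The closed-form leverage formula above can be checked against diag(H), and a single far-out x makes the "VIP seat" visible. A sketch with one deliberately extreme point (the data and the outlier's position are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = np.append(rng.uniform(0, 1, n - 1), 8.0)   # one point far out on the x-axis
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# closed form for simple regression: h_i = 1/n + (x_i - x_bar)^2 / Sxx
Sxx = np.sum((x - x.mean()) ** 2)
h_formula = 1 / n + (x - x.mean()) ** 2 / Sxx
print(np.allclose(h, h_formula))               # matrix and formula agree

threshold = 2 * (1 + 1) / n                    # 2*(p+1)/n rule of thumb, p = 1
print(h[-1], threshold)                        # the far-out point blows past it
```

The extreme point's leverage is close to 1: the fitted line is essentially forced to pass near it, regardless of where its y lands.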
Practical tie-back to cross-validation: if one fold contains a high-leverage influential point and another doesn't, your CV estimate will look unstable. That's not just variance — it's structural sensitivity.
Why this geometry helps you reason about bias/variance
- A simple model (like one predictor) restricts predictions to a low-dimensional subspace => low variance, potentially high bias.
- Adding predictors expands the subspace; your projections can capture more components of y, reducing bias but increasing variance and sensitivity to leverage and overfitting.
If in learning curves you saw low training error but high validation error when adding features, geometrically that's because the projection subspace grew and started approximating noise directions in y.
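This subspace-growth effect can be demonstrated in a few lines, assuming a single true predictor plus pure-noise columns (the data-generating setup, sizes, and seed are all synthetic choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_val = 40, 200

def make_data(n):
    x = rng.normal(size=n)
    return x, 1.0 + 2.0 * x + rng.normal(size=n)

x_tr, y_tr = make_data(n_train)
x_va, y_va = make_data(n_val)

def train_val_mse(X_tr, y_tr, X_va, y_va):
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return (np.mean((y_tr - X_tr @ beta) ** 2),
            np.mean((y_va - X_va @ beta) ** 2))

# true predictor only vs. true predictor + 20 pure-noise columns
X1_tr = np.column_stack([np.ones(n_train), x_tr])
X1_va = np.column_stack([np.ones(n_val), x_va])
X2_tr = np.column_stack([X1_tr, rng.normal(size=(n_train, 20))])
X2_va = np.column_stack([X1_va, rng.normal(size=(n_val, 20))])

tr1, va1 = train_val_mse(X1_tr, y_tr, X1_va, y_va)
tr2, va2 = train_val_mse(X2_tr, y_tr, X2_va, y_va)
print(f"1 predictor : train {tr1:.3f}, val {va1:.3f}")
print(f"+20 noise   : train {tr2:.3f}, val {va2:.3f}")
# training error can only drop as the subspace grows; validation error
# typically rises, because the extra dimensions fit noise components of y
```

The training-error drop is guaranteed (a bigger subspace can only bring the projection closer to y); the validation-error rise is the statistical price.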
Quick checklist: what to compute & inspect
- Compute beta_hat via closed form (or QR for numerics).
- Plot y vs x and the fitted line + residuals.
- Check the hat diagonals h_i; anything above 2*(p+1)/n (p = number of predictors, excluding the intercept) is suspect (rule of thumb).
- Inspect residuals for orthogonality: residuals should have no linear trend with x.
- If you see a fold-to-fold CV jump: check for influential points appearing/disappearing between folds.
Code-like formulas (runnable NumPy; solving the normal equations beats forming an explicit inverse numerically):
import numpy as np
# Given X (n x p, intercept column included) and y (length n):
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS coefficients (QR/lstsq is safer for ill-conditioned X)
H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(H)                                # leverages
y_hat = H @ y                                 # fitted values: projection of y onto span(X)
resid = y - y_hat                             # residuals: orthogonal to span(X)
TL;DR / Key takeaways
- Simple linear regression is orthogonal projection: y_hat is the shadow of y onto span(X).
- Residuals are orthogonal to predictors (X^T e = 0); that's the geometry behind least squares.
- R^2 = SSR/SST = cos^2(angle between y and y_hat). In centered simple regression, R^2 = corr(x,y)^2.
- Leverage (diagonal of the hat matrix) explains how some x's can pull your line; influence = leverage × large residual.
- Geometric thinking explains why CV instability, data-snooping, and learning-curve behavior happen when certain points or extra dimensions matter.
Want to be slightly dangerous/very confident next time? Visualize y, y_hat, and e as vectors. When you can see orthogonality, projection, and leverage in your head, the diagnostics start to make sense, and your models stop surprising you (or at least they surprise you less dramatically).
Next up: we'll generalize this projection party to multiple predictors and see how subspaces, collinearity, and QR factorizations change the choreography. Bring snacks.