Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Simple Linear Regression Geometry — The Line, The Shadow, The Drama
"If you can see the geometry, you won't forget the algebra." — Someone nerdy and dramatic
Alright, we've already been the boring-but-brilliant caretakers of Train/Validation/Test splits and wrestled with cross-validation shenanigans (yes, especially for imbalanced data and data snooping). Now we're zooming in: what actually happens when you fit a simple linear regression? Spoiler: it's projection, drama, and a little bit of linear-algebra theater.
What this is (fast) — and why it matters
Rather than reheating the resampling lecture, let's use that lens: understanding the geometry of simple linear regression helps you see why some points ruin cross-validation folds, why the model variance behaves the way it does, and why diagnostics like leverage and residual plots aren't optional accessories — they're survival tools.
Simple linear regression = predicting y from one x (and usually an intercept). Algebra says slope = cov(x,y)/var(x). Geometry says: you're projecting the vector y onto the subspace spanned by the predictors and calling that projection your prediction y_hat. That's not metaphor — it's literal.
The core geometric picture
- Consider your data as vectors in R^n (n = number of observations).
- y is an n-dimensional vector of responses.
- The design matrix X (with an intercept) spans a 2-dimensional subspace: the column of ones and the column of x.
- The fitted values y_hat are the orthogonal projection of y onto span(X).
- The residuals e = y - y_hat are orthogonal to that subspace.
In formulas (vector form):
beta_hat = (X^T X)^{-1} X^T y
y_hat = X beta_hat = H y # where H = X (X^T X)^{-1} X^T is the "hat" matrix
e = y - y_hat = (I - H) y
Properties you should memorize (because they're like little factoids that save you from embarrassment at meetings):
- e is orthogonal to every column of X: X^T e = 0.
- Sum of residuals = 0 (when you include an intercept).
- The hat matrix H is symmetric and idempotent (H = H^T, H^2 = H).
- The diagonal elements h_i of H are leverages.
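All four properties are easy to check numerically. A minimal sketch on synthetic data (the data-generating line, noise level, and seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)     # toy "true" line plus noise

X = np.column_stack([np.ones(n), x])        # intercept column + x
H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y
e = y - y_hat

print(np.allclose(H, H.T))       # H is symmetric
print(np.allclose(H @ H, H))     # H is idempotent
print(np.allclose(X.T @ e, 0))   # residuals orthogonal to every column of X
print(np.isclose(e.sum(), 0))    # residuals sum to zero (intercept included)
```

All four checks print True; drop the intercept column and the last one generally stops holding, which is exactly why the "sum of residuals = 0" property is conditional on including it.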
Visual metaphors (because your brain likes pictures)
- Imagine a flashlight (y) shining onto a wall defined by span(X). The shadow on the wall is y_hat. The part of y not in the shadow is e.
- If the wall is a line (no intercept, just x), you're projecting onto a 1D line. If you include an intercept, you're projecting onto a 2D plane in n-space.
Why orthogonality? Because least squares chooses the projection that minimizes squared vertical distances, which — in vector language — is the orthogonal projection minimizing ||y - y_hat||^2.
Sum-of-squares decomposition: the Pythagorean of regression
Because of orthogonality:
- SST = SSR + SSE
Where:
- SST (total sum of squares) = ||y - mean(y)||^2
- SSR (regression sum of squares) = ||y_hat - mean(y)||^2
- SSE (error sum of squares) = ||y - y_hat||^2
And R^2 = SSR / SST. Geometrically, R^2 = cos^2(theta), where theta is the angle between the centered y and the centered y_hat (with an intercept, y_hat has the same mean as y, so centering both is the natural comparison).
So when someone says "R^2 is the squared correlation between y and y_hat" — they're not being poetic; it's literally the cosine-squared of an angle in n-space.
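Both identities — the Pythagorean decomposition and R^2 as a squared cosine — can be verified directly. A sketch on simulated data (the model and seed are illustrative, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)           # total sum of squares
ssr = np.sum((y_hat - y_hat.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)              # error sum of squares

print(np.isclose(sst, ssr + sse))           # Pythagoras: SST = SSR + SSE

r2 = ssr / sst
yc, yhc = y - y.mean(), y_hat - y_hat.mean()
cos_theta = (yc @ yhc) / (np.linalg.norm(yc) * np.linalg.norm(yhc))
print(np.isclose(r2, cos_theta ** 2))                    # R^2 = cos^2(theta)
print(np.isclose(r2, np.corrcoef(y, y_hat)[0, 1] ** 2))  # = corr(y, y_hat)^2
```

All three checks print True, confirming the "literally an angle in n-space" claim.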
Simple regression special-case: correlation and slope
If both x and y are centered (mean zero), then
beta_hat = (x^T y) / (x^T x) = cov(x,y)/var(x)
R^2 = (corr(x,y))^2
So simple linear regression = correlation in action. If corr(x,y) = 0.8, then R^2 = 0.64. Geometry: the projection of the centered y onto the direction of x has length 0.8 times the length of the centered y (cos theta = 0.8), so its squared-length share — R^2 — is 0.64.
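A quick numerical sanity check of both identities on centered data (the toy model is an arbitrary assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.8 * x + 0.6 * rng.normal(size=200)
xc, yc = x - x.mean(), y - y.mean()          # center both variables

beta_hat = (xc @ yc) / (xc @ xc)             # = cov(x, y) / var(x)
print(np.isclose(beta_hat, np.cov(x, y)[0, 1] / np.var(x, ddof=1)))

r = np.corrcoef(x, y)[0, 1]
y_hat = beta_hat * xc
r2 = np.sum(y_hat ** 2) / np.sum(yc ** 2)    # SSR / SST on centered data
print(np.isclose(r2, r ** 2))                # R^2 = corr(x, y)^2
```

Note the (n-1) factors cancel in cov/var, which is why the raw inner-product ratio x^T y / x^T x works without any normalization.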
Leverage and influence — the VIP seats at the regression table
Leverage h_i = diagonal(H)_i. For simple regression with intercept:
h_i = 1/n + (x_i - x_bar)^2 / Sxx
Sxx = sum (x_i - x_bar)^2
- Points with x far from x_bar have high leverage (they stand far out on the x-axis) and can pull the line.
- Influence ~ leverage * (large residual). A high-leverage point with small residual: meh. High leverage plus large residual: catastrophic. That's where Cook's distance lives.
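The closed-form leverage formula above can be checked against diag(H), and a single far-out x makes the "VIP seat" visible. A sketch with one deliberately extreme point (the data and the outlier's position are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = np.append(rng.uniform(0, 1, n - 1), 8.0)   # one point far out on the x-axis
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# closed form for simple regression: h_i = 1/n + (x_i - x_bar)^2 / Sxx
Sxx = np.sum((x - x.mean()) ** 2)
h_formula = 1 / n + (x - x.mean()) ** 2 / Sxx
print(np.allclose(h, h_formula))               # matrix and formula agree

threshold = 2 * (1 + 1) / n                    # 2*(p+1)/n rule of thumb, p = 1
print(h[-1], threshold)                        # the far-out point blows past it
```

The extreme point's leverage is close to 1: the fitted line is essentially forced to pass near it, regardless of where its y lands.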
Practical tie-back to cross-validation: if one fold contains a high-leverage influential point and another doesn't, your CV estimate will look unstable. That's not just variance — it's structural sensitivity.
Why this geometry helps you reason about bias/variance
- A simple model (like one predictor) restricts predictions to a low-dimensional subspace => low variance, potentially high bias.
- Adding predictors expands the subspace; your projections can capture more components of y, reducing bias but increasing variance and sensitivity to leverage and overfitting.
If in learning curves you saw low training error but high validation error when adding features, geometrically that's because the projection subspace grew and started approximating noise directions in y.
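This subspace-growth effect can be demonstrated in a few lines, assuming a single true predictor plus pure-noise columns (the data-generating setup, sizes, and seed are all synthetic choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_val = 40, 200

def make_data(n):
    x = rng.normal(size=n)
    return x, 1.0 + 2.0 * x + rng.normal(size=n)

x_tr, y_tr = make_data(n_train)
x_va, y_va = make_data(n_val)

def train_val_mse(X_tr, y_tr, X_va, y_va):
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return (np.mean((y_tr - X_tr @ beta) ** 2),
            np.mean((y_va - X_va @ beta) ** 2))

# true predictor only vs. true predictor + 20 pure-noise columns
X1_tr = np.column_stack([np.ones(n_train), x_tr])
X1_va = np.column_stack([np.ones(n_val), x_va])
X2_tr = np.column_stack([X1_tr, rng.normal(size=(n_train, 20))])
X2_va = np.column_stack([X1_va, rng.normal(size=(n_val, 20))])

tr1, va1 = train_val_mse(X1_tr, y_tr, X1_va, y_va)
tr2, va2 = train_val_mse(X2_tr, y_tr, X2_va, y_va)
print(f"1 predictor : train {tr1:.3f}, val {va1:.3f}")
print(f"+20 noise   : train {tr2:.3f}, val {va2:.3f}")
# training error can only drop as the subspace grows; validation error
# typically rises, because the extra dimensions fit noise components of y
```

The training-error drop is guaranteed (a bigger subspace can only bring the projection closer to y); the validation-error rise is the statistical price.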
Quick checklist: what to compute & inspect
- Compute beta_hat via closed form (or QR for numerics).
- Plot y vs x and the fitted line + residuals.
- Check the hat diagonals h_i; anything above 2*(p+1)/n (p = number of predictors, excluding the intercept) is suspect (rule of thumb).
- Inspect residuals for orthogonality: residuals should have no linear trend with x.
- If you see a fold-to-fold CV jump: check for influential points appearing/disappearing between folds.
Code-like formulas (runnable NumPy; solving the normal equations beats forming an explicit inverse numerically):
import numpy as np
# Given X (n x p, intercept column included) and y (length n):
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS coefficients (QR/lstsq is safer for ill-conditioned X)
H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(H)                                # leverages
y_hat = H @ y                                 # fitted values: projection of y onto span(X)
resid = y - y_hat                             # residuals: orthogonal to span(X)
TL;DR / Key takeaways
- Simple linear regression is orthogonal projection: y_hat is the shadow of y onto span(X).
- Residuals are orthogonal to predictors (X^T e = 0); that's the geometry behind least squares.
- R^2 = SSR/SST = cos^2(angle between y and y_hat). In centered simple regression, R^2 = corr(x,y)^2.
- Leverage (diagonal of the hat matrix) explains how some x's can pull your line; influence = leverage × large residual.
- Geometric thinking explains why CV instability, data-snooping, and learning-curve behavior happen when certain points or extra dimensions matter.
Want to be slightly dangerous/very confident next time? Visualize y, y_hat, and e as vectors. When you can see orthogonality, projection, and leverage in your head, the diagnostics start to make sense, and your models stop surprising you (or at least they surprise you less dramatically).
Next up: we'll generalize this projection party to multiple predictors and see how subspaces, collinearity, and QR factorizations change the choreography. Bring snacks.