Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Regression I: Linear Models — Multiple Linear Regression Formulation
"If one line explained your world, multiple lines explain your apartment complex." — A disgruntled apartment manager, probably a statistician
Opening: Why this matters (without repeating the basics)
You already know Simple Linear Regression: a single predictor, a slope, a line, and a geometric picture where we project y onto the span of a single x (recall Simple Linear Regression Geometry). Now imagine you have not one annoying roommate (predictor) but a whole cast of characters: age, income, education, hours of sleep, and whether they own a plant. Multiple Linear Regression (MLR) is the thing that figures out how each roommate contributes to the rent bill — conditional on the others.
This section builds on the geometry intuition and the evaluation practices from our cross-validation module: we'll use matrix geometry to see OLS as projection, talk about assumptions and pitfalls (hello, multicollinearity and leakage), and note how regularization + proper cross-validation is your best friend when features start to gossip with each other.
What MLR is (fast, then deep)
Multiple Linear Regression models a continuous response y as a linear combination of p predictors x1, x2, ..., xp plus noise. In compact matrix form:
y = X beta + eps
- y is an n×1 vector of responses
- X is an n×(p+1) design matrix (first column usually all ones for the intercept)
- beta is a (p+1)×1 vector of coefficients
- eps is noise (zero mean, usually assumed homoskedastic and uncorrelated)
The ordinary least squares (OLS) estimate minimizes the sum of squared residuals and has the closed-form solution:
beta_hat = (X'X)^{-1} X' y
Geometric reminder: beta_hat produces the projection of y onto the column space of X (the span of your predictors). If X's columns are nearly collinear, that column space is squishy and the projection becomes unstable.
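A quick sanity check of both facts on synthetic data (all names here are illustrative, not any library's API): the closed-form solution agrees with a generic least-squares solver, and the residuals come out orthogonal to every column of X, exactly as the projection picture predicts.

```python
# Sketch: OLS as projection, verified on synthetic data
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta_true = np.array([2.0, 1.0, -0.5, 0.25])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form via the normal equations (solve, not an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from a generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# Projection check: residuals are orthogonal to every column of X
residuals = y - X @ beta_hat
assert np.allclose(X.T @ residuals, 0, atol=1e-7)
```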
Step-by-step: From formulation to practice
Design matrix construction
- Include an intercept column of ones unless you have a reason not to.
- For categorical variables, use dummy/one-hot encoding and be mindful of the dummy-variable trap (drop one level to avoid perfect multicollinearity).
Fit OLS (closed form)
# NumPy sketch: add an intercept column, then solve the normal equations
import numpy as np
X = np.column_stack([np.ones(len(X_raw)), X_raw])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # prefer solve over an explicit inverse
Predictions and residuals
y_hat = X @ beta_hat
residuals = y - y_hat
Geometry and diagnostics
- Hat matrix H = X (X'X)^{-1} X' maps y to y_hat. Diagonal entries h_ii are leverages.
- Large leverage + large residual → influential point (Cook's distance helps quantify this).
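Those diagnostics can be computed directly. A minimal NumPy sketch with made-up data and one deliberately planted high-leverage point (all variable names are ours, not a library API):

```python
# Sketch: leverages and Cook's distance from the hat matrix (illustrative data)
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
x[0] = 8.0                                # plant one high-leverage point
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix: y_hat = H @ y
leverage = np.diag(H)                     # the h_ii values

resid = y - H @ y
p_params = X.shape[1]
s2 = resid @ resid / (n - p_params)       # residual variance estimate
cooks_d = (resid**2 / (p_params * s2)) * leverage / (1 - leverage) ** 2

print(leverage.argmax())  # → 0: the planted point has the largest leverage
```

Note that the leverages sum to the number of parameters (the trace of H equals its rank), a handy check that H was built correctly.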
Assumptions (the checklist you’ll pretend to skim)
- Linearity: Relationship between predictors and response is linear in parameters.
- Independence: Observations are independent (watch time series and clustered data).
- Homoskedasticity: Constant variance of errors.
- No perfect multicollinearity: X'X invertible (or at least full rank).
- Normality: Errors are normally distributed (needed for exact finite-sample inference; not required for unbiasedness).
Pro tip: If X'X is nearly singular, parameter estimates blow up (high variance). Regularize or remove redundant features.
Common wrinkles and what to do about them
Multicollinearity: predictors correlated with each other.
- Effect: large standard errors, unstable coefficients (signs flip like a weather vane).
- Fixes: remove/recombine features, PCA, or use regularization (Ridge, Lasso).
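One standard way to quantify the gossip is the variance inflation factor (VIF): regress each predictor on all the others and see how much the resulting R² inflates that coefficient's variance. A hand-rolled sketch on synthetic data (a VIF above roughly 10 is a common rule-of-thumb red flag):

```python
# Sketch: detecting multicollinearity with VIFs (data and names illustrative)
import numpy as np

rng = np.random.default_rng(2)
n = 200
area = rng.normal(size=n)
bedrooms = area + 0.1 * rng.normal(size=n)   # nearly collinear with area
age = rng.normal(size=n)
X = np.column_stack([area, bedrooms, age])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the others."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # area & bedrooms huge, age near 1
```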
Interactions & nonlinearity:
- Add interaction terms (x1*x2) or polynomial terms (x^2) — but remember: this increases p, so watch overfitting.
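Building those extra columns is just stacking products into the design matrix. A minimal sketch with toy vectors:

```python
# Sketch: adding interaction and polynomial columns by hand (toy data)
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, 1.0, 1.5])
X = np.column_stack([
    np.ones_like(x1),  # intercept
    x1, x2,
    x1 * x2,           # interaction: the effect of x1 now depends on x2
    x1 ** 2,           # curvature in x1
])
print(X.shape)  # → (3, 5): p grew from 2 to 4, so watch overfitting
```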
Categorical predictors:
- Use one-hot encoding; drop one level to keep X full rank.
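Here is what drop-one encoding looks like in plain NumPy (toy data; real pipelines would typically use a pandas or scikit-learn encoder). The dropped level is absorbed into the intercept, and the resulting design matrix stays full rank:

```python
# Sketch: one-hot encoding with one level dropped (pure NumPy, toy data)
import numpy as np

city = np.array(["A", "B", "C", "B", "A"])
levels = np.unique(city)                                # ['A', 'B', 'C']
# Drop the first level ('A') to avoid the dummy-variable trap:
dummies = (city[:, None] == levels[1:]).astype(float)   # columns for 'B' and 'C'
X = np.column_stack([np.ones(len(city)), dummies])      # intercept absorbs 'A'
print(np.linalg.matrix_rank(X))  # → 3: full rank, X'X is invertible
```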
Feature scaling:
- OLS doesn't require scaling for correctness, but scaling helps interpretation and is essential for regularized methods.
Leakage & data-snooping:
- Never fit or transform (e.g., scale, PCA) on the full dataset before splitting. Use your train/validation/test pipeline like we said in the cross-validation lecture.
- If using cross-validation to tune regularization strength, ensure the fold splits are correct (stratify if needed, avoid time leakage, etc.).
Where regularization enters the party (and how it connects to CV)
Ridge (L2) shrinks coefficients to reduce variance when predictors chatter with each other:
beta_ridge = (X'X + lambda I)^{-1} X' y
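As a sketch, here is that closed form in NumPy, following the common convention of leaving the intercept unpenalized (our choice; the formula above penalizes every coefficient). On nearly collinear predictors, the ridge slopes come out shorter than the OLS ones, which is exactly the variance reduction at work:

```python
# Sketch: ridge closed form with an unpenalized intercept (illustrative data)
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly collinear predictors
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + rng.normal(size=n)

def ridge(X, y, lam):
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                    # conventionally, don't shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

beta_ols = ridge(X, y, 0.0)                # lam = 0 recovers OLS
beta_r = ridge(X, y, 10.0)
# Shrinkage: the ridge slope vector is shorter than the OLS slope vector
assert np.linalg.norm(beta_r[1:]) < np.linalg.norm(beta_ols[1:])
```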
Lambda controls the bias ↔ variance tradeoff. How do you pick lambda? Cross-validation, done properly: tune lambda via CV on the training data only (no peeking at the test set). This is exactly the evaluation discipline we emphasized earlier.
Pseudocode for CV tuning (sketch):
for lambda in grid:
    cv_scores = []
    for train_idx, val_idx in folds:
        fit ridge on X[train_idx], y[train_idx]
        cv_scores.append(score on X[val_idx], y[val_idx])
    record mean(cv_scores) for this lambda
select the lambda with the best mean CV score
refit on the full training set with that lambda
evaluate once on the held-out test set
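The sketch above, made concrete in NumPy with a hand-written 5-fold split (the data, the lambda grid, and all names are illustrative):

```python
# Sketch: tuning lambda by K-fold CV on the training data only (illustrative)
import numpy as np

rng = np.random.default_rng(4)
n = 120
X_all = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y_all = X_all @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Carve off the test set first; it stays untouched until the very end
X_tr, y_tr = X_all[:100], y_all[:100]
X_te, y_te = X_all[100:], y_all[100:]

def ridge_fit(X, y, lam):
    pen = lam * np.eye(X.shape[1])
    pen[0, 0] = 0.0                          # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + pen, X.T @ y)

def mse(X, y, beta):
    r = y - X @ beta
    return r @ r / len(y)

idx = np.arange(len(y_tr))
folds = np.array_split(idx, 5)               # 5-fold split of the training rows

best_lam, best_score = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0]:
    scores = []
    for f in folds:                          # each fold plays validation once
        train = np.setdiff1d(idx, f)
        beta = ridge_fit(X_tr[train], y_tr[train], lam)
        scores.append(mse(X_tr[f], y_tr[f], beta))
    if np.mean(scores) < best_score:
        best_lam, best_score = lam, np.mean(scores)

beta_final = ridge_fit(X_tr, y_tr, best_lam)  # refit on all training rows
test_mse = mse(X_te, y_te, beta_final)        # evaluate once on the test set
```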
Quick comparison table
| Concept | Simple Linear | Multiple Linear | With Ridge |
|---|---|---|---|
| Parameters | slope + intercept | p slopes + intercept | p slopes shrunk (intercept usually unpenalized) |
| Geometry | projection onto a 2-dim column space (intercept + x) | projection onto the (p+1)-dim column space of X | biased projection, but lower variance |
| Problems | outliers affect fit | multicollinearity, leverage points | mitigates multicollinearity |
Mini example (real-world flavor)
Imagine predicting house prices using area, bedrooms, age, and distance to subway. Area and bedrooms are correlated (big houses usually have more bedrooms). In OLS, coefficients for bedrooms might be noisy or even negative — a telltale sign of multicollinearity. Use Ridge and cross-validate lambda to stabilize coefficients. Always evaluate final model on your untouched test set to avoid data-snooping karma.
Closing: Key takeaways (memorize these like exam cheat codes)
- Multiple Linear Regression = OLS in matrix form: beta_hat = (X'X)^{-1} X'y. Geometrically, it's a projection of y onto the span of X.
- Watch out for multicollinearity (unstable betas) and leverage/influential points (hat matrix and Cook’s distance).
- Categorical variables need encoding; interactions let features condition each other.
- Regularize when predictors collude; use proper cross-validation (no leakage) to pick hyperparameters.
Final thought: Linear models are deceptively powerful — simple, interpretable, and often your best baseline. Treat them like the Swiss Army knife of regression: get comfortable with the tool, but don't try to dig a swimming pool with it.