Regression I: Linear Models
Build and diagnose linear regression models, understand assumptions, and evaluate predictive performance.
Regression I: Linear Models — Multiple Linear Regression Formulation
"If one line explained your world, multiple lines explain your apartment complex." — A disgruntled apartment manager, probably a statistician
Opening: Why this matters (without repeating the basics)
You already know Simple Linear Regression: a single predictor, a slope, a line, and a geometric picture where we project y onto the span of a single x (recall Simple Linear Regression Geometry). Now imagine you have not one annoying roommate (predictor) but a whole cast of characters: age, income, education, hours of sleep, and whether they own a plant. Multiple Linear Regression (MLR) is the thing that figures out how each roommate contributes to the rent bill — conditional on the others.
This section builds on the geometry intuition and the evaluation practices from our cross-validation module: we'll use matrix geometry to see OLS as projection, talk about assumptions and pitfalls (hello, multicollinearity and leakage), and note how regularization + proper cross-validation is your best friend when features start to gossip with each other.
What MLR is (fast, then deep)
Multiple Linear Regression models a continuous response y as a linear combination of p predictors x1, x2, ..., xp plus noise. In compact matrix form:
y = X beta + eps
- y is an n×1 vector of responses
- X is an n×(p+1) design matrix (first column usually all ones for the intercept)
- beta is a (p+1)×1 vector of coefficients
- eps is noise (zero mean, usually assumed homoskedastic and uncorrelated)
The ordinary least squares (OLS) estimate minimizes the sum of squared residuals and has the closed-form solution:
beta_hat = (X'X)^{-1} X' y
Geometric reminder: beta_hat produces the projection of y onto the column space of X (the span of your predictors). If X's columns are nearly collinear, that column space is squishy and the projection becomes unstable.
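A quick sanity check of both facts on synthetic data (all names here are illustrative, not any library's API): the closed-form solution agrees with a generic least-squares solver, and the residuals come out orthogonal to every column of X, exactly as the projection picture predicts.

```python
# Sketch: OLS as projection, verified on synthetic data
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta_true = np.array([2.0, 1.0, -0.5, 0.25])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form via the normal equations (solve, not an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from a generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# Projection check: residuals are orthogonal to every column of X
residuals = y - X @ beta_hat
assert np.allclose(X.T @ residuals, 0, atol=1e-7)
```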
Step-by-step: From formulation to practice
Design matrix construction
- Include an intercept column of ones unless you have a reason not to.
- For categorical variables, use dummy/one-hot encoding and be mindful of the dummy-variable trap (drop one level to avoid perfect multicollinearity).
Fit OLS (closed form)
# NumPy sketch: add an intercept column, then solve the normal equations
import numpy as np
X = np.column_stack([np.ones(len(X_raw)), X_raw])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # prefer solve over an explicit inverse
Predictions and residuals
y_hat = X @ beta_hat
residuals = y - y_hat
Geometry and diagnostics
- Hat matrix H = X (X'X)^{-1} X' maps y to y_hat. Diagonal entries h_ii are leverages.
- Large leverage + large residual → influential point (Cook's distance helps quantify this).
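Those diagnostics can be computed directly. A minimal NumPy sketch with made-up data and one deliberately planted high-leverage point (all variable names are ours, not a library API):

```python
# Sketch: leverages and Cook's distance from the hat matrix (illustrative data)
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
x[0] = 8.0                                # plant one high-leverage point
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix: y_hat = H @ y
leverage = np.diag(H)                     # the h_ii values

resid = y - H @ y
p_params = X.shape[1]
s2 = resid @ resid / (n - p_params)       # residual variance estimate
cooks_d = (resid**2 / (p_params * s2)) * leverage / (1 - leverage) ** 2

print(leverage.argmax())  # → 0: the planted point has the largest leverage
```

Note that the leverages sum to the number of parameters (the trace of H equals its rank), a handy check that H was built correctly.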
Assumptions (the checklist you’ll pretend to skim)
- Linearity: Relationship between predictors and response is linear in parameters.
- Independence: Observations are independent (watch time series and clustered data).
- Homoskedasticity: Constant variance of errors.
- No perfect multicollinearity: X'X invertible (or at least full rank).
- Normality: Errors are normally distributed (needed for exact finite-sample inference; not required for unbiasedness).
Pro tip: If X'X is nearly singular, parameter estimates blow up (high variance). Regularize or remove redundant features.
Common wrinkles and what to do about them
Multicollinearity: predictors correlated with each other.
- Effect: large standard errors, unstable coefficients (signs flip like a weather vane).
- Fixes: remove/recombine features, PCA, or use regularization (Ridge, Lasso).
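One standard way to quantify the gossip is the variance inflation factor (VIF): regress each predictor on all the others and see how much the resulting R² inflates that coefficient's variance. A hand-rolled sketch on synthetic data (a VIF above roughly 10 is a common rule-of-thumb red flag):

```python
# Sketch: detecting multicollinearity with VIFs (data and names illustrative)
import numpy as np

rng = np.random.default_rng(2)
n = 200
area = rng.normal(size=n)
bedrooms = area + 0.1 * rng.normal(size=n)   # nearly collinear with area
age = rng.normal(size=n)
X = np.column_stack([area, bedrooms, age])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the others."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # area & bedrooms huge, age near 1
```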
Interactions & nonlinearity:
- Add interaction terms (x1*x2) or polynomial terms (x^2) — but remember: this increases p, so watch overfitting.
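Building those extra columns is just stacking products into the design matrix. A minimal sketch with toy vectors:

```python
# Sketch: adding interaction and polynomial columns by hand (toy data)
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, 1.0, 1.5])
X = np.column_stack([
    np.ones_like(x1),  # intercept
    x1, x2,
    x1 * x2,           # interaction: the effect of x1 now depends on x2
    x1 ** 2,           # curvature in x1
])
print(X.shape)  # → (3, 5): p grew from 2 to 4, so watch overfitting
```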
Categorical predictors:
- Use one-hot encoding; drop one level to keep X full rank.
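Here is what drop-one encoding looks like in plain NumPy (toy data; real pipelines would typically use a pandas or scikit-learn encoder). The dropped level is absorbed into the intercept, and the resulting design matrix stays full rank:

```python
# Sketch: one-hot encoding with one level dropped (pure NumPy, toy data)
import numpy as np

city = np.array(["A", "B", "C", "B", "A"])
levels = np.unique(city)                                # ['A', 'B', 'C']
# Drop the first level ('A') to avoid the dummy-variable trap:
dummies = (city[:, None] == levels[1:]).astype(float)   # columns for 'B' and 'C'
X = np.column_stack([np.ones(len(city)), dummies])      # intercept absorbs 'A'
print(np.linalg.matrix_rank(X))  # → 3: full rank, X'X is invertible
```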
Feature scaling:
- OLS doesn't require scaling for correctness, but scaling helps interpretation and is essential for regularized methods.
Leakage & data-snooping:
- Never fit or transform (e.g., scale, PCA) on the full dataset before splitting. Use your train/validation/test pipeline like we said in the cross-validation lecture.
- If using cross-validation to tune regularization strength, ensure the fold splits are correct (stratify if needed, avoid time leakage, etc.).
Where regularization enters the party (and how it connects to CV)
Ridge (L2) shrinks coefficients to reduce variance when predictors chatter with each other:
beta_ridge = (X'X + lambda I)^{-1} X' y
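As a sketch, here is that closed form in NumPy, following the common convention of leaving the intercept unpenalized (our choice; the formula above penalizes every coefficient). On nearly collinear predictors, the ridge slopes come out shorter than the OLS ones, which is exactly the variance reduction at work:

```python
# Sketch: ridge closed form with an unpenalized intercept (illustrative data)
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly collinear predictors
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + rng.normal(size=n)

def ridge(X, y, lam):
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                    # conventionally, don't shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

beta_ols = ridge(X, y, 0.0)                # lam = 0 recovers OLS
beta_r = ridge(X, y, 10.0)
# Shrinkage: the ridge slope vector is shorter than the OLS slope vector
assert np.linalg.norm(beta_r[1:]) < np.linalg.norm(beta_ols[1:])
```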
Lambda controls the bias ↔ variance tradeoff. How do you pick lambda? Cross-validation, done properly: tune lambda via CV on the training data only (no peeking at the test set). This is exactly the evaluation discipline we emphasized earlier.
Pseudocode for CV tuning (sketch):
for lambda in grid:
    cv_scores = []
    for train_idx, val_idx in folds:
        fit ridge on X[train_idx], y[train_idx]
        cv_scores.append(score on X[val_idx], y[val_idx])
    record mean(cv_scores) for this lambda
select the lambda with the best mean CV score
refit on the full training set with that lambda
evaluate once on the held-out test set
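The sketch above, made concrete in NumPy with a hand-written 5-fold split (the data, the lambda grid, and all names are illustrative):

```python
# Sketch: tuning lambda by K-fold CV on the training data only (illustrative)
import numpy as np

rng = np.random.default_rng(4)
n = 120
X_all = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y_all = X_all @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Carve off the test set first; it stays untouched until the very end
X_tr, y_tr = X_all[:100], y_all[:100]
X_te, y_te = X_all[100:], y_all[100:]

def ridge_fit(X, y, lam):
    pen = lam * np.eye(X.shape[1])
    pen[0, 0] = 0.0                          # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + pen, X.T @ y)

def mse(X, y, beta):
    r = y - X @ beta
    return r @ r / len(y)

idx = np.arange(len(y_tr))
folds = np.array_split(idx, 5)               # 5-fold split of the training rows

best_lam, best_score = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0]:
    scores = []
    for f in folds:                          # each fold plays validation once
        train = np.setdiff1d(idx, f)
        beta = ridge_fit(X_tr[train], y_tr[train], lam)
        scores.append(mse(X_tr[f], y_tr[f], beta))
    if np.mean(scores) < best_score:
        best_lam, best_score = lam, np.mean(scores)

beta_final = ridge_fit(X_tr, y_tr, best_lam)  # refit on all training rows
test_mse = mse(X_te, y_te, beta_final)        # evaluate once on the test set
```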
Quick comparison table
| Concept | Simple Linear | Multiple Linear | With Ridge |
|---|---|---|---|
| Parameters | slope + intercept | p slopes + intercept | p slopes shrunk (intercept usually unpenalized) |
| Geometry | projection onto a 2-dim column space (intercept + x) | projection onto the (p+1)-dim column space of X | biased projection, but lower variance |
| Problems | outliers affect fit | multicollinearity, leverage points | mitigates multicollinearity |
Mini example (real-world flavor)
Imagine predicting house prices using area, bedrooms, age, and distance to subway. Area and bedrooms are correlated (big houses usually have more bedrooms). In OLS, coefficients for bedrooms might be noisy or even negative — a telltale sign of multicollinearity. Use Ridge and cross-validate lambda to stabilize coefficients. Always evaluate final model on your untouched test set to avoid data-snooping karma.
Closing: Key takeaways (memorize these like exam cheat codes)
- Multiple Linear Regression = OLS in matrix form: beta_hat = (X'X)^{-1} X'y. Geometrically, it's a projection of y onto the span of X.
- Watch out for multicollinearity (unstable betas) and leverage/influential points (hat matrix and Cook’s distance).
- Categorical variables need encoding; interactions let features condition each other.
- Regularize when predictors collude; use proper cross-validation (no leakage) to pick hyperparameters.
Final thought: Linear models are deceptively powerful — simple, interpretable, and often your best baseline. Treat them like the Swiss Army knife of regression: get comfortable with the tool, but don't try to dig a swimming pool with it.