Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
Decision Boundaries and Geometry — Logistic Regression, But Make It Spatial
"Think of logistic regression as a polite bouncer who gives a probability, not a binary shove. The decision boundary is where they pause mid-judgment and say, ‘hmm… 50–50.’"
Hook: Why should we care about geometry here?
You already know from Maximum Likelihood Estimation (we did the math in the last chapter) that logistic regression fits weights to maximize the probability of the labels. And from Regularized Logistic Regression we learned how penalties like L2/L1 tame those weights. Great. But what does that look like on the plane? How does a vector of weights turn into a line, a curve, or a weirdly shaped region that separates cats from dogs (or spam from not-spam)? This is the geometry chapter — the part where algebra stops being dry and gets spatially dramatic.
Quick reminder (no heavy repeat): the probabilistic formula
For binary logistic regression we model
p(y{=}1 \mid x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
The decision boundary for threshold 0.5 is given by
\theta^T x = 0
That equation is everything. It's the cliff-edge where our model is indifferent.
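To see the cliff-edge numerically, here is a minimal sketch (weights and point are made-up values chosen so that θ^T x is exactly 0):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps the linear score theta^T x to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and a point that sits exactly on the boundary:
# theta^T x = 2*1 + (-1)*2 = 0, so the model is perfectly indifferent.
theta = np.array([2.0, -1.0])
x_on_boundary = np.array([1.0, 2.0])

print(sigmoid(theta @ x_on_boundary))  # 0.5: the 50-50 bouncer moment
```

Any point with a positive score gets p > 0.5, any point with a negative score gets p < 0.5, and the boundary is exactly where the score crosses zero.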
The basic geometry: linear boundaries in feature space
- In d-dimensional input space, the set {x : θ^T x = 0} is a (d−1)-dimensional hyperplane. In 2D it's a line, in 3D it's a plane, etc.
- The weight vector θ is normal (perpendicular) to the decision hyperplane. That's the single most useful geometric fact.
Imagine θ as an arrow. The decision plane sits perpendicular to that arrow, slicing the space. Points the arrow points toward have positive θ^T x (predicted p > 0.5), points in the opposite direction give negative values (p < 0.5).
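A quick numeric check of both geometric facts, using a made-up θ: points on the arrow's side score positive, points on the opposite side score negative, and any direction lying inside the boundary is perpendicular to θ (dot product zero).

```python
import numpy as np

theta = np.array([1.0, 1.0])            # hypothetical weight vector, no bias

positive_side = np.array([2.0, 1.0])    # theta @ x =  3 -> p > 0.5
negative_side = np.array([-1.0, -2.0])  # theta @ x = -3 -> p < 0.5
in_boundary   = np.array([1.0, -1.0])   # lies in {x : theta @ x = 0}

print(theta @ positive_side, theta @ negative_side)
print(theta @ in_boundary)  # 0.0: theta is perpendicular to this direction
```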
Intuition: the knife and the pancake
Picture your dataset as a pancake on the table. The weight vector is a knife stuck straight into the pancake; the decision boundary is the plane of the knife blade. Rotate the knife (change θ direction) and you rotate the line that divides blueberries from chocolate chips.
Intercept and translation
Including an intercept θ0 (bias) corresponds to augmenting x with a constant 1: θ^T x + θ0 = 0. Geometrically, θ0 moves the hyperplane away from the origin — it translates the slice. Changing θ0 slides the line parallel to itself; changing θ (direction) rotates it.
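You can verify the slide-vs-rotate claim directly: with θ fixed, the slope of the 2D boundary line does not depend on θ0 (weights below are hypothetical):

```python
import numpy as np

theta = np.array([1.0, -2.0])  # direction part of the weights, made up

def boundary_x2(x1, theta0):
    """Solve theta[0]*x1 + theta[1]*x2 + theta0 = 0 for x2."""
    return -(theta[0] * x1 + theta0) / theta[1]

slope_a = boundary_x2(1.0, 0.0) - boundary_x2(0.0, 0.0)
slope_b = boundary_x2(1.0, 3.0) - boundary_x2(0.0, 3.0)
print(slope_a, slope_b)  # identical: theta0 only translates the line
```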
Thresholds other than 0.5: parallel decision boundaries
If you use threshold τ (p≥τ ⇒ class 1), then the boundary is
\theta^T x = \log\frac{\tau}{1-\tau}
That's still a hyperplane, just parallel to the 0.5 boundary. Raising or lowering τ slides the boundary along the θ direction without rotating it.
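A two-line sanity check of the offset formula: the default threshold 0.5 gives offset 0 (the familiar boundary), while a stricter threshold like 0.9 shifts the boundary along θ:

```python
import math

def boundary_offset(tau):
    """For threshold tau, the boundary is theta^T x = log(tau / (1 - tau))."""
    return math.log(tau / (1.0 - tau))

print(boundary_offset(0.5))  # 0.0: the usual theta^T x = 0 boundary
print(boundary_offset(0.9))  # ~2.197: parallel boundary, shifted along theta
```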
Regularization — geometry edition (builds on your Regression II and Regularized Logistic Regression knowledge)
Remember: L2 (ridge) penalizes large weights, L1 (lasso) promotes sparsity. What does that do geometrically?
| Regularizer | Geometric effect on decision boundary | Intuition/When useful |
|---|---|---|
| None | High-magnitude θ → very sharp probability transitions near the boundary | Fits closely; risks overfitting, especially with transformed features |
| L2 (ridge) | Shrinks θ magnitudes → gentler probability transitions, more stable orientation | Reduces variance; keeps all features but with smaller influence |
| L1 (lasso) | Drives some θ components to 0 → boundary depends only on the selected features | Feature selection; boundary becomes axis-aligned in the dropped dimensions |
- L2 is like attaching a rubber band to the arrow (θ): it resists extreme directions, preferring shorter arrows that still slice the data. Shorter arrow → gentler discrimination.
- L1 is like forcing some components of the arrow to zero with duct tape; the knife ends up only pointing along certain coordinates.
Question: Why does shrinking θ make the boundary "less complex"? Because large θ amplify small differences in feature space into big probability swings (logit magnitudes). Shrink θ → smaller logits → smoother probability surface.
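The logit-amplification argument in one dimension, with made-up weights: the same point near the boundary gets a mild probability under a small weight and a much more confident one after scaling the weight by 10.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_near = 0.1                         # a point just on the positive side
small_theta, big_theta = 1.0, 10.0   # hypothetical 1-D weights

print(sigmoid(small_theta * x_near))  # ~0.52: gentle confidence
print(sigmoid(big_theta * x_near))    # ~0.73: same point, sharper claim
```

Shrinking θ (which is what L2 does) reverses this: logits get smaller, so the probability surface ramps more gently across the boundary.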
Nonlinear boundaries via feature engineering
Logistic regression's boundary is always a hyperplane in the feature space, but if you engineer features (polynomials, interactions, kernel-style expansions), that hyperplane maps back to a nonlinear surface in the original x-space.
Example: If you augment (x1,x2) with x1^2 + x2^2, a linear separator in this transformed space can produce a circular boundary in the original space. Geometry + feature transforms = creativity.
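A concrete version of the circle trick, with hand-picked (hypothetical) weights in the transformed space [1, x1, x2, x1² + x2²]: the linear equation −4 + (x1² + x2²) = 0 is exactly the circle of radius 2.

```python
import numpy as np

# Hypothetical weights in the transformed feature space:
# boundary  -4 + 0*x1 + 0*x2 + 1*(x1^2 + x2^2) = 0  is the circle r = 2.
theta = np.array([-4.0, 0.0, 0.0, 1.0])

def score(x1, x2):
    features = np.array([1.0, x1, x2, x1**2 + x2**2])
    return theta @ features

print(score(2.0, 0.0))  #  0.0: on the circle of radius 2
print(score(0.0, 0.0))  # -4.0: inside  -> predicted class 0
print(score(3.0, 0.0))  #  5.0: outside -> predicted class 1
```

The separator is still linear in the four-dimensional feature space; the circle only appears when you project back down to (x1, x2).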
Multiclass geometry (softmax and one-vs-rest)
- One-vs-rest (OvR): Each class gets a θ_k. The predicted class is argmax_k θ_k^T x. Decision boundaries between classes k and j satisfy θ_k^T x = θ_j^T x ⇒ (θ_k − θ_j)^T x = 0. So pairwise boundaries are hyperplanes.
- Softmax (multinomial logistic): same idea — pairwise linear boundaries. The regions form convex polyhedra (think Voronoi tessellation with linear facets).
So multiclass logistic with linear scores partitions space into convex regions separated by straight hyperplanes.
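A small sketch of the argmax geometry, with made-up per-class weight rows and x = [1, x1, x2]: the winner is the largest score, and the boundary between two classes is where their score difference vanishes.

```python
import numpy as np

# Hypothetical weight vectors (one row per class), x = [1, x1, x2].
Theta = np.array([[0.0,  1.0, 0.0],
                  [0.0, -1.0, 0.0],
                  [0.0,  0.0, 1.0]])

x = np.array([1.0, 2.0, 0.5])
scores = Theta @ x
print(np.argmax(scores))  # class 0 wins: scores are [2.0, -2.0, 0.5]

# Boundary between classes 0 and 1: (theta_0 - theta_1)^T x = 0.
diff = Theta[0] - Theta[1]               # [0, 2, 0] -> the line x1 = 0
print(diff @ np.array([1.0, 0.0, 7.0]))  # 0.0: this point sits on it
```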
Practical geometry exercises (do these in your head or notebook)
- Take θ = [1, −2, 0.5] with x = [1, x1, x2] (bias included). Plot the line θ^T x = 0 in the x1-x2 plane. Which direction is positive? (Answer: follow θ's projection onto x1-x2).
- Increase the magnitude of θ by 10x. What happens to predicted probabilities near the boundary? (Answer: they become sharper — closer to 0 or 1 except in a thinner strip around the boundary.)
- Add a feature x3 = x1^2 + x2^2 and fit a new θ. What shape might the decision boundary be now? (Answer: circular or elliptical if coefficients weight x3 and bias appropriately.)
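If you'd rather check the first two exercises in code than in your head, here is a sketch using the exercise's θ = [1, −2, 0.5] with bias first:

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.5])  # exercise weights, bias component first

def score(x1, x2):
    return theta @ np.array([1.0, x1, x2])

print(score(0.0, -2.0))  # 0.0: this point lies on the boundary line
print(score(-1.0, 0.0))  # 3.0: positive side

# Exercise 2: scale theta by 10 and compare probabilities near the boundary.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigmoid(score(0.1, 0.0)), sigmoid(10 * score(0.1, 0.0)))
```

The last line shows the sharpening effect: the scaled model pushes the same nearby point much closer to probability 1.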
Why people misunderstand this
- They mix up data space and parameter space. Change θ (parameter space) and you move/rotate the decision boundary (data space). Don’t confuse the sign of θ components with class labels without checking the intercept.
- They think regularization magically “changes model class” — it doesn’t make logistic nonlinear. It just changes how the hyperplane sits.
Closing: key takeaways and the single line to tattoo on your brain
- Decision boundary = θ^T x + θ0 = 0. The weight vector θ is perpendicular to that boundary. Rotate θ → rotate the boundary; change θ0 → slide it.
- L2/L1 don’t change the type of boundary (unless you change features); they change its orientation, position, and sharpness by altering θ.
- Feature transformations convert linear separators in feature space to nonlinear separators in original space — geometry is loyal to your feature map.
Final thought: We learned to control complexity in regression with ridge/lasso. Now in classification, those same brakes on θ are our geometric steering wheel: they rotate, shrink, or simplify the slices our model makes in the world. Geometry isn’t decoration — it’s where the rubber meets the data.
Go try: pick a 2D dataset, fit logistic with and without L2 and with a quadratic feature. Plot the boundaries. Then sit back and watch math become visual theater.
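One possible starting point for that experiment, as a sketch (the dataset, seed, and regularization strengths are all arbitrary choices; a very large C in scikit-learn approximates "no regularization"):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Toy 2D dataset: class 1 inside a disc of radius 2, class 0 outside.
X = rng.uniform(-3, 3, size=(400, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 4).astype(int)

# Linear features: a straight-line boundary cannot separate a disc well.
linear = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)

# Quadratic features let the same linear machinery carve out a circle.
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
quadratic = LogisticRegression(C=1.0, max_iter=2000).fit(X_quad, y)

print(linear.score(X, y), quadratic.score(X_quad, y))
```

From here, plot each model's predicted class over a grid of (x1, x2) points and watch the straight slice become a circle.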