Classification I: Logistic Regression and Probabilistic View
Model class probabilities with logistic regression and related probabilistic classifiers.
Decision Boundaries and Geometry — Logistic Regression, But Make It Spatial
"Think of logistic regression as a polite bouncer who gives a probability, not a binary shove. The decision boundary is where they pause mid-judgment and say, ‘hmm… 50–50.’"
Hook: Why should we care about geometry here?
You already know from Maximum Likelihood Estimation (we did the math in the last chapter) that logistic regression fits weights to maximize the probability of the labels. And from Regularized Logistic Regression we learned how penalties like L2/L1 tame those weights. Great. But what does that look like on the plane? How does a vector of weights turn into a line, a curve, or a weirdly shaped region that separates cats from dogs (or spam from not-spam)? This is the geometry chapter — the part where algebra stops being dry and gets spatially dramatic.
Quick reminder (no heavy repeat): the probabilistic formula
For binary logistic regression we model
p(y{=}1 \mid x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
The decision boundary for threshold 0.5 is given by
\theta^T x = 0
That equation is everything. It's the cliff-edge where our model is indifferent.
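To see the cliff-edge numerically, here is a minimal sketch (weights and point are made-up values chosen so that θ^T x is exactly 0):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps the linear score theta^T x to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and a point that sits exactly on the boundary:
# theta^T x = 2*1 + (-1)*2 = 0, so the model is perfectly indifferent.
theta = np.array([2.0, -1.0])
x_on_boundary = np.array([1.0, 2.0])

print(sigmoid(theta @ x_on_boundary))  # 0.5: the 50-50 bouncer moment
```

Any point with a positive score gets p > 0.5, any point with a negative score gets p < 0.5, and the boundary is exactly where the score crosses zero.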
The basic geometry: linear boundaries in feature space
- In d-dimensional input space, the set {x : θ^T x = 0} is a (d−1)-dimensional hyperplane. In 2D it's a line, in 3D it's a plane, etc.
- The weight vector θ is normal (perpendicular) to the decision hyperplane. That's the single most useful geometric fact.
Imagine θ as an arrow. The decision plane sits perpendicular to that arrow, slicing the space. Points the arrow points toward have positive θ^T x (predicted p > 0.5), points in the opposite direction give negative values (p < 0.5).
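A quick numeric check of both geometric facts, using a made-up θ: points on the arrow's side score positive, points on the opposite side score negative, and any direction lying inside the boundary is perpendicular to θ (dot product zero).

```python
import numpy as np

theta = np.array([1.0, 1.0])            # hypothetical weight vector, no bias

positive_side = np.array([2.0, 1.0])    # theta @ x =  3 -> p > 0.5
negative_side = np.array([-1.0, -2.0])  # theta @ x = -3 -> p < 0.5
in_boundary   = np.array([1.0, -1.0])   # lies in {x : theta @ x = 0}

print(theta @ positive_side, theta @ negative_side)
print(theta @ in_boundary)  # 0.0: theta is perpendicular to this direction
```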
Intuition: the knife and the pancake
Picture your dataset as a pancake on the table. The weight vector is a knife stuck straight into the pancake; the decision boundary is the plane of the knife blade. Rotate the knife (change θ direction) and you rotate the line that divides blueberries from chocolate chips.
Intercept and translation
Including an intercept θ0 (bias) corresponds to augmenting x with a constant 1: θ^T x + θ0 = 0. Geometrically, θ0 moves the hyperplane away from the origin — it translates the slice. Changing θ0 slides the line parallel to itself; changing θ (direction) rotates it.
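You can verify the slide-vs-rotate claim directly: with θ fixed, the slope of the 2D boundary line does not depend on θ0 (weights below are hypothetical):

```python
import numpy as np

theta = np.array([1.0, -2.0])  # direction part of the weights, made up

def boundary_x2(x1, theta0):
    """Solve theta[0]*x1 + theta[1]*x2 + theta0 = 0 for x2."""
    return -(theta[0] * x1 + theta0) / theta[1]

slope_a = boundary_x2(1.0, 0.0) - boundary_x2(0.0, 0.0)
slope_b = boundary_x2(1.0, 3.0) - boundary_x2(0.0, 3.0)
print(slope_a, slope_b)  # identical: theta0 only translates the line
```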
Thresholds other than 0.5: parallel decision boundaries
If you use threshold τ (p≥τ ⇒ class 1), then the boundary is
\theta^T x = \log\frac{\tau}{1-\tau}
That's still a hyperplane, just parallel to the 0.5 boundary. Raising or lowering τ slides the boundary along the θ direction without rotating it.
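A two-line sanity check of the offset formula: the default threshold 0.5 gives offset 0 (the familiar boundary), while a stricter threshold like 0.9 shifts the boundary along θ:

```python
import math

def boundary_offset(tau):
    """For threshold tau, the boundary is theta^T x = log(tau / (1 - tau))."""
    return math.log(tau / (1.0 - tau))

print(boundary_offset(0.5))  # 0.0: the usual theta^T x = 0 boundary
print(boundary_offset(0.9))  # ~2.197: parallel boundary, shifted along theta
```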
Regularization — geometry edition (builds on your Regression II and Regularized Logistic Regression knowledge)
Remember: L2 (ridge) penalizes large weights, L1 (lasso) promotes sparsity. What does that do geometrically?
| Regularizer | Geometric effect on decision boundary | Intuition/When useful |
|---|---|---|
| None | High-magnitude θ → very sharp probability transitions near the boundary | Fits closely; risks overfitting, especially with transformed features |
| L2 (ridge) | Shrinks θ magnitudes → gentler probability transitions, more stable orientation | Reduces variance; keeps all features but with smaller influence |
| L1 (lasso) | Drives some θ components to 0 → boundary depends only on the selected features | Feature selection; boundary becomes axis-aligned in the dropped dimensions |
- L2 is like attaching a rubber band to the arrow (θ): it resists extreme directions, preferring shorter arrows that still slice the data. Shorter arrow → gentler discrimination.
- L1 is like forcing some components of the arrow to zero with duct tape; the knife ends up only pointing along certain coordinates.
Question: Why does shrinking θ make the boundary "less complex"? Because large θ amplify small differences in feature space into big probability swings (logit magnitudes). Shrink θ → smaller logits → smoother probability surface.
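The logit-amplification argument in one dimension, with made-up weights: the same point near the boundary gets a mild probability under a small weight and a much more confident one after scaling the weight by 10.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_near = 0.1                         # a point just on the positive side
small_theta, big_theta = 1.0, 10.0   # hypothetical 1-D weights

print(sigmoid(small_theta * x_near))  # ~0.52: gentle confidence
print(sigmoid(big_theta * x_near))    # ~0.73: same point, sharper claim
```

Shrinking θ (which is what L2 does) reverses this: logits get smaller, so the probability surface ramps more gently across the boundary.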
Nonlinear boundaries via feature engineering
Logistic regression's boundary is always a hyperplane in the feature space, but if you engineer features (polynomials, interactions, kernel-style expansions), that hyperplane maps back to a nonlinear surface in the original x-space.
Example: If you augment (x1,x2) with x1^2 + x2^2, a linear separator in this transformed space can produce a circular boundary in the original space. Geometry + feature transforms = creativity.
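A concrete version of the circle trick, with hand-picked (hypothetical) weights in the transformed space [1, x1, x2, x1² + x2²]: the linear equation −4 + (x1² + x2²) = 0 is exactly the circle of radius 2.

```python
import numpy as np

# Hypothetical weights in the transformed feature space:
# boundary  -4 + 0*x1 + 0*x2 + 1*(x1^2 + x2^2) = 0  is the circle r = 2.
theta = np.array([-4.0, 0.0, 0.0, 1.0])

def score(x1, x2):
    features = np.array([1.0, x1, x2, x1**2 + x2**2])
    return theta @ features

print(score(2.0, 0.0))  #  0.0: on the circle of radius 2
print(score(0.0, 0.0))  # -4.0: inside  -> predicted class 0
print(score(3.0, 0.0))  #  5.0: outside -> predicted class 1
```

The separator is still linear in the four-dimensional feature space; the circle only appears when you project back down to (x1, x2).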
Multiclass geometry (softmax and one-vs-rest)
- One-vs-rest (OvR): Each class gets a θ_k. The predicted class is argmax_k θ_k^T x. Decision boundaries between classes k and j satisfy θ_k^T x = θ_j^T x ⇒ (θ_k − θ_j)^T x = 0. So pairwise boundaries are hyperplanes.
- Softmax (multinomial logistic): same idea — pairwise linear boundaries. The regions form convex polyhedra (think Voronoi tessellation with linear facets).
So multiclass logistic with linear scores partitions space into convex regions separated by straight hyperplanes.
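A small sketch of the argmax geometry, with made-up per-class weight rows and x = [1, x1, x2]: the winner is the largest score, and the boundary between two classes is where their score difference vanishes.

```python
import numpy as np

# Hypothetical weight vectors (one row per class), x = [1, x1, x2].
Theta = np.array([[0.0,  1.0, 0.0],
                  [0.0, -1.0, 0.0],
                  [0.0,  0.0, 1.0]])

x = np.array([1.0, 2.0, 0.5])
scores = Theta @ x
print(np.argmax(scores))  # class 0 wins: scores are [2.0, -2.0, 0.5]

# Boundary between classes 0 and 1: (theta_0 - theta_1)^T x = 0.
diff = Theta[0] - Theta[1]               # [0, 2, 0] -> the line x1 = 0
print(diff @ np.array([1.0, 0.0, 7.0]))  # 0.0: this point sits on it
```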
Practical geometry exercises (do these in your head or notebook)
- Take θ = [1, −2, 0.5] with x = [1, x1, x2] (bias included). Plot the line θ^T x = 0 in the x1-x2 plane. Which direction is positive? (Answer: follow θ's projection onto x1-x2).
- Increase the magnitude of θ by 10x. What happens to predicted probabilities near the boundary? (Answer: they become sharper — closer to 0 or 1 except in a thinner strip around the boundary.)
- Add a feature x3 = x1^2 + x2^2 and fit a new θ. What shape might the decision boundary be now? (Answer: circular or elliptical if coefficients weight x3 and bias appropriately.)
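If you'd rather check the first two exercises in code than in your head, here is a sketch using the exercise's θ = [1, −2, 0.5] with bias first:

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.5])  # exercise weights, bias component first

def score(x1, x2):
    return theta @ np.array([1.0, x1, x2])

print(score(0.0, -2.0))  # 0.0: this point lies on the boundary line
print(score(-1.0, 0.0))  # 3.0: positive side

# Exercise 2: scale theta by 10 and compare probabilities near the boundary.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigmoid(score(0.1, 0.0)), sigmoid(10 * score(0.1, 0.0)))
```

The last line shows the sharpening effect: the scaled model pushes the same nearby point much closer to probability 1.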
Why people misunderstand this
- They mix up data space and parameter space. Change θ (parameter space) and you move/rotate the decision boundary (data space). Don’t confuse the sign of θ components with class labels without checking the intercept.
- They think regularization magically “changes model class” — it doesn’t make logistic nonlinear. It just changes how the hyperplane sits.
Closing: key takeaways and the single line to tattoo on your brain
- Decision boundary = θ^T x + θ0 = 0. The weight vector θ is perpendicular to that boundary. Rotate θ → rotate the boundary; change θ0 → slide it.
- L2/L1 don’t change the type of boundary (unless you change features); they change its orientation, position, and sharpness by altering θ.
- Feature transformations convert linear separators in feature space to nonlinear separators in original space — geometry is loyal to your feature map.
Final thought: We learned to control complexity in regression with ridge/lasso. Now in classification, those same brakes on θ are our geometric steering wheel: they rotate, shrink, or simplify the slices our model makes in the world. Geometry isn’t decoration — it’s where the rubber meets the data.
Go try: pick a 2D dataset, fit logistic with and without L2 and with a quadratic feature. Plot the boundaries. Then sit back and watch math become visual theater.
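One possible starting point for that experiment, as a sketch (the dataset, seed, and regularization strengths are all arbitrary choices; a very large C in scikit-learn approximates "no regularization"):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Toy 2D dataset: class 1 inside a disc of radius 2, class 0 outside.
X = rng.uniform(-3, 3, size=(400, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 4).astype(int)

# Linear features: a straight-line boundary cannot separate a disc well.
linear = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)

# Quadratic features let the same linear machinery carve out a circle.
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
quadratic = LogisticRegression(C=1.0, max_iter=2000).fit(X_quad, y)

print(linear.score(X, y), quadratic.score(X_quad, y))
```

From here, plot each model's predicted class over a grid of (x1, x2) points and watch the straight slice become a circle.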