Math for Machine Learning
The mathematical pillars underpinning models, optimization, and inference.
Linear Algebra for Machine Learning: Shape-Checked Superpowers
If virtual environments keep your Python dependencies from fighting, linear algebra keeps your features from beefing. Same energy, more geometry.
You just survived virtual envs, packaging, and type hints. Congrats. Now we are upgrading your brain with the math that lets ML actually move. Linear algebra is the silent infrastructure under your numpy arrays, neural nets, and PCA plots. It is the part of the story where shapes matter, and the code finally aligns with geometry.
What we are doing
- Turning vectors and matrices into actual characters with jobs
- Reading shapes like type hints for math
- Making peace with dot products, projections, eigen-things, and the SVD
- Connecting the math to NumPy code you can run without crying
Vectors and matrices: the cast
- Vector: ordered list of numbers. Think feature vector: one row of your dataset.
- Matrix: grid of numbers. Think dataset: n_samples by n_features.
- Tensor: multi-dimensional array. Think images: height × width × channels.
In ML land:
- A linear layer in a neural network? That is just matrix multiply plus bias.
- A dataset X with shape (n, d)? Every row is a vector in R^d. You are living in a d-dimensional world with n experiences.
Core idea: linear algebra gives you a language for transforming spaces. Matrices are functions that take vectors in, spit vectors out, and keep lines straight.
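Here is that "matrix multiply plus bias" claim as a minimal sketch in plain NumPy; the sizes and names are made up for illustration, not taken from any particular framework:
import numpy as np

n, d, m = 32, 10, 4            # batch size, input features, output features (all made up)
X = np.random.randn(n, d)      # a batch of n feature vectors
W = np.random.randn(d, m)      # weight matrix: maps R^d to R^m
b = np.random.randn(m)         # bias vector

Y = X @ W + b                  # the whole "linear layer": matrix multiply plus bias
print(Y.shape)                 # (32, 4) -- shapes compose: (n, d) @ (d, m) -> (n, m)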
Shapes are type hints for math
You learned to use type hints in Python to catch shape-ish mistakes. Do the same in math.
- Dot product: (d,) · (d,) → scalar
- Matrix-vector: (m, d) @ (d,) → (m,)
- Matrix-matrix: (m, d) @ (d, n) → (m, n)
Try to multiply incompatible shapes and numpy will roast you.
import numpy as np

A = np.random.randn(100, 300)
B = np.random.randn(128, 200)
try:
    A @ B  # inner dimensions (300 vs 128) do not line up
except ValueError as e:
    print('Shape error:', e)
Mental model: shape-checking is like type checking. If types do not align, the math gods veto.
Dot products, norms, and why cosine keeps showing up
Dot product x · y = sum_i x_i y_i
- Geometry: length of x times length of y times cos(theta)
- ML: similarity measure; cosine similarity normalizes away scale
Norms measure vector size
- L2 norm: ||x||_2 = sqrt(sum x_i^2). Smooth, angle-friendly
- L1 norm: ||x||_1 = sum |x_i|. Sparse-friendly, robust to outliers
Normalize features to avoid scale chaos:
X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8) # unit vectors per row
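Putting the pieces together, a tiny sketch with made-up vectors that checks the dot product, both norms, and cosine similarity by hand:
x = np.array([3.0, 4.0, 0.0])
y = np.array([1.0, 2.0, 2.0])

dot = x @ y                                    # sum_i x_i y_i = 11
l2 = np.linalg.norm(x)                         # sqrt(9 + 16 + 0) = 5
l1 = np.linalg.norm(x, ord=1)                  # |3| + |4| + |0| = 7
cosine = dot / (np.linalg.norm(x) * np.linalg.norm(y))   # 11 / (5 * 3): scale drops out

print(dot, l2, l1, cosine)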
Matrix multiply: composition of linear moves
- Matrix columns are the images of basis vectors. Multiplying A @ x means taking a weighted combo of A's columns with weights from x.
- Composition: B @ A means do A first, then B. Order matters; math is petty about that too.
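A quick sketch with two toy 2×2 transformations, just to watch order matter in numbers:
rot = np.array([[0.0, -1.0],      # rotate 90 degrees counterclockwise
                [1.0,  0.0]])
stretch = np.array([[2.0, 0.0],   # stretch the x-axis by 2
                    [0.0, 1.0]])
v = np.array([1.0, 0.0])

print((stretch @ rot) @ v)   # rotate first, then stretch: [0., 1.]
print((rot @ stretch) @ v)   # stretch first, then rotate: [0., 2.]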
Identity and inverse:
- I is the do-nothing transformation
- A^{-1} undoes A (if it exists)
- Do not compute inverses directly in code; use solvers
# Solve Ax = b without forming A^{-1}
A = np.random.randn(500, 500)
b = np.random.randn(500)
x = np.linalg.solve(A, b)
Pro tip: friends do not let friends compute matrix inverses. Use solve or factorization.
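If you want receipts, here is a small sketch on a deliberately ill-conditioned matrix (entirely made up): the explicit inverse costs extra work and buys you nothing here over solve:
n = 200
# Ill-conditioned by construction: orthogonal factors around decaying singular values
Q1, _ = np.linalg.qr(np.random.randn(n, n))
Q2, _ = np.linalg.qr(np.random.randn(n, n))
A = Q1 @ np.diag(np.logspace(0, -10, n)) @ Q2.T
x_true = np.random.randn(n)
b = A @ x_true

err_solve = np.linalg.norm(np.linalg.solve(A, b) - x_true)   # factorization-based solve
err_inv = np.linalg.norm(np.linalg.inv(A) @ b - x_true)      # explicit inverse: more work, no gain
print(err_solve, err_inv)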
Span, rank, and dimension: how many directions matter
- Span: all combinations of some vectors. The vibes those vectors can collectively create.
- Rank: number of independent columns. How many genuine directions live in your data.
- Full rank means columns are independent; low rank means redundancy.
ML link:
- If X has low rank, your features are repetitive. PCA and SVD feast on this.
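A sketch of what low rank looks like in the wild: build 20 features out of only 3 latent directions (all made-up numbers) and ask NumPy for the rank:
n, d, true_rank = 500, 20, 3
# 20 observed features that are all mixtures of just 3 latent directions
latent = np.random.randn(n, true_rank)
mixing = np.random.randn(true_rank, d)
X_low = latent @ mixing

print(np.linalg.matrix_rank(X_low))                                  # 3: only 3 genuine directions
print(np.linalg.matrix_rank(X_low + 1e-3 * np.random.randn(n, d)))   # noise restores full rank: 20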
Orthogonality and projection: the geometry of least squares
Two vectors are orthogonal if x · y = 0. Orthonormal bases are the comfy sweatpants of linear algebra: everything is easy.
Projection of y onto subspace spanned by columns of A solves the least-squares problem:
- Minimize ||Ax − y||_2
- Normal equations: A^T A x = A^T y
- Better numerics with QR or SVD:
# Least squares without normal equations
A = np.random.randn(200, 5)   # tall design matrix: n samples, d features
y = np.random.randn(200)      # targets
x_hat, residuals, rank, s = np.linalg.lstsq(A, y, rcond=None)
Interpretation: you cannot fit y exactly, so you drop the perpendicular from y onto the column space of A. That foot of the perpendicular is Ax_hat.
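Picking up A, y, and x_hat from the snippet above, a quick check that the residual really is perpendicular to A's column space:
y_hat = A @ x_hat                                  # the projection: foot of the perpendicular
residual = y - y_hat                               # the part of y the columns of A cannot reach
print(np.allclose(A.T @ residual, 0, atol=1e-8))   # orthogonal to every column of A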
Eigenvalues, eigenvectors, and the SVD: the hype trio
Eigenvectors v of A satisfy A v = λ v
- Directions A only scales, not rotates
- For symmetric matrices (hello, covariance), eigenvectors are orthogonal
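For the symmetric case, np.linalg.eigh is the tool; a sketch on a made-up covariance matrix, checking that the eigenvectors come back orthonormal:
samples = np.random.randn(1000, 5)
C = np.cov(samples, rowvar=False)            # 5x5 symmetric covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)         # eigh is built for symmetric matrices
print(eigvals)                               # real, sorted ascending
print(np.allclose(eigvecs.T @ eigvecs, np.eye(5)))                  # orthonormal eigenvectors
print(np.allclose(C @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))   # A v = lambda v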
SVD A = U Σ V^T
- U: orthonormal left singular vectors (output directions)
- Σ: nonnegative singular values (strengths)
- V: orthonormal right singular vectors (input directions)
Why you care:
- PCA is literally the SVD of centered data X: principal components are columns of V, variances are Σ^2 / (n − 1)
- Low-rank approximation: keep top k singular values; compress and denoise
# PCA via SVD
k = 2                                               # number of principal components to keep
Xc = X - X.mean(axis=0, keepdims=True)              # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T[:, :k]                                # projected data in k-dim PC space
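The same factorization hands you the low-rank approximation from the bullet above. A short sketch, reusing Xc, U, S, Vt, and k from the PCA snippet:
# Keep only the top k singular values/vectors: best rank-k approximation of Xc (Eckart-Young)
Xc_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
rel_err = np.linalg.norm(Xc - Xc_k) / np.linalg.norm(Xc)
print(rel_err)   # fraction of the centered data (in Frobenius norm) the top k directions miss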
If you can whisper SVD in your sleep, you can explain 80 percent of classical ML.
Tensors, broadcasting, and einsum: vectorize your life
- Broadcasting lets you add shapes like (n, d) + (1, d). NumPy stretches the singleton dimension.
- Einstein summation compresses complicated sums into a clean spec.
# Batch dot products with einsum: for X (n,d) and W (d,m)
Y = np.einsum('nd,dm->nm', X, W)
# Cosine similarities of all pairs in X
Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)
S = np.einsum('nd,md->nm', Xn, Xn)
Think of broadcasting like type hints from earlier: it is shape arithmetic. If the implicit stretches are unclear, make them explicit with keepdims.
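A small sketch of that shape arithmetic: standardize each column of a made-up dataset with broadcasting, keeping the reduced axis explicit via keepdims:
X = 5.0 + 3.0 * np.random.randn(100, 4)   # made-up dataset, shape (n, d)

mu = X.mean(axis=0, keepdims=True)        # shape (1, d): per-feature mean
sigma = X.std(axis=0, keepdims=True)      # shape (1, d): per-feature std
X_std = (X - mu) / (sigma + 1e-8)         # (n, d) minus (1, d): broadcasting stretches the 1

print(X_std.mean(axis=0).round(6))        # approximately all zeros
print(X_std.std(axis=0).round(6))         # approximately all ones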
Numerical sanity: conditioning beats bravado
- Condition number κ(A) measures sensitivity. Large κ means small data wiggles cause big solution wiggles.
- Standardize features before fitting. Centering helps PCA and SVD.
- Prefer factorizations over inverses: Cholesky for SPD matrices (e.g., A^T A), QR for least squares, SVD for rank-deficient cases.
# Cholesky solve for symmetric positive definite matrices
G = A.T @ A + 1e-6 * np.eye(A.shape[1])   # small ridge keeps G comfortably positive definite
r = A.T @ y
L = np.linalg.cholesky(G)                 # G = L @ L.T
z = np.linalg.solve(L, r)                 # forward solve: L z = r
x = np.linalg.solve(L.T, z)               # back solve: L.T x = z
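A quick sketch of why standardization helps conditioning, using two made-up features on wildly different scales:
# Two features on wildly different scales, e.g. meters next to millimeters
X = np.column_stack([np.random.randn(500), 1000.0 * np.random.randn(500)])

G_raw = X.T @ X
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
G_std = X_std.T @ X_std

print(np.linalg.cond(G_raw))   # roughly 1e6: the scale gap, squared
print(np.linalg.cond(G_std))   # close to 1 after standardization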
Sparse and structured matrices: speed without losing soul
Many ML matrices are sparse: graphs, bag-of-words, one-hot features. Use sparse ops.
import scipy.sparse as sp
A = sp.random(10000, 10000, density=1e-4, format='csr')
x = np.random.randn(10000)
y = A @ x # fast sparse matvec
Structure matters: Toeplitz and convolutional matrices become fast FFTs; diagonal and block-diagonal matrices make parallelization easy.
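As a taste of the speed that structure buys, here is a sketch for the circulant case (the wrap-around cousin of Toeplitz): multiplying by a circulant matrix is just an FFT, a pointwise product, and an inverse FFT:
from scipy.linalg import circulant

c = np.random.randn(256)                   # first column defines the circulant matrix
x = np.random.randn(256)

y_dense = circulant(c) @ x                                 # O(n^2): build the matrix, multiply
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real   # O(n log n): FFT, multiply, inverse FFT

print(np.allclose(y_dense, y_fft))         # True: same result, much cheaper for large n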
Quick map: operation, code, meaning
| Concept | Code | ML meaning |
|---|---|---|
| Dot product | x @ y | similarity, projection length |
| L2 norm | np.linalg.norm(x) | feature scaling, regularization |
| Matrix-vector | A @ x | linear layer, feature mixing |
| Least squares | np.linalg.lstsq(A, y) | regression fit |
| PCA via SVD | np.linalg.svd(Xc) | dimensionality reduction |
| Pseudoinverse | np.linalg.pinv(A) | minimum-norm solution |
| Projection | A @ np.linalg.lstsq(A, y)[0] | data onto subspace |
From Python habits to math habits
- Virtual envs isolate dependencies; subspaces isolate directions. Keep your solution inside the right subspace.
- Packaging taught you modularity; build models as compositions of linear maps with clean interfaces.
- Type hints saved you from runtime pain; shape-checking saves you from math pain. Annotate mentally: R^{n×d} @ R^{d×m} → R^{n×m}.
Closing: the elevator pitch your future self needs
Linear algebra is not just about moving numbers; it is about moving spaces. Vectors are states, matrices are actions, and norms are the metrics for whether you made things better or worse. Dot products measure alignment, projections reconcile reality with your model, and the SVD reveals the underlying choreography of your data.
Key takeaways:
- Think in shapes; let them be your type hints.
- Solve, do not invert; factorization is your friend.
- SVD and PCA are the same sitcom with different lighting; learn one, understand both.
- Orthogonality makes life easy; seek orthonormal bases when you can.
- Numerical stability is a feature, not a vibe. Standardize, regularize, and condition-check.
Walk-away insight: when a model works, it is often because you aligned your data with the right directions and compressed away the noise. That is linear algebra’s love language.
Next up: vector calculus and gradients — the part where these linear moves assemble into nonlinear magic and backprop quietly applies the chain rule like a legend.