Data Handling with NumPy and Pandas
Manipulate arrays and tabular data efficiently using NumPy, Pandas, and basic visualization.
Vectorization Patterns
Vectorization Patterns — Fast, Fancy, and Slightly Theatrical
"If your Python code has a loop over array elements, somewhere a NumPy array just sighed." — Probably Me
Opening: a tiny existential question
You already know what a NumPy array is and how broadcasting shimmies shapes together (we covered NumPy Arrays and Broadcasting Rules). You also read the math textbook of your nightmares — linear algebra, probability, calculus — so the language of vectors and matrices is familiar. Great. Now we learn how to stop treating arrays like lists with fancy packaging and start treating them like the optimized numerical beasts they are.
Why this matters: Vectorization is the difference between code that finishes in seconds and code that leaves you time to go outside, or at least make a second coffee. For AI and ML pipelines, this is the difference between prototyping and production.
Main Content
What is vectorization, actually?
- Vectorization = expressing operations on whole arrays (vectors/matrices/tensors) at once, instead of element-by-element in Python loops.
- This uses low-level, compiled code (C/Fortran/SIMD) under the hood (NumPy ufuncs, BLAS/LAPACK), so it’s way faster.
Think of loops as walking through a crowd handing out flyers one-by-one. Vectorization is hiring a drone that drops a bundle of flyers across the crowd in a single pass.
Core vectorization patterns (with tiny recipes)
- Elementwise arithmetic (the bread-and-butter)
import numpy as np
x = np.arange(1_000_000, dtype=np.float64)
# not: [x[i]*2 for i in range(len(x))]
y = 2 * x + 3 # vectorized, uses ufuncs, super fast
Key: use ufuncs (+, -, *, /, **, np.log, np.exp, etc.). Prefer out= when chaining to avoid temporaries:
np.multiply(x, 2, out=x) # in-place, be careful!
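As a minimal sketch of the out= pattern (the variable names are illustrative), here is y = 2*x + 3 computed without allocating an intermediate temporary for the product:

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)
y = np.empty_like(x)

np.multiply(x, 2, out=y)  # y <- 2 * x, written directly into y
np.add(y, 3, out=y)       # y <- y + 3, reusing the same buffer
```

The naive `y = 2 * x + 3` allocates a temporary for `2 * x` before adding 3; the two-step out= form reuses one preallocated buffer instead.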
- Broadcasting choreography (you already know rules)
Broadcasting lets a (n,1) array behave like (n,m) for operations. Use it for adding biases, scaling columns, etc.
X = np.random.randn(1000, 50) # data
bias = np.random.randn(50) # shape (50,)
X_plus_bias = X + bias # broadcasts bias across rows
- Reductions and axis-aware ops
Use np.sum, np.mean, np.max, np.std with axis= to collapse dimensions efficiently.
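A quick sketch of axis-aware reductions (array shapes chosen for illustration):

```python
import numpy as np

X = np.random.randn(1000, 50)

col_means = X.mean(axis=0)  # shape (50,): collapse rows, one mean per column
row_maxes = X.max(axis=1)   # shape (1000,): collapse columns, one max per row
total = X.sum()             # no axis: reduce over all elements to a scalar
```

Remember: the axis you pass is the axis that disappears from the result.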
- Masking and boolean indexing
mask = X[:, 0] > 0
X_pos = X[mask] # selects rows where first column > 0
Use np.where for vectorized conditional choices:
z = np.where(X[:, 0] > 0, X[:, 1], 0.0)
- Linear algebra and contractions: matmul, tensordot, einsum
For ML math (recall Math for ML): matrix multiply and tensor contractions are vectorized core operations.
A = np.random.randn(512, 256)
B = np.random.randn(256, 128)
C = A @ B # uses BLAS
# or complex contraction
D = np.einsum('ij,jk->ik', A, B) # same as A @ B; powerful and readable once you learn it
- Fancy indexing and grouping (Pandas-style patterns)
Pandas offers vectorized group transforms via groupby().transform() and merge() instead of Python loops.
import pandas as pd
df = pd.DataFrame({'id': [1,1,2,2], 'x':[10,20,5,7]})
df['x_centered'] = df['x'] - df.groupby('id')['x'].transform('mean')
When not to use np.vectorize
np.vectorize is syntactic sugar — it wraps Python loops. It makes code look vectorized but is not faster. Use numba.njit or write a ufunc in C if you need speed for custom functions.
Pro tip: If your custom operation can't be expressed with ufuncs/einsum/matrix ops, try numba. If numba isn't feasible, accept the loop.
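To see that np.vectorize is convenience rather than speed, here is a small sketch (the scalar function is made up for illustration) comparing it against a true ufunc-based equivalent:

```python
import numpy as np

def clip_shift(v):
    # A plain scalar Python function: shift positives up, zero out the rest
    return v + 1.0 if v > 0 else 0.0

vec_f = np.vectorize(clip_shift)  # still calls clip_shift once per element

x = np.linspace(-1, 1, 10)
slow = vec_f(x)                       # convenient API, Python-loop speed
fast = np.where(x > 0, x + 1.0, 0.0)  # genuinely vectorized equivalent
```

Both produce the same values; only the second avoids per-element Python calls.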
Performance patterns & pitfalls (because nuance matters)
| Pattern | Speed | Memory notes | When to use |
|---|---|---|---|
| Python loop | Slow | Low mem if streaming | Tiny arrays or complex control flow |
| NumPy ufuncs (+ broadcasting) | Fast | Low temporaries if using out= | Default for numeric math |
| np.einsum / matmul | Very fast (BLAS) | May require contiguity | Linear algebra, tensor contractions |
| np.vectorize | Same as loop | Same | Only for convenience — not perf |
| numba / Cython | Fast | Low | Custom kernels; highest effort |
| Pandas vectorized methods | Fast-ish | Index alignment overhead | Tabular ops, grouping, string/datetime ops |
Common gotchas:
- Temporary arrays: chained operations like a = (X * 2) + (Y * 3) allocate temporaries. Use out= (e.g. np.multiply and np.add with out) to reduce allocations.
- Contiguity & strides: non-contiguous arrays are slower. Use np.ascontiguousarray() for critical kernels.
- dtype promotion: mixing ints and floats can cause implicit casts and copies.
- Views vs copies: boolean indexing returns a copy; modifying it won't change the original. In Pandas, watch for SettingWithCopyWarning.
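The views-vs-copies gotcha above is easy to demonstrate in a few lines (a minimal sketch):

```python
import numpy as np

X = np.arange(6).reshape(2, 3)

sliced = X[0]       # basic slicing returns a VIEW...
sliced[0] = 99      # ...so this write shows up in X

masked = X[X > 3]   # boolean indexing returns a COPY...
masked[:] = -1      # ...so X is untouched by this write
```

If you need the write-through behavior with a mask, assign through the index directly: `X[X > 3] = -1`.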
Pandas-specific vectorization patterns
- Use .to_numpy() or .values to drop to NumPy when doing heavy numeric work (faster, less overhead).
- Use .assign() and transform() to keep operations chainable and efficient.
- For group-wise operations, prefer groupby().transform() over Python loops.
- Use categorical dtypes for repeated string/label columns to speed up groupby/joins and reduce memory.
- Use the .str and .dt accessors for vectorized string/datetime ops (.dt is backed by fast datetime internals; .str spares you manual loops, though it still iterates per element for object-dtype columns).
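A small illustration of two of these patterns together, dropping to NumPy for the arithmetic and using a categorical label column (the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'label': ['a', 'b', 'a', 'b', 'a'],
    'value': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Categorical dtype: cheaper groupby/joins, less memory for repeated labels
df['label'] = df['label'].astype('category')

# Drop to NumPy for the heavy numeric work, then put the result back
v = df['value'].to_numpy()
df['scaled'] = (v - v.mean()) / v.std()
```

For a handful of rows this changes nothing; at millions of rows, both moves pay off.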
Example: vectorized feature creation
df['hour'] = pd.to_datetime(df['ts']).dt.hour # fast vectorized extraction
df['is_high'] = np.where(df['value'] > df['value'].quantile(0.9), 1, 0)
A small, realistic example: batch-normalize rows
We want to row-normalize a 2D batch matrix X so each row has mean 0 and std 1 (no loops):
X = np.random.randn(1024, 512)
row_mean = X.mean(axis=1, keepdims=True)
row_std = X.std(axis=1, keepdims=True)
X_norm = (X - row_mean) / (row_std + 1e-8)
No loops. No drama. Broadcasting does the heavy lifting: subtracts each row's mean from its elements.
Closing: key takeaways & challenge
- Vectorize early, loop rarely. Use ufuncs, broadcasting, einsum, and BLAS-backed matmul.
- Watch memory. Temporaries, dtype casts, and non-contiguous arrays can kill performance.
- Pandas = vectorized tabular ops. Use groupby.transform, categorical dtypes, and .to_numpy when you need raw speed.
- If you must custom compute, prefer numba over np.vectorize. np.vectorize is a lie that looks pretty.
Final brain-tickle: imagine your ML model as a factory. Vectorization is switching from hand-assembling parts to conveyor belts and robots. More throughput, fewer typos, and you finally get time to refactor that other code that’s been haunting you.
Challenge (do it in one hour): take a small ML preprocessing script that loops over rows and convert it to a vectorized NumPy/Pandas version. Time it before and after. Post the results and maybe a screenshot of your surprised face when it finishes 10x faster.
"The best vectorized code is like good lighting in a movie: you don't notice it, you just feel the difference." — Your future faster codebase