Data Wrangling with NumPy and Pandas
Transform raw data into analysis-ready datasets using vectorized operations and powerful tabular transformations with NumPy and Pandas.
NumPy Arrays and Vectorization: Turning Raw Rows Into Rocket Fuel
You pulled millions of rows out of a warehouse like a data heist. Now what? You transform them. Fast. Precisely. With NumPy. Welcome to the T in ETL that your CPU will actually respect.
In the last module, you got data out of databases and lakes (ELT/ETL, warehouses vs. lakes, and the whole ORMs-and-DB-APIs circus). Now we’re inside Python, where latency is a feeling and loops are a trap. NumPy arrays are the foundation of fast numeric computing, and vectorization is the art of turning your operations from one-sad-row-at-a-time to all-rows-at-once. Pandas rides on this. SciPy rides on this. Your future machine learning models kneel before this.
What Is a NumPy Array (and Why Should You Care)?
- A NumPy array is a homogeneously typed, multi-dimensional container of numbers, laid out in contiguous memory.
- Translation: it’s like a spreadsheet tab that your CPU can devour in one crunchy bite instead of nibbling cell by cell.
- This lets NumPy use compiled, vectorized code under the hood (think C loops that run like the wind), while you write high-level Python that reads like poetry.
The Big Idea
- Python lists = flexible, but slow for math.
- NumPy arrays = rigid (one dtype), but extremely fast.
- Vectorization = write operations that act on entire arrays without explicit Python loops.
import numpy as np
# 1D and 2D arrays with explicit dtypes
a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]], dtype=np.float64)
print(a.ndim, a.shape, a.dtype) # 1 (3,) int32
print(b.ndim, b.shape, b.dtype) # 2 (2, 3) float64
Lists vs Arrays (The Vibe Check)
| Feature | Python List | NumPy Array |
|---|---|---|
| Type uniformity | No | Yes (single dtype) |
| Memory layout | Dispersed references | Contiguous/strided |
| Speed for math | Slow | Fast (vectorized C under the hood) |
| Broadcasting | No | Yes |
| Best use case | Mixed objects | Numeric data wrangling |
TL;DR: Lists are for groceries. Arrays are for math.
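You can see the memory difference yourself. A minimal sketch (exact byte counts vary by platform and Python version; the sizes here are illustrative):
import sys
import numpy as np

nums = list(range(1_000))
arr = np.arange(1_000, dtype=np.int64)

# The list holds pointers to boxed Python ints; the array holds raw 8-byte ints.
list_bytes = sys.getsizeof(nums) + sum(sys.getsizeof(n) for n in nums)
print(list_bytes)   # tens of kilobytes, platform-dependent
print(arr.nbytes)   # exactly 8000 bytes of payload
That contiguous, unboxed layout is exactly what lets the CPU chew through an array in one pass.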
Vectorization: Math Without the For-Loop Hangover
Imagine you pulled a sales table from your data warehouse via SQL. You’ve got a column of prices and a column of quantities. You want revenue, then apply a discount, then sales tax. You could loop. Or you could vectorize like a legend.
prices = np.array([12.99, 5.49, 3.99, 100.00])
qty = np.array([ 10, 3, 5, 1])
revenue = prices * qty # elementwise multiply
discount_rate = 0.10
sales_tax = 0.075
net = revenue * (1 - discount_rate) * (1 + sales_tax)
No loops. No sadness. Just results.
Pro move: In production pipelines (ELT), you often land raw data in a lake/warehouse, then transform inside Python for modeling or feature engineering. Vectorization is the difference between minutes and hours.
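To make that gap concrete, here is a minimal timing sketch (array sizes and timings are illustrative, not a rigorous benchmark; numbers vary by machine):
import time
import numpy as np

rng = np.random.default_rng(0)
prices = rng.uniform(1, 100, size=1_000_000)
qty = rng.integers(1, 20, size=1_000_000)

t0 = time.perf_counter()
revenue_loop = [p * q for p, q in zip(prices, qty)]  # one sad row at a time
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
revenue_vec = prices * qty                           # all rows at once
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.4f}s")
On a typical machine the vectorized line wins by one to two orders of magnitude, which is exactly the minutes-vs-hours difference at pipeline scale.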
Broadcasting: Arrays That Get Along Even When Shapes Don’t
Broadcasting lets NumPy automatically expand shapes to make elementwise operations possible. Rules (simplified):
- Compare dimensions from right to left.
- Dimensions match if they are equal or one of them is 1.
- If not matchable, NumPy throws an error faster than you can say "ValueError: operands could not be broadcast together".
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])                # shape (3, 3)
col_bias = np.array([0.1, 0.2, 0.3]) # shape (3,)
row_bias = np.array([[10.0], [20.0], [30.0]]) # shape (3, 1)
Y = X + col_bias + row_bias # (3,3) + (3,) + (3,1) -> (3,3)
Why do people keep misunderstanding this? Because they don’t check shapes. Print shapes like a detective.
print(X.shape, col_bias.shape, row_bias.shape)
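For example, adding a per-row vector of shape (3,) to a (3, 4) matrix fails, because broadcasting aligns shapes from the right. A minimal sketch of the error and the fix:
import numpy as np

X = np.ones((3, 4))
v = np.array([10., 20., 30.])    # one value per row, shape (3,)

# X + v raises ValueError: the last axes compare 4 vs 3.
# Fix: give v a trailing axis so shapes align as (3, 1) vs (3, 4).
per_row = X + v[:, np.newaxis]
print(per_row.shape)             # (3, 4)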
UFuncs, Aggregations, and the Axis Parameter
NumPy’s ufuncs (universal functions) do fast, elementwise operations in C. Examples: np.add, np.sqrt, np.exp.
x = np.array([1., 4., 9.])
root = np.sqrt(x) # array([1., 2., 3.])
Aggregations collapse dimensions: np.sum, np.mean, np.max, etc. The axis argument defines which direction you squeeze.
- axis=0 = down the rows, per column
- axis=1 = across columns, per row
A = np.arange(12, dtype=float).reshape(3, 4) # shape (3,4)
col_means = A.mean(axis=0) # shape (4,)
row_sums = A.sum(axis=1) # shape (3,)
Example: Z-Score Normalization per Column
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])                # shape (3,3)
mu = X.mean(axis=0, keepdims=True) # shape (1,3)
sig = X.std(axis=0, ddof=0, keepdims=True)
Z = (X - mu) / sig # broadcasting does the magic
Keep keepdims=True if you want the result to be broadcastable without reshaping. It’s a vibe and a safety feature.
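A quick shape check of what keepdims actually changes, as a minimal sketch:
import numpy as np

X = np.arange(9, dtype=float).reshape(3, 3)
print(X.mean(axis=0).shape)                 # (3,)   -- axis dropped
print(X.mean(axis=0, keepdims=True).shape)  # (1, 3) -- axis kept, broadcast-ready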
Boolean Masking and Fancy Indexing
Filter first, ask questions never.
revenue = np.array([129.9, 16.47, 19.95, 100.0, 4.99])
high = revenue > 20
print(high) # array([ True, False, False, True, False])
print(revenue[high]) # array([129.9, 100. ])
- Boolean masks produce copies.
- Fancy indexing with integer arrays also produces copies.
- Plain slices (like a[1:4]) produce views.
a = np.arange(10)
view = a[2:6]
view[:] = -1
print(a) # changes reflected: [0 1 -1 -1 -1 -1 6 7 8 9]
mask = a < 0
subset = a[mask] # copy
subset[:] = 999
print(a) # still has negatives, original unaffected by changing subset
Memorize this: slicing = view, masking/fancy indexing = copy.
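When in doubt, np.shares_memory settles it. A minimal sketch:
import numpy as np

a = np.arange(10)
print(np.shares_memory(a, a[2:6]))     # True  -- slicing gives a view
print(np.shares_memory(a, a[a < 5]))   # False -- masking gives a copy
print(np.shares_memory(a, a[[0, 2]]))  # False -- fancy indexing gives a copy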
Dtypes, Casting, and the NaN Dragon
- Choose dtypes intentionally: int32, float64, bool, datetime64[ns] (yes, date magic is real).
- Mixed operations upcast: int + float -> float.
- NaN spreads like rumors; use np.nanmean, np.nanstd, etc.
arr = np.array([1, 2, 3], dtype=np.int32)
arr = arr.astype(np.float64) # explicit upgrade
x = np.array([1.0, np.nan, 3.0])
print(np.nanmean(x)) # ignores NaN
In analytics from lakes/warehouses, missing values happen. Treat them on purpose, not by accident.
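One deliberate treatment is filling NaNs with a column statistic. A minimal sketch using np.isnan, np.nanmean, and np.where:
import numpy as np

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

col_means = np.nanmean(X, axis=0, keepdims=True)  # NaN-aware means, shape (1, 2)
X_filled = np.where(np.isnan(X), col_means, X)    # swap NaNs for the column mean
print(X_filled)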
Performance Notes You’ll Wish You Knew Earlier
- Avoid Python loops. Each iteration adds overhead like a toll booth.
- Use in-place ops when safe: a *= 1.1 instead of a = a * 1.1 to reduce temp arrays.
- Beware of repeated broadcasting creating big temporaries. Sometimes compute in steps or use np.add(a, b, out=a).
- Use the new np.random.default_rng() for fast, reproducible randomness.
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 100))
# Compare mental models
# BAD (Python loop):
# for i in range(X.shape[0]): X[i] = (X[i] - X[i].mean()) / X[i].std()
# GOOD (vectorized across axis):
mu = X.mean(axis=1, keepdims=True)
sd = X.std(axis=1, keepdims=True)
Z = (X - mu) / (sd + 1e-9)
If you see np.vectorize, know it’s a convenience wrapper, not true speed. It’s a fancy for-loop with lipstick.
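You can verify that claim yourself. A minimal timing sketch (timings are illustrative and machine-dependent) comparing the wrapper against the real ufunc:
import time
import numpy as np

x = np.linspace(0.0, 10.0, 1_000_000)
wrapped = np.vectorize(lambda v: v ** 0.5)  # still calls Python once per element

t0 = time.perf_counter()
y_wrapped = wrapped(x)
t_wrapped = time.perf_counter() - t0

t0 = time.perf_counter()
y_ufunc = np.sqrt(x)                        # true C-level ufunc
t_ufunc = time.perf_counter() - t0

print(f"np.vectorize: {t_wrapped:.3f}s  np.sqrt: {t_ufunc:.4f}s")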
Memory Layout, Reshape, and When Copies Happen
- Arrays can be C-order (row-major) or F-order (column-major). Most are C-order by default.
- reshape returns a view when possible (same data, new shape). If impossible, a copy is made.
- ravel prefers a view; flatten always copies.
A = np.arange(12).reshape(3,4)
B = A.T # transpose (view with different strides)
print(A.flags['C_CONTIGUOUS'], B.flags['C_CONTIGUOUS']) # True, False
C = A.ravel() # likely a view
D = A.flatten() # guaranteed copy
You don’t need to master strides on day one, but knowing views vs. copies will save both memory and dignity.
From SQL Rows to NumPy Arrays to Pandas (The Interop Reality)
You used Python DB APIs/ORMs to extract data. Now, clean/transform with NumPy; then hand it to Pandas or scikit-learn.
import pandas as pd
# Imagine df came from read_sql or a warehouse extract
df = pd.DataFrame({
'price': [12.99, 5.49, 3.99, 100.0],
'qty': [10, 3, 5, 1],
})
values = df.to_numpy() # preferred over .values for clarity
mu = values.mean(axis=0, keepdims=True)
std = values.std(axis=0, keepdims=True)
standardized = (values - mu) / (std + 1e-9)
df_std = pd.DataFrame(standardized, columns=df.columns)
Pandas uses NumPy under the hood. Master arrays and vectorization, and Pandas stops feeling like wizardry and starts feeling like a well-behaved intern.
Common Pitfalls (A Short Roast)
- Using Python’s sum on arrays. Use np.sum (faster, respects axis and dtype).
- Shape mismatches in broadcasting. Print .shape before vibes.
- Integer division surprises. In Python 3, / gives float, // floors.
- In-place ops that can’t cast: int_array *= 1.1 raises a casting error, because the float result can’t be safely cast back to int. Convert first (see the sketch below).
- Assuming mask edits edit the original. Masked/fancy indexing returns copies.
- Forgetting NaNs. They will sabotage your means until you use np.nanmean.
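The in-place casting pitfall from the list above, in a minimal sketch (on recent NumPy versions this raises a TypeError subclass):
import numpy as np

counts = np.array([1, 2, 3], dtype=np.int64)
try:
    counts *= 1.1   # float64 result can't cast back to int64 under 'same_kind'
except TypeError as e:
    print(f"refused: {e}")

counts = counts.astype(np.float64)  # convert first, then scale in place
counts *= 1.1
print(counts)                       # [1.1 2.2 3.3]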
Quick Mental Models
- Think in blocks, not cells.
- Align shapes, then let broadcasting do the heavy lifting.
- Reduce along axes to summarize, expand dims (or keepdims) to align for transforms.
- Prefer ufuncs and aggregations; avoid custom Python loops for numeric ops.
Mini-Challenges (Try These!)
- You have temps = rng.normal(20, 5, size=(365, 5)) representing 5 cities. Compute per-city z-scores without loops.
- Given sales of shape (n_days, n_products) and discounts of shape (n_products,), apply per-product discounts via broadcasting.
- Replace negative values in an array with the column mean using np.where and keepdims=True.
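Hedged sketches for the first two challenges (reusing the rng from the default_rng(42) example above; the shapes and discount values are illustrative):
# 1. Per-city z-scores: reduce over the day axis, keepdims for clean broadcasting
temps = rng.normal(20, 5, size=(365, 5))
z = (temps - temps.mean(axis=0, keepdims=True)) / temps.std(axis=0, keepdims=True)

# 2. Per-product discounts: (n_days, n_products) vs (n_products,) aligns on the last axis
sales = rng.uniform(10, 100, size=(30, 4))
discounts = np.array([0.10, 0.00, 0.25, 0.05])
discounted = sales * (1 - discounts)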
# 3. Replace negatives with column mean
A = rng.normal(size=(4, 3))
col_mean = A.mean(axis=0, keepdims=True)
A_clean = np.where(A < 0, col_mean, A)
Wrap-Up: The TL;DR You Can Tape to Your Monitor
- NumPy arrays are typed, contiguous, and fast. They’re the engine under Pandas and ML libraries.
- Vectorization replaces slow Python loops with fast C-accelerated array ops.
- Broadcasting + axis-aware reductions = elegant, scalable transformations.
- Views vs. copies matter for performance and correctness.
- Handle dtypes and NaNs on purpose.
Big insight: Your warehouse got you the data. NumPy gets you the transformation. Pandas will make it pretty. Models will make it predictive. But vectorization is what makes it possible at scale.
Next up: Pandas DataFrames, where we’ll take these array superpowers and wield them across labeled columns like benevolent spreadsheet warlocks.