Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Content
Indexing and Selection
Versions:
Watch & Learn
AI-discovered learning video
Sign in to watch the learning video for this topic.
pandas Indexing and Selection — Stop Staring, Start Slicing
You already know Series and DataFrames, and you’ve flirted with NumPy’s broadcasting magic. Now it’s time to master the manners of pandas: how to ask for exactly the rows and columns you want without accidentally cloning the Titanic or creating a copy-on-write hydra.
“Indexing is where pandas turns from ‘magical spreadsheet’ into ‘surgical scalpel.’”
Why indexing matters (and why your code will thank you)
Indexing and selection let you: filter rows, pick columns, slice by labels or positions, and do fast boolean or fancy indexing — all while keeping performance and clarity. Good indexing reduces bugs, makes code readable, and prevents the heartbreak of the infamous SettingWithCopyWarning.
This lesson assumes you know: Series/DataFrame basics and NumPy arrays (vectorized ops, broadcasting). We'll build on NumPy ideas (position-based thinking) and DataFrame ideas (labels, dtypes, alignment).
The Big Four: .loc, .iloc, [], and boolean indexing
1) .loc — label-based, inclusive, human-friendly
Use .loc when you want rows/columns by labels (index names or column names).
# Assume df has index ['a','b','c'] and columns ['x','y','z']
df.loc['a', 'x'] # scalar by label
df.loc['a':'c', ['x','z']] # slices are inclusive of the end
Micro explanation: .loc slices include the stop label (unlike standard Python slices). Great when working with dates or named indices.
2) .iloc — position-based, zero-indexed, NumPy-esque
Use .iloc for integer positions (like NumPy indexing):
df.iloc[0, 2] # first row, third column
df.iloc[0:3, 0:2] # half-open slices like NumPy
Think: .iloc behaves like NumPy arrays — perfect when you’re translating NumPy logic to pandas.
3) [] operator — shorthand, but ambiguous
- df['col'] → returns a Series (one column)
- df[['col1','col2']] → DataFrame (multiple columns)
- df[mask] → row selection using boolean mask
Avoid using df[row_slice] for complex selections — prefer .loc/.iloc for clarity.
4) Boolean indexing — filter like a pro
mask = df['age'] > 30
df[mask]
# or directly
df[df['salary'] > 50000]
Combine masks with & and | (remember parentheses):
df[(df['age'] > 30) & (df['dept'] == 'sales')]
Tip: Use .query() for cleaner boolean expressions: df.query('age > 30 and dept == "sales"')
Fast lookups: .at and .iat
- .at[label_row, label_col] — fastest label-based scalar access
- .iat[int_row, int_col] — fastest position-based scalar access
Use these inside loops only if you must; vectorized operations are still preferable.
Fancy indexing and alignment
You can pass lists or arrays to .loc/.iloc for fancy selection:
cols = ['x','z']
rows = ['a','c']
df.loc[rows, cols]
Important: when you assign using fancy indexing you may get copies — see SettingWithCopyWarning. Use .loc for assignment when you want to guarantee modifying the original:
df.loc[df['flag']==True, 'status'] = 'ok'
Boolean indexing with NumPy speed tricks
If you have a NumPy boolean array (from vectorized operations), you can use it directly with .iloc when it’s position-aligned, or use methods like df.values when appropriate. But be careful: mixing raw arrays with label indexing can misalign.
Example building on NumPy broadcasting:
# Using NumPy to compute a mask quickly
import numpy as np
vals = df['value'].to_numpy()
mask = (vals > np.mean(vals)) & (vals < np.percentile(vals, 90))
df.loc[mask] # works if mask aligns with df index
Slicing datetimes and index tricks
If your index is a DateTimeIndex, .loc shines:
df.loc['2020-01-01':'2020-03-31'] # inclusive slice of dates
You can also use partial string selection: df.loc['2020'] returns all rows in 2020.
MultiIndex (hierarchical) quick-start
MultiIndex lets you index on multiple levels. Example:
# Suppose df.index is MultiIndex: (country, city)
df.loc[('USA', 'New York'), :] # select a leaf
# or partial:
df.loc['USA'] # all cities in USA
Use pd.IndexSlice for complex slices across levels.
SettingWithCopyWarning — the drama you can prevent
If pandas isn’t sure whether your operation modifies the original or a copy, it raises SettingWithCopyWarning. Common problematic pattern:
subset = df[df['x'] > 0]
subset['y'] = 2 # might warn — subset could be a copy
Fix it using .loc on the original DataFrame:
df.loc[df['x'] > 0, 'y'] = 2
Rule of thumb: assign into the original using .loc to avoid ambiguity.
Extra tools: reindexing, .where, and .query
- df.reindex([...]) — conform to a new index (fills with NaN where missing)
- df.where(cond, other=...) — keep values where cond True, otherwise replace
- df.query('expr') — readable boolean filters, often faster for complex queries
Example of .where:
df['val'] = df['val'].where(df['val'] > 0, 0) # replace negatives with 0
Practical patterns (cheat-sheet)
- Select columns: df[['a','b']]
- Select rows by position: df.iloc[2:5]
- Select rows by label: df.loc['2020-01-01':'2020-01-31']
- Scalar get: df.at['row','col'] or df.iat[2,1]
- Boolean filter: df[df['score'] > 80]
- Assign safely: df.loc[mask, 'new_col'] = value
Why this is easier with NumPy knowledge
If you grok NumPy slicing and broadcasting, .iloc will feel natural. Use .iloc for zero-based numerical thinking and .loc when your labels matter. Convert to/from NumPy with .to_numpy() when you want raw speed and vectorized math — but remember indexing semantics may differ.
Key takeaways
- .loc = label-based (inclusive end), .iloc = position-based (NumPy-like).
- Use .at/.iat for fastest scalar access.
- Prefer .loc for assignments to avoid SettingWithCopyWarning.
- Boolean indexing + .query() = expressive filtering; combine with NumPy masks for speed.
- DateTimeIndex and MultiIndex are powerful once you master label vs position thinking.
“Indexing is the bridge between your question and pandas' answer. Ask precisely.”
Want a tiny challenge? Try: filter a DataFrame to rows where the 90th percentile of a numeric column (computed with NumPy) is exceeded, then set a status column for those rows — using only .loc and vectorized ops.
Go forth, slice responsibly, and remember: labels are feelings; positions are facts. Use both wisely.
Comments (0)
Please sign in to leave a comment.
No comments yet. Be the first to comment!