Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Content
Filtering and query
Versions:
Watch & Learn
AI-discovered learning video
Sign in to watch the learning video for this topic.
Filtering and query in pandas — Make your DataFrame obey
"Filtering is just selection wearing a disguise — with extra attitude." — Your future self when debugging a broken mask
You've already seen how to load data (Reading CSV and Excel) and how to point at rows and columns (Indexing and Selection). You also learned fast numerical tricks using NumPy (vectorized ops, broadcasting). Now we put those together: how to slice, dice, and surgically extract the exact rows you want using boolean masks, pandas convenience methods, and the handy .query() language.
Why this matters (and where it shows up)
- Need the Sales team only? Filter.
- Want rows where score is missing to impute later? Filter.
- Running group-level analytics on a subset? Filter.
Filtering is the gatekeeper of downstream analysis. If your filter is wrong, everything downstream is wrong — and that cup of coffee won't save you.
Quick reminder (you've seen these already)
- From Indexing and Selection: use .loc for label-based selection and .iloc for position-based selection.
- From NumPy: boolean arrays are fast; pandas uses them under the hood. Combine pandas masks with NumPy logical operations for speed when needed.
Boolean masks: the bread and butter
Basic pattern
# assume df is already loaded
mask = df['age'] > 30
df[mask] # returns rows where mask is True
# or equivalently
df.loc[mask]
- mask is a Series of True/False values (like a NumPy boolean array).
- Use .loc when you plan to select columns too: df.loc[mask, ['name','salary']]
Combine conditions
Always use parentheses and bitwise operators (& for AND, | for OR, ~ for NOT):
df[(df['age'] > 30) & (df['department'] == 'Sales')]
# or with .loc
df.loc[(df['age'] > 30) & (df['department'] == 'Sales'), 'salary']
Why not use Python and/or? Because pandas overrides bitwise operators to work elementwise; Python's and/or try to evaluate truthiness of entire Series and explode spectacularly.
Useful filtering functions and idioms
- isin: filter by membership (like SQL IN)
df[df['department'].isin(['Sales', 'HR'])]
- between: inclusive numeric range
df[df['salary'].between(50000, 80000)]
- str.contains for text columns (remember na=False)
df[df['email'].str.contains('@example.com', na=False)]
- isnull / notnull for missing values
df[df['joined'].isnull()]
- numpy logical functions (occasionally useful)
import numpy as np
mask = np.logical_and(df['age'] > 30, df['salary'] > 50000)
df.loc[mask]
This leverages NumPy's vectorized speed when you want to compose masks externally.
The .query() method — readable and fast
.query() lets you write boolean logic as a string expression. It's tidy for complex filters:
min_age = 30
df.query('age >= @min_age and department == "Sales"')
- Use @ to reference Python variables inside the query string.
- Use backticks for column names with spaces: df.query('
Total Sales> 1000') - Under the hood pandas can use the 'numexpr' engine which is faster for large DataFrames.
Limitations: .query can't call arbitrary pandas methods (like .str.contains) with the default engine. You can set engine='python' to use Python expressions, but that may be slower.
Avoid the chained indexing trap
Bad:
df[df['A'] > 0]['B'] = 5 # may raise SettingWithCopyWarning or silently fail
Good:
mask = df['A'] > 0
df.loc[mask, 'B'] = 5 # safe, does in-place assignment
Chained indexing sometimes returns a view, sometimes a copy — pandas won't tell you which. Use .loc to be explicit.
Performance tips
- Boolean indexing is vectorized and fast; avoid Python loops.
- For very large frames, .query() / eval() with numexpr can be faster.
- Convert high-cardinality repeated strings to categorical dtype to speed up comparisons and reduce memory.
- If you build masks repeatedly, compute them once: mask = (cond); df.loc[mask, cols]
Example: using NumPy speed
# when combining many conditions, construct masks with numpy
import numpy as np
m1 = df['age'].values > 30
m2 = ~pd.isnull(df['salary'].values)
mask = np.logical_and(m1, m2)
df.loc[mask]
This avoids constructing many intermediate Series objects.
Real-world mini example
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice','Bob','Carla','Dan'],
'age': [29, 42, 35, 28],
'department': ['Sales','HR','Sales','IT'],
'salary': [70000, 54000, np.nan, 48000]
})
# Who in Sales older than 30 with known salary?
mask = (df['department'] == 'Sales') & (df['age'] > 30) & df['salary'].notnull()
df.loc[mask]
# Same with query
df.query('department == "Sales" and age > 30 and salary == salary')
# salary == salary is a quick trick to test not-null in query (or use 'salary == salary')
Why people keep misunderstanding this
Because filtering looks simple until you mix up boolean operators, forget parentheses, or accidentally trigger chained indexing. Also because a Series is not a Python bool — it’s a vector of them. Treat it like the vector that it is.
Quick checklist before you run a filter
- Did I use parentheses around conditions?
- Did I use & / | / ~ instead of and / or / not?
- If assigning, am I using .loc to avoid SettingWithCopyWarning?
- Does .query() improve readability/performance for this case?
- Are text comparisons using .str methods with na=False when necessary?
Key takeaways
- Filtering = Boolean masks + selection. Build masks with Series operations or NumPy for speed.
- Use .loc for selection and assignment to be explicit and safe.
- .query() is great for readable filters and can be faster via numexpr; use @ to inject Python variables.
- Avoid chained indexing and prefer masks computed once.
Final thought: a correct filter is the single most influential line in your analysis script. Treat it like a tiny truth oracle — check it, test it, and laugh at the rows you excluded.
If you want, I can generate a small cheat-sheet image or a runnable Jupyter notebook with these examples and more corner cases (e.g., backticks, engine options, performance benchmarks). Say the word and I’ll conjure it like a debugging genie.
Comments (0)
Please sign in to leave a comment.
No comments yet. Be the first to comment!