Courses/Python for Data Science, AI & Development/Data Analysis with pandas

Data Analysis with pandas

42410 views

Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.

Content

4 of 15

Filtering and query

Pandas Filtering and Query Explained: Boolean Masks

6269 views

beginner

pandas

data-analysis

humorous

gpt-5-mini

6269 views

Versions:

Pandas Filtering and Query Explained: Boolean Masks

Watch & Learn

AI-discovered learning video

Start learning for free

Bookmark content and pick up later
AI-generated study materials
Flashcards, timelines, and more
Progress tracking and certificates

Free to join · No credit card required

Filtering and query in pandas — Make your DataFrame obey

"Filtering is just selection wearing a disguise — with extra attitude." — Your future self when debugging a broken mask

You've already seen how to load data (Reading CSV and Excel) and how to point at rows and columns (Indexing and Selection). You also learned fast numerical tricks using NumPy (vectorized ops, broadcasting). Now we put those together: how to slice, dice, and surgically extract the exact rows you want using boolean masks, pandas convenience methods, and the handy .query() language.

Why this matters (and where it shows up)

Need the Sales team only? Filter.
Want rows where score is missing to impute later? Filter.
Running group-level analytics on a subset? Filter.

Filtering is the gatekeeper of downstream analysis. If your filter is wrong, everything downstream is wrong — and that cup of coffee won't save you.

Quick reminder (you've seen these already)

From Indexing and Selection: use .loc for label-based selection and .iloc for position-based selection.
From NumPy: boolean arrays are fast; pandas uses them under the hood. Combine pandas masks with NumPy logical operations for speed when needed.

Boolean masks: the bread and butter

Basic pattern

# assume df is already loaded
mask = df['age'] > 30
df[mask]            # returns rows where mask is True
# or equivalently
df.loc[mask]

mask is a Series of True/False values (like a NumPy boolean array).
Use .loc when you plan to select columns too: df.loc[mask, ['name','salary']]

Combine conditions

Always use parentheses and bitwise operators (& for AND, | for OR, ~ for NOT):

df[(df['age'] > 30) & (df['department'] == 'Sales')]
# or with .loc
df.loc[(df['age'] > 30) & (df['department'] == 'Sales'), 'salary']

Why not use Python and/or? Because pandas overrides bitwise operators to work elementwise; Python's and/or try to evaluate truthiness of entire Series and explode spectacularly.

Useful filtering functions and idioms

isin: filter by membership (like SQL IN)

df[df['department'].isin(['Sales', 'HR'])]

between: inclusive numeric range

df[df['salary'].between(50000, 80000)]

str.contains for text columns (remember na=False)

df[df['email'].str.contains('@example.com', na=False)]

isnull / notnull for missing values

df[df['joined'].isnull()]

numpy logical functions (occasionally useful)

import numpy as np
mask = np.logical_and(df['age'] > 30, df['salary'] > 50000)
df.loc[mask]

This leverages NumPy's vectorized speed when you want to compose masks externally.

The .query() method — readable and fast

.query() lets you write boolean logic as a string expression. It's tidy for complex filters:

min_age = 30
df.query('age >= @min_age and department == "Sales"')

Use @ to reference Python variables inside the query string.
Use backticks for column names with spaces: df.query('Total Sales > 1000')
Under the hood pandas can use the 'numexpr' engine which is faster for large DataFrames.

Limitations: .query can't call arbitrary pandas methods (like .str.contains) with the default engine. You can set engine='python' to use Python expressions, but that may be slower.

Avoid the chained indexing trap

Bad:

df[df['A'] > 0]['B'] = 5   # may raise SettingWithCopyWarning or silently fail

Good:

mask = df['A'] > 0
df.loc[mask, 'B'] = 5     # safe, does in-place assignment

Chained indexing sometimes returns a view, sometimes a copy — pandas won't tell you which. Use .loc to be explicit.

Performance tips

Boolean indexing is vectorized and fast; avoid Python loops.
For very large frames, .query() / eval() with numexpr can be faster.
Convert high-cardinality repeated strings to categorical dtype to speed up comparisons and reduce memory.
If you build masks repeatedly, compute them once: mask = (cond); df.loc[mask, cols]

Example: using NumPy speed

# when combining many conditions, construct masks with numpy
import numpy as np
m1 = df['age'].values > 30
m2 = ~pd.isnull(df['salary'].values)
mask = np.logical_and(m1, m2)
df.loc[mask]

This avoids constructing many intermediate Series objects.

Real-world mini example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice','Bob','Carla','Dan'],
    'age': [29, 42, 35, 28],
    'department': ['Sales','HR','Sales','IT'],
    'salary': [70000, 54000, np.nan, 48000]
})

# Who in Sales older than 30 with known salary?
mask = (df['department'] == 'Sales') & (df['age'] > 30) & df['salary'].notnull()
df.loc[mask]

# Same with query
df.query('department == "Sales" and age > 30 and salary == salary')
# salary == salary is a quick trick to test not-null in query (or use 'salary == salary')

Why people keep misunderstanding this

Because filtering looks simple until you mix up boolean operators, forget parentheses, or accidentally trigger chained indexing. Also because a Series is not a Python bool — it’s a vector of them. Treat it like the vector that it is.

Quick checklist before you run a filter

Did I use parentheses around conditions?
Did I use & / | / ~ instead of and / or / not?
If assigning, am I using .loc to avoid SettingWithCopyWarning?
Does .query() improve readability/performance for this case?
Are text comparisons using .str methods with na=False when necessary?

Key takeaways

Filtering = Boolean masks + selection. Build masks with Series operations or NumPy for speed.
Use .loc for selection and assignment to be explicit and safe.
.query() is great for readable filters and can be faster via numexpr; use @ to inject Python variables.
Avoid chained indexing and prefer masks computed once.

Final thought: a correct filter is the single most influential line in your analysis script. Treat it like a tiny truth oracle — check it, test it, and laugh at the rows you excluded.

If you want, I can generate a small cheat-sheet image or a runnable Jupyter notebook with these examples and more corner cases (e.g., backticks, engine options, performance benchmarks). Say the word and I’ll conjure it like a debugging genie.

Flashcards

Mind Map

Speed Challenge

Comments (0)

Please sign in to leave a comment.

No comments yet. Be the first to comment!

Ready to practice?

Study with flashcards, timelines, and more

Earn certificates for completed courses

Bookmark content for later reference

Track your progress across all topics