Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
GroupBy and Aggregations in pandas — The Split‑Apply‑Combine Party
"GroupBy in pandas is basically a potluck: everyone brings their rows, you sort them into tables, and then you decide whether you're having a buffet (agg) or individual dishes (transform)."
You already know how to sort, rank, and convert types in pandas (nice work on Positions 6–7). You also remember how NumPy turbocharges array work — broadcasting, vectorized ops, and all that spicy speed. GroupBy is where these ideas meet: you split data into groups, apply summarized or vectorized operations (often using NumPy), and combine results back into a table. Let's dive in — with fewer tears and more elegant code.
Why GroupBy matters (quick reminder)
- Want totals per store? Means per product? Counts per day? That's GroupBy.
- It's the tool for aggregating, summarizing, and creating group-level features for modeling.
- When you learned ranking and sorting, you prepared the data to be meaningfully grouped; when you learned categories, you learned how to speed it up.
The core concept: Split → Apply → Combine
- Split the DataFrame into groups by one or more keys.
- Apply a function (aggregation, transformation, filtering, or custom apply) to each group.
- Combine the group results back into a new DataFrame/Series.
Minimal example
import pandas as pd
import numpy as np
df = pd.DataFrame({
'store': ['A','A','B','B','B'],
'product': ['x','y','x','y','x'],
'revenue': [100, 200, 150, 300, 120]
})
# sum of revenue per store
df.groupby('store')['revenue'].sum()
Simple, readable, and delicious.
Aggregations: count, sum, mean, and friends
pandas provides many built-ins and also accepts NumPy functions. Use .agg() (alias .aggregate) to call multiple aggregations at once.
# multiple aggregations on revenue
df.groupby('store')['revenue'].agg(['sum','mean','count'])
# or mix functions
df.groupby('store').agg(total_revenue=('revenue','sum'),
avg_rev=('revenue','mean'),
n=('revenue','count'))
The named-aggregation syntax (the second form) is clean and produces nice column names.
Pro tip (performance):
- Use built-in strings ('sum', 'mean') or NumPy functions (np.sum) — they map to optimized Cython paths.
- Avoid .apply when a vectorized .agg can do the job.
transform vs aggregate vs apply — pick your weapon
- .agg / .aggregate: returns a reduced result per group (one value per group per aggregation).
- .transform: returns an output aligned with the original index — great for creating group-level features (e.g., z-score within group).
- .apply: runs an arbitrary function on each group and tries to stitch results together — flexible but slower.
Example: create group-normalized revenue (z-score) using NumPy vectorization:
# z-score within store
df['rev_z'] = df.groupby('store')['revenue'].transform(lambda x: (x - x.mean()) / x.std())
# vectorized using NumPy (faster at scale)
grouped = df.groupby('store')['revenue']
means = grouped.transform('mean')
stds = grouped.transform('std')
df['rev_z2'] = (df['revenue'] - means) / stds
We leaned on .transform so results align with the original rows — essential when you want to keep the same shape.
Multiple group keys, levels, and categoricals
Group by multiple columns or an index level:
# group by store and product
df.groupby(['store','product'])['revenue'].sum()
# if your store is a category (from Position 6), grouping is faster
df['store'] = df['store'].astype('category')
Categorical dtypes reduce memory and speed up groupby, especially when there are many repeated keys.
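One wrinkle worth knowing: by default, grouping on a categorical can emit empty groups for categories that never appear in the data. A minimal sketch on standalone toy data, showing the observed=True fix:

```python
import pandas as pd

df = pd.DataFrame({
    'store': pd.Categorical(['A', 'A', 'B'], categories=['A', 'B', 'C']),
    'revenue': [100, 200, 150],
})

# observed=True keeps only categories actually present in the data,
# so the unused category 'C' does not show up as an empty group
sums = df.groupby('store', observed=True)['revenue'].sum()
```

Passing observed=True explicitly also silences the default-change warning newer pandas versions emit for categorical group keys.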
Working with time: pd.Grouper
For time-series grouping (resample-like behavior on DataFrames):
# assuming df has a 'date' column
df['date'] = pd.to_datetime(df['date'])
# group by month ('M' = month-end; pandas 2.2+ prefers the alias 'ME')
df.groupby(pd.Grouper(key='date', freq='M'))['revenue'].sum()
Use this when you need flexible time bins but still want groupby semantics.
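Since the minimal example above has no date column, here is a self-contained sketch with made-up dates; the try/except covers the month-end frequency alias rename ('M' became 'ME' in pandas 2.2):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'revenue': [100, 200, 150],
})

# one row per month-end bin, summing revenue inside each bin
try:
    monthly = df.groupby(pd.Grouper(key='date', freq='ME'))['revenue'].sum()
except ValueError:  # older pandas only understands 'M' for month-end
    monthly = df.groupby(pd.Grouper(key='date', freq='M'))['revenue'].sum()
```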
Advanced patterns & real-world tricks
- Named aggregations across columns: aggregate different columns with different funcs.
df.groupby('store').agg(total=('revenue','sum'),
avg_price=('price','mean'),
sold=('quantity','sum'))
- Filter groups (keep groups meeting a condition):
# keep groups with at least 2 rows
df.groupby('store').filter(lambda g: len(g) >= 2)
- cumcount and rank within groups (you learned ranking already):
# order by revenue within store and get position
df = df.sort_values(['store','revenue'], ascending=[True, False])
df['rank_in_store'] = df.groupby('store').cumcount() + 1
# or use rank() for ties-aware ranking
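A quick sketch of ties-aware ranking within groups, on standalone toy data (method='min' is one choice among several tie-breaking methods):

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'A', 'B'],
                   'revenue': [200, 200, 100, 150]})

# ties-aware rank within each store: equal revenues share the lowest rank
df['rank_in_store'] = (df.groupby('store')['revenue']
                         .rank(method='min', ascending=False))
```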
- merge group results back: frequently you compute a group stat, then join it to the original DataFrame using transform, or merge after .reset_index().
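A minimal sketch of the merge-back pattern, on toy data (the store_total column name is just an illustration):

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'B'], 'revenue': [100, 200, 150]})

# compute a per-group stat, then join it back onto the original rows
totals = df.groupby('store')['revenue'].sum().reset_index(name='store_total')
df = df.merge(totals, on='store', how='left')
```

This does the same job as transform('sum'), but merge lets you attach several group-level columns at once.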
When to prefer pivot_table vs groupby
- pivot_table is handy when your result is a matrix (groups × columns) — it handles aggfunc and fills NaNs.
- groupby is more general and composable. If you need to do multiple aggregations with different outputs, groupby + named-agg wins.
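To see the equivalence, a small sketch (toy data) computing the same matrix both ways:

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'product': ['x', 'y', 'x', 'y'],
    'revenue': [100, 200, 150, 300],
})

# matrix view: stores as rows, products as columns
m = df.pivot_table(index='store', columns='product',
                   values='revenue', aggfunc='sum', fill_value=0)

# the same numbers via groupby + unstack
g = df.groupby(['store', 'product'])['revenue'].sum().unstack(fill_value=0)
```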
Speed & memory tips (the part your CPU thanks you for)
- Use categorical keys (Position 6) when grouping on repeated strings.
- Use vectorized NumPy functions in .agg (np.mean, np.sum) rather than custom Python loops.
- Avoid .apply on large groups; prefer .agg or .transform when possible.
- Pass sort=False to groupby if you don't need sorted group keys — slightly faster.
- If you need group-specific cumulative or sliding windows, look into the groupby().cum* methods and rolling combined with groupby, using .apply sparingly.
Common gotchas
- GroupBy loses the original index shape when you aggregate — use as_index=False or .reset_index() to keep it tidy.
- .transform must return the same index length as the group — otherwise you'll get an error.
- Beware mixing different agg outputs with apply; pandas may return an awkwardly shaped DataFrame.
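A small sketch of the as_index behavior, on toy data:

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'B'], 'revenue': [100, 200, 150]})

# default: the group keys become the index of the result
by_index = df.groupby('store')['revenue'].sum()

# as_index=False keeps 'store' as a regular column — handy for later merges
tidy = df.groupby('store', as_index=False)['revenue'].sum()
```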
Quick recipe catalog (cheat sheet)
- Sum per group: df.groupby('k')['v'].sum()
- Multiple aggs: df.groupby('k').agg(['sum','mean']) or named-agg
- Add group stat as column: df['grp_sum'] = df.groupby('k')['v'].transform('sum')
- Filter groups: df.groupby('k').filter(lambda g: condition)
- Group time: df.groupby(pd.Grouper(key='date', freq='W')).sum()
Final takeaways (so you can flex in interviews)
- GroupBy = Split → Apply → Combine. It's the canonical summary tool.
- Prefer vectorized aggregations and NumPy functions for speed; use transform for row-aligned group features.
- Use categoricals and as_index=False to manage speed and output shape.
"If data is a crowd, GroupBy is the bouncer who says 'you in the same group — come sit together and we'll summarize you.'"
Go try: compute monthly revenue per product, then create each row's ratio to its monthly total (transform), and finally rank rows within each product-month (cumcount/rank). You'll be using everything you've learned: grouping, time bucketing, ranking, and vectorized math. Beautiful.
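If you want to check your work afterwards, here is one possible sketch, using made-up data and assumed column names (date, product, revenue):

```python
import pandas as pd

# made-up sample data; real data would have many more rows
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20',
                            '2024-01-25', '2024-02-03']),
    'product': ['x', 'x', 'y', 'x'],
    'revenue': [100, 200, 300, 150],
})
df['month'] = df['date'].dt.to_period('M')

# 1) monthly revenue per product
monthly = df.groupby(['product', 'month'])['revenue'].sum()

# 2) each row's share of its month's total (transform keeps row alignment)
df['share_of_month'] = df['revenue'] / df.groupby('month')['revenue'].transform('sum')

# 3) rank rows within each product-month
df['rank'] = df.groupby(['product', 'month'])['revenue'].rank(ascending=False)
```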