Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
GroupBy and Aggregations in pandas — The Split‑Apply‑Combine Party
"GroupBy in pandas is basically a potluck: everyone brings their rows, you sort them into tables, and then you decide whether you're having a buffet (agg) or individual dishes (transform)."
You already know how to sort, rank, and convert types in pandas (nice work on Positions 6–7). You also remember how NumPy turbocharges array work — broadcasting, vectorized ops, and all that spicy speed. GroupBy is where these ideas meet: you split data into groups, apply summarized or vectorized operations (often using NumPy), and combine results back into a table. Let's dive in — with fewer tears and more elegant code.
Why GroupBy matters (quick reminder)
- Want totals per store? Means per product? Counts per day? That's GroupBy.
- It's the tool for aggregating, summarizing, and creating group-level features for modeling.
- When you learned ranking and sorting, you prepared the data to be meaningfully grouped; when you learned categories, you learned how to speed it up.
The core concept: Split → Apply → Combine
- Split the DataFrame into groups by one or more keys.
- Apply a function (aggregation, transformation, filtering, or custom apply) to each group.
- Combine the group results back into a new DataFrame/Series.
Minimal example
import pandas as pd
import numpy as np
df = pd.DataFrame({
'store': ['A','A','B','B','B'],
'product': ['x','y','x','y','x'],
'revenue': [100, 200, 150, 300, 120]
})
# sum of revenue per store
df.groupby('store')['revenue'].sum()
Simple, readable, and delicious.
Aggregations: count, sum, mean, and friends
pandas provides many built-ins and also accepts NumPy functions. Use .agg() (alias .aggregate) to call multiple aggregations at once.
# multiple aggregations on revenue
df.groupby('store')['revenue'].agg(['sum','mean','count'])
# or mix functions
df.groupby('store').agg(total_revenue=('revenue','sum'),
avg_rev=('revenue','mean'),
n=('revenue','count'))
The named-aggregation syntax (the second form) is clean and produces nice column names.
Pro tip (performance):
- Use built-in strings ('sum', 'mean') or NumPy functions (np.sum) — they map to optimized Cython paths.
- Avoid .apply when a vectorized .agg can do the job.
transform vs aggregate vs apply — pick your weapon
- .agg / .aggregate: returns a reduced result per group (one value per group per aggregation).
- .transform: returns an output aligned with the original index — great for creating group-level features (e.g., z-score within group).
- .apply: runs an arbitrary function on each group and tries to stitch results together — flexible but slower.
Example: create group-normalized revenue (z-score) using NumPy vectorization:
# z-score within store
df['rev_z'] = df.groupby('store')['revenue'].transform(lambda x: (x - x.mean()) / x.std())
# vectorized using NumPy (faster at scale)
grouped = df.groupby('store')['revenue']
means = grouped.transform('mean')
stds = grouped.transform('std')
df['rev_z2'] = (df['revenue'] - means) / stds
We leaned on .transform so results align with the original rows — essential when you want to keep the same shape.
Multiple group keys, levels, and categoricals
Group by multiple columns or an index level:
# group by store and product
df.groupby(['store','product'])['revenue'].sum()
# if your store is a category (from Position 6), grouping is faster
df['store'] = df['store'].astype('category')
Categorical dtypes reduce memory and speed up groupby, especially when there are many repeated keys.
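One wrinkle worth knowing: by default, grouping on a categorical can emit empty groups for categories that never appear in the data. A minimal sketch on standalone toy data, showing the observed=True fix:

```python
import pandas as pd

df = pd.DataFrame({
    'store': pd.Categorical(['A', 'A', 'B'], categories=['A', 'B', 'C']),
    'revenue': [100, 200, 150],
})

# observed=True keeps only categories actually present in the data,
# so the unused category 'C' does not show up as an empty group
sums = df.groupby('store', observed=True)['revenue'].sum()
```

Passing observed=True explicitly also silences the default-change warning newer pandas versions emit for categorical group keys.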
Working with time: pd.Grouper
For time-series grouping (resample-like behavior on DataFrames):
# assuming df has a 'date' column
df['date'] = pd.to_datetime(df['date'])
# group by month ('M' = month-end; pandas 2.2+ prefers the alias 'ME')
df.groupby(pd.Grouper(key='date', freq='M'))['revenue'].sum()
Use this when you need flexible time bins but still want groupby semantics.
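Since the minimal example above has no date column, here is a self-contained sketch with made-up dates; the try/except covers the month-end frequency alias rename ('M' became 'ME' in pandas 2.2):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'revenue': [100, 200, 150],
})

# one row per month-end bin, summing revenue inside each bin
try:
    monthly = df.groupby(pd.Grouper(key='date', freq='ME'))['revenue'].sum()
except ValueError:  # older pandas only understands 'M' for month-end
    monthly = df.groupby(pd.Grouper(key='date', freq='M'))['revenue'].sum()
```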
Advanced patterns & real-world tricks
- Named aggregations across columns: aggregate different columns with different funcs.
df.groupby('store').agg(total=('revenue','sum'),
avg_price=('price','mean'),
sold=('quantity','sum'))
- Filter groups (keep groups meeting a condition):
# keep groups with at least 2 rows
df.groupby('store').filter(lambda g: len(g) >= 2)
- cumcount and rank within groups (you learned ranking already):
# order by revenue within store and get position
df = df.sort_values(['store','revenue'], ascending=[True, False])
df['rank_in_store'] = df.groupby('store').cumcount() + 1
# or use rank() for ties-aware ranking
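A quick sketch of ties-aware ranking within groups, on standalone toy data (method='min' is one choice among several tie-breaking methods):

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'A', 'B'],
                   'revenue': [200, 200, 100, 150]})

# ties-aware rank within each store: equal revenues share the lowest rank
df['rank_in_store'] = (df.groupby('store')['revenue']
                         .rank(method='min', ascending=False))
```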
- merge group results back: frequently you compute a group stat, then join it to the original DataFrame using transform, or merge after .reset_index().
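A minimal sketch of the merge-back pattern, on toy data (the store_total column name is just an illustration):

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'B'], 'revenue': [100, 200, 150]})

# compute a per-group stat, then join it back onto the original rows
totals = df.groupby('store')['revenue'].sum().reset_index(name='store_total')
df = df.merge(totals, on='store', how='left')
```

This does the same job as transform('sum'), but merge lets you attach several group-level columns at once.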
When to prefer pivot_table vs groupby
- pivot_table is handy when your result is a matrix (groups × columns) — it handles aggfunc and fills NaNs.
- groupby is more general and composable. If you need to do multiple aggregations with different outputs, groupby + named-agg wins.
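To see the equivalence, a small sketch (toy data) computing the same matrix both ways:

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'product': ['x', 'y', 'x', 'y'],
    'revenue': [100, 200, 150, 300],
})

# matrix view: stores as rows, products as columns
m = df.pivot_table(index='store', columns='product',
                   values='revenue', aggfunc='sum', fill_value=0)

# the same numbers via groupby + unstack
g = df.groupby(['store', 'product'])['revenue'].sum().unstack(fill_value=0)
```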
Speed & memory tips (the part your CPU thanks you for)
- Use categorical keys (Position 6) when grouping on repeated strings.
- Use vectorized NumPy functions in .agg (np.mean, np.sum) rather than custom Python loops.
- Avoid .apply on large groups; prefer .agg or .transform when possible.
- Pass sort=False to groupby if you don't need sorted group keys — slightly faster.
- If you need group-specific cumulative or sliding windows, look into the groupby().cum* methods and rolling combined with groupby, using .apply sparingly.
Common gotchas
- GroupBy loses the original index shape when you aggregate — use as_index=False or .reset_index() to keep it tidy.
- .transform must return the same index length as the group — otherwise you'll get an error.
- Beware mixing different agg outputs with apply; pandas may return an awkwardly shaped DataFrame.
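A small sketch of the as_index behavior, on toy data:

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'B'], 'revenue': [100, 200, 150]})

# default: the group keys become the index of the result
by_index = df.groupby('store')['revenue'].sum()

# as_index=False keeps 'store' as a regular column — handy for later merges
tidy = df.groupby('store', as_index=False)['revenue'].sum()
```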
Quick recipe catalog (cheat sheet)
- Sum per group: df.groupby('k')['v'].sum()
- Multiple aggs: df.groupby('k').agg(['sum','mean']) or named-agg
- Add group stat as column: df['grp_sum'] = df.groupby('k')['v'].transform('sum')
- Filter groups: df.groupby('k').filter(lambda g: condition)
- Group time: df.groupby(pd.Grouper(key='date', freq='W')).sum()
Final takeaways (so you can flex in interviews)
- GroupBy = Split → Apply → Combine. It's the canonical summary tool.
- Prefer vectorized aggregations and NumPy functions for speed; use transform for row-aligned group features.
- Use categoricals and as_index=False to manage speed and output shape.
"If data is a crowd, GroupBy is the bouncer who says 'you in the same group — come sit together and we'll summarize you.'"
Go try: compute monthly revenue per product, then create each row's ratio to its monthly total (transform), and finally rank rows within each product-month (cumcount/rank). You'll be using everything you've learned: grouping, time bucketing, ranking, and vectorized math. Beautiful.
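If you want to check your work afterwards, here is one possible sketch, using made-up data and assumed column names (date, product, revenue):

```python
import pandas as pd

# made-up sample data; real data would have many more rows
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20',
                            '2024-01-25', '2024-02-03']),
    'product': ['x', 'x', 'y', 'x'],
    'revenue': [100, 200, 300, 150],
})
df['month'] = df['date'].dt.to_period('M')

# 1) monthly revenue per product
monthly = df.groupby(['product', 'month'])['revenue'].sum()

# 2) each row's share of its month's total (transform keeps row alignment)
df['share_of_month'] = df['revenue'] / df.groupby('month')['revenue'].transform('sum')

# 3) rank rows within each product-month
df['rank'] = df.groupby(['product', 'month'])['revenue'].rank(ascending=False)
```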