
Python for Data Science, AI & Development
Data Analysis with pandas


Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.

GroupBy and Aggregations

pandas GroupBy and Aggregations Explained Clearly (with Examples)

GroupBy and Aggregations in pandas — The Split‑Apply‑Combine Party

"GroupBy in pandas is basically a potluck: everyone brings their rows, you sort them into tables, and then you decide whether you're having a buffet (agg) or individual dishes (transform)."

You already know how to sort, rank, and convert types in pandas (nice work on Positions 6–7). You also remember how NumPy turbocharges array work — broadcasting, vectorized ops, and all that spicy speed. GroupBy is where these ideas meet: you split data into groups, apply summarized or vectorized operations (often using NumPy), and combine results back into a table. Let's dive in — with fewer tears and more elegant code.


Why GroupBy matters (quick reminder)

  • Want totals per store? Means per product? Counts per day? That's GroupBy.
  • It's the tool for aggregating, summarizing, and creating group-level features for modeling.
  • When you learned ranking and sorting, you prepared the data to be meaningfully grouped; when you learned categories, you learned how to speed it up.

The core concept: Split → Apply → Combine

  1. Split the DataFrame into groups by one or more keys.
  2. Apply a function (aggregation, transformation, filtering, or custom apply) to each group.
  3. Combine the group results back into a new DataFrame/Series.

Minimal example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'store': ['A','A','B','B','B'],
    'product': ['x','y','x','y','x'],
    'revenue': [100, 200, 150, 300, 120]
})

# sum of revenue per store
df.groupby('store')['revenue'].sum()

Simple, readable, and delicious.


Aggregations: count, sum, mean, and friends

pandas provides many built-in aggregations and also accepts NumPy functions. Use .agg() (an alias for .aggregate()) to run multiple aggregations at once.

# multiple aggregations on revenue
df.groupby('store')['revenue'].agg(['sum','mean','count'])

# or mix functions
df.groupby('store').agg(total_revenue=('revenue','sum'),
                         avg_rev=('revenue','mean'),
                         n=('revenue','count'))

The named-aggregation syntax (the second example) is clean and produces nice column names.

Pro tip (performance):

  • Use built-in strings ('sum', 'mean') or NumPy functions (np.sum) — they map to optimized Cython paths.
  • Avoid apply when a vectorized agg can do the job.

transform vs aggregate vs apply — pick your weapon

  • .agg / .aggregate: returns a reduced result per group (one row per group per aggregation).
  • .transform: returns an output aligned with the original index — great for creating group-level features (e.g., z-score within group).
  • .apply: runs an arbitrary function on each group and tries to stitch results together — flexible but slower.

Example: create group-normalized revenue (z-score) using NumPy vectorization:

# z-score within store
df['rev_z'] = df.groupby('store')['revenue'].transform(lambda x: (x - x.mean()) / x.std())

# vectorized using NumPy (faster at scale)
grouped = df.groupby('store')['revenue']
means = grouped.transform('mean')
stds = grouped.transform('std')
df['rev_z2'] = (df['revenue'] - means) / stds

We leaned on .transform so results align with the original rows — essential when you want to keep the same shape.
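For contrast, here is a short sketch of .apply doing something .agg and .transform can't express as neatly: keeping the single highest-revenue row of each group. The toy frame mirrors the minimal example from earlier; group_keys=False just keeps the original row index.

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'B'],
    'product': ['x', 'y', 'x', 'y', 'x'],
    'revenue': [100, 200, 150, 300, 120]
})

# .apply runs an arbitrary function per group; here each group returns
# its highest-revenue row, and pandas stitches the pieces back together
top = df.groupby('store', group_keys=False).apply(
    lambda g: g.nlargest(1, 'revenue')
)
print(top['revenue'].tolist())  # [200, 300]
```

Because the per-group function returns a DataFrame slice rather than a scalar, this is genuinely .apply territory — flexible, but measurably slower than .agg on large data.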


Multiple group keys, levels, and categoricals

Group by multiple columns or an index level:

# group by store and product
df.groupby(['store','product'])['revenue'].sum()

# if your store is a category (from Position 6), grouping is faster
df['store'] = df['store'].astype('category')

Categorical dtypes reduce memory and speed up groupby, especially when there are many repeated keys.
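One wrinkle worth knowing when you group on a categorical key (a sketch of my own, not from the lesson text): by default pandas emits a row for every declared category, even ones with no data, unless you pass observed=True.

```python
import pandas as pd

df = pd.DataFrame({
    # 'C' is a declared category but never appears in the data
    'store': pd.Categorical(['A', 'A', 'B'], categories=['A', 'B', 'C']),
    'revenue': [100, 200, 150],
})

# observed=False (the historical default) includes empty group 'C';
# observed=True keeps only groups that actually occur in the data
all_cats = df.groupby('store', observed=False)['revenue'].sum()
seen_only = df.groupby('store', observed=True)['revenue'].sum()
print(len(all_cats), len(seen_only))  # 3 2
```

Passing observed explicitly also silences the deprecation warning newer pandas versions raise while the default changes.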


Working with time: pd.Grouper

For time-series grouping (resample-like behavior on DataFrames):

# assuming your DataFrame has a 'date' column (the toy df above doesn't)
df['date'] = pd.to_datetime(df['date'])
# group by month ('M' = month-end; pandas >= 2.2 spells this 'ME')
df.groupby(pd.Grouper(key='date', freq='M'))['revenue'].sum()

Use this when you need flexible time bins but still want groupby semantics.
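Here is a self-contained version of that pattern with synthetic dates of my own; I use freq='MS' (month-start bins) because that spelling works across pandas versions, unlike the 'M'/'ME' month-end split.

```python
import pandas as pd

sales = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'revenue': [100, 50, 75],
})

# One output row per calendar month, labeled by the month's first day
monthly = sales.groupby(pd.Grouper(key='date', freq='MS'))['revenue'].sum()
print(monthly.tolist())  # [150, 75]
```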


Advanced patterns & real-world tricks

  • Named aggregations across columns: aggregate different columns with different funcs.
# assumes 'price' and 'quantity' columns exist alongside 'revenue'
df.groupby('store').agg(total=('revenue','sum'),
                        avg_price=('price','mean'),
                        sold=('quantity','sum'))
  • Filter groups (keep groups meeting a condition):
# keep groups with at least 2 rows
df.groupby('store').filter(lambda g: len(g) >= 2)
  • cumcount and rank within groups (you learned ranking already):
# order by revenue within store and get position
df = df.sort_values(['store','revenue'], ascending=[True, False])
df['rank_in_store'] = df.groupby('store').cumcount() + 1
# or use rank() for ties-aware ranking
  • merge group results back: frequently you compute a group stat then join to original DataFrame using transform or merge from .reset_index().
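That last bullet — computing a group stat and attaching it to the original rows — can be sketched both ways on a tiny frame of my own:

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B'],
    'revenue': [100, 200, 150],
})

# Way 1: transform keeps the original shape, so assignment just works
df['store_total'] = df.groupby('store')['revenue'].transform('sum')

# Way 2: aggregate to one row per key, then merge back on that key
totals = (df.groupby('store', as_index=False)['revenue'].sum()
            .rename(columns={'revenue': 'store_total2'}))
df = df.merge(totals, on='store', how='left')
print(df[['store_total', 'store_total2']].values.tolist())
```

transform is the lighter option when you only need the stat as a column; the merge route is handy when the aggregated table is useful on its own.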

When to prefer pivot_table vs groupby

  • pivot_table is handy when your result is a matrix (groups × columns) — it handles aggfunc and fills NaNs.
  • groupby is more general and composable. If you need to do multiple aggregations with different outputs, groupby + named-agg wins.
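A quick sketch of the same aggregation done both ways, on a toy frame of my own, to make the long-vs-matrix distinction concrete:

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'product': ['x', 'y', 'x', 'y'],
    'revenue': [100, 200, 150, 300],
})

# groupby: long format — one row per (store, product) pair
long = df.groupby(['store', 'product'])['revenue'].sum()

# pivot_table: matrix format — stores down, products across, gaps fillable
wide = df.pivot_table(index='store', columns='product',
                      values='revenue', aggfunc='sum', fill_value=0)
print(wide)
```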

Speed & memory tips (the part your CPU thanks you for)

  • Use categorical keys (Position 6) when grouping on repeated strings.
  • Use vectorized NumPy functions in .agg (np.mean, np.sum) rather than custom Python loops.
  • Avoid .apply on large groups; prefer .agg or .transform when possible.
  • Pass sort=False to groupby if you don't need sorted group keys — slightly faster.
  • If you need group-specific cumulative or sliding windows, look into the groupby().cum* methods (cumsum, cumcount, ...) and reach for rolling windows via groupby.apply only sparingly.
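The cum* tip in action — a running total per group that stays aligned with the original rows (sketch on my own toy data):

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'revenue': [100, 200, 150, 300],
})

# cumsum restarts at each group boundary but keeps the original shape,
# so it drops straight into a new column
df['running'] = df.groupby('store')['revenue'].cumsum()
print(df['running'].tolist())  # [100, 300, 150, 450]
```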

Common gotchas

  • Aggregating moves the group keys into the result's index — pass as_index=False or call .reset_index() if you want them back as ordinary columns.
  • .transform must return either a scalar or a result the same length as the group — anything else raises an error.
  • Beware mixing different agg outputs with apply; pandas may return an awkwardly shaped DataFrame.
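The index gotcha in two lines (my own minimal frame):

```python
import pandas as pd

df = pd.DataFrame({'store': ['A', 'B'], 'revenue': [100, 150]})

# Default: the group key becomes the index of the result
s = df.groupby('store')['revenue'].sum()

# as_index=False keeps the key as a normal column (tidy, merge-friendly)
t = df.groupby('store', as_index=False)['revenue'].sum()
print(t.columns.tolist())  # ['store', 'revenue']
```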

Quick recipe catalog (cheat sheet)

  • Sum per group: df.groupby('k')['v'].sum()
  • Multiple aggs: df.groupby('k').agg(['sum','mean']) or named-agg
  • Add group stat as column: df['grp_sum'] = df.groupby('k')['v'].transform('sum')
  • Filter groups: df.groupby('k').filter(lambda g: condition)
  • Group time: df.groupby(pd.Grouper(key='date', freq='W')).sum()

Final takeaways (so you can flex in interviews)

  • GroupBy = Split → Apply → Combine. It's the canonical summary tool.
  • Prefer vectorized aggregations and NumPy functions for speed; use transform for row-aligned group features.
  • Use categoricals and as_index=False to manage speed and output shape.

"If data is a crowd, GroupBy is the bouncer who says 'you in the same group — come sit together and we'll summarize you.'"

Go try: compute monthly revenue per product, then create each row's ratio to its monthly total (transform), and finally rank rows within each product-month (cumcount/rank). You'll be using everything you've learned: grouping, time bucketing, ranking, and vectorized math. Beautiful.
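Here is a hedged sketch of one way that exercise could go, on invented sample data (the column names product, date, revenue and the sample values are my assumptions, not part of the lesson):

```python
import pandas as pd

sales = pd.DataFrame({
    'product': ['x', 'x', 'y', 'y'],
    'date': pd.to_datetime(['2024-01-03', '2024-01-20',
                            '2024-01-05', '2024-02-10']),
    'revenue': [100, 300, 200, 150],
})

# 1) Time bucketing: a calendar-month label per row
sales['month'] = sales['date'].dt.to_period('M')

# 2) transform: each row's share of its month's total revenue
monthly_total = sales.groupby('month')['revenue'].transform('sum')
sales['share'] = sales['revenue'] / monthly_total

# 3) Ranking within each product-month (1 = highest revenue)
sales['rank'] = (sales.groupby(['product', 'month'])['revenue']
                      .rank(ascending=False, method='dense'))
print(sales[['product', 'month', 'share', 'rank']])
```

Grouping, time bucketing, transform, and ranking in about ten lines — exactly the combination the exercise asks for.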
