Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Sorting and Ranking
Sorting and Ranking in pandas — Make Your Data Line Up Neatly
"If your data were children at recess, sorting is lining them up by height; ranking is handing out medals. Both matter."
You're already comfortable converting types, working with categories, handling missing values, and doing fast numeric work with NumPy. This lesson plugs into that flow: we sort to see structure and rank to quantify position. These operations are tiny tools that make big differences in data cleaning, exploratory analysis, and feature engineering for ML.
What this covers (quickly)
- How to sort DataFrame and Series with pandas' sort_values and sort_index
- Multi-column sorts, stable sorts, and the handy key= transform
- Ranking with rank(): methods, pct, grouping, and handling NaNs
- When to fall back to NumPy for speed
This tutorial uses small example snippets so you can copy-paste and play.
Why sorting matters (and when ranking is better)
- Sorting: visual ordering. Useful for inspection, slicing top-k, and ordering before time-series operations.
- Ranking: relative position. Useful for percentiles, tie-handling in leaderboards, or model features.
Imagine you want the top 3 performers per group. Sorting gets you the rows; ranking gives each entry a numeric place so you can filter with rank <= 3.
Quick example DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': [11, 22, 33, 44, 55, 66],
'group': ['A', 'B', 'A', 'B', 'A', 'B'],
'score': [88, np.nan, 92, 88, 92, 75],
'name': ['alice', 'Bob', 'ALAN', 'bob', 'zoe', 'zoe']
})
This has: NaNs in score, mixed-case names (hello key=), repeated scores (ties), and groups for grouped operations.
Sorting basics
- df.sort_values(by='score', ascending=False) sorts by a column
- df.sort_index() sorts by row labels
- df.sort_values(['group', 'score'], ascending=[True, False]) is a multi-column sort
Examples:
# highest scores first, NaNs go to bottom by default
df.sort_values(by='score', ascending=False)
# keep the original order for ties: the default 'quicksort' is not stable, 'mergesort' (or 'stable') is
df.sort_values(by='score', ascending=False, kind='mergesort')
# multi-column: group asc, score desc
df.sort_values(by=['group', 'score'], ascending=[True, False])
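To see these options side by side, here is a small sketch on a trimmed copy of the example frame. The variable names (by_score, nan_first, restored) are only illustrative; na_position is a standard sort_values parameter that the examples above leave at its default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [11, 22, 33, 44, 55, 66],
    "group": ["A", "B", "A", "B", "A", "B"],
    "score": [88, np.nan, 92, 88, 92, 75],
})

# Stable descending sort: rows with equal scores keep their original relative order.
by_score = df.sort_values(by="score", ascending=False, kind="mergesort")
print(by_score[["id", "score"]])

# na_position="first" moves missing scores to the top instead of the bottom.
nan_first = df.sort_values(by="score", ascending=False, na_position="first")
print(nan_first[["id", "score"]])

# sort_index() puts the rows back in row-label order after a value sort.
restored = by_score.sort_index()
print(restored[["id", "score"]])
```

Because sort_values returns a new DataFrame by default, df itself is untouched by all three calls.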
Case-insensitive sorting with key=
If strings have mixed case and you want a case-insensitive sort, use key=. It receives the column as a Series and must return a Series of the same length, so vectorized pandas/NumPy operations fit naturally:
# sort by name case-insensitively
df.sort_values(by='name', key=lambda col: col.str.lower())
key= is great because the transform is applied only for the comparison: the values stored in the column stay exactly as they were, so you don't need a throwaway lowercase copy of the column.
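A quick way to check this is to compare the default order with the case-folded order on the same names. The frame and variable names below are made up for the illustration, and kind='mergesort' is used only to make the order of tied names predictable.

```python
import pandas as pd

names = pd.DataFrame({"name": ["alice", "Bob", "ALAN", "bob", "zoe", "zoe"]})

# Default string sort compares code points, so uppercase letters come before lowercase.
default_order = names.sort_values(by="name", kind="mergesort")["name"].tolist()

# key= lowercases the values used for comparison only; the output keeps the original text.
folded_order = names.sort_values(
    by="name", key=lambda col: col.str.lower(), kind="mergesort"
)["name"].tolist()

print(default_order)
print(folded_order)
print(names["name"].tolist())  # the source column is unchanged
```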
Categories and sorting
If you followed the previous 'Type Conversion and Categories' lesson, you know categorical dtype can enforce a custom order. Sorting respects ordered categoricals:
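# list the categories explicitly; if they are inferred, pandas stores them in alphabetical order
cat = pd.Categorical(['low', 'medium', 'high'], categories=['low', 'medium', 'high'], ordered=True)
df['priority'] = pd.Categorical(['high', 'low', 'medium', 'low', 'high', 'medium'], categories=cat.categories, ordered=True)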
df.sort_values(by='priority')
This is cleaner than ad-hoc mapping to ints.
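The difference from a plain string sort is easy to see in a small sketch. The tasks frame and the levels list are hypothetical names for this illustration; the one requirement is that the category order is spelled out explicitly.

```python
import pandas as pd

levels = ["low", "medium", "high"]  # intended order, lowest first
tasks = pd.DataFrame({"priority": ["high", "low", "medium", "low", "high", "medium"]})

# Plain strings sort alphabetically: high < low < medium.
alphabetical = tasks.sort_values(by="priority")["priority"].tolist()

# An ordered categorical sorts by each value's position in `levels`.
tasks["priority"] = pd.Categorical(tasks["priority"], categories=levels, ordered=True)
logical = tasks.sort_values(by="priority")["priority"].tolist()

print(alphabetical)
print(logical)
```

An ordered categorical also supports comparisons such as tasks['priority'] >= 'medium', which a plain string column would evaluate alphabetically.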
Ranking: numeric positions, ties, and percentiles
Series.rank() returns the rank of each value.
Parameters to know:
- method: 'average' (default), 'min', 'max', 'first', 'dense'
- ascending: True/False
- pct: if True, returns percentile rank between 0 and 1
- na_option: 'keep' | 'top' | 'bottom' (where to place NaNs)
s = df['score']
print(s.rank()) # default average ranking
print(s.rank(ascending=False)) # higher score = rank 1
print(s.rank(method='dense')) # dense: ranks don't skip numbers
print(s.rank(pct=True)) # percentile (0..1)
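The tie-handling methods are easiest to compare in one table. This sketch rebuilds the score column as a standalone Series and ranks it with each method; the comparison frame is just a convenience for printing.

```python
import numpy as np
import pandas as pd

s = pd.Series([88, np.nan, 92, 88, 92, 75], name="score")

# One column per tie-handling method, all ranking ascending (lowest score = rank 1).
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: s.rank(method=m) for m in methods})
comparison.insert(0, "score", s)
print(comparison)

# pct=True divides each rank by the number of non-missing values.
pct = s.rank(pct=True)
print(pct)
```

With the two tied 88s occupying positions 2 and 3, 'average' gives both 2.5, 'min' gives 2, 'max' gives 3, 'first' gives 2 and 3 in order of appearance, and 'dense' gives 2 without leaving a gap before the next value. The NaN stays NaN in every column because na_option defaults to 'keep'.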
Ranking per group (very common)
# give top performer rank 1 within each group
df['rank_in_group'] = df.groupby('group')['score'].rank(ascending=False, method='dense')
This is your go-to for leaderboards, per-segment scoring, or feature creation in ML pipelines.
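As a sketch of the "top performers per group" idea, the two approaches can be put next to each other. The names winners and top_one are illustrative, and the cutoff of 1 is chosen only because the example groups are tiny.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [11, 22, 33, 44, 55, 66],
    "group": ["A", "B", "A", "B", "A", "B"],
    "score": [88, np.nan, 92, 88, 92, 75],
})

# Dense descending rank within each group: ties share a place and no places are skipped.
df["rank_in_group"] = df.groupby("group")["score"].rank(ascending=False, method="dense")

# Rank filter: keeps every row tied for first place; rows with a NaN rank drop out.
winners = df[df["rank_in_group"] <= 1]
print(winners)

# Sort-then-head: returns exactly one row per group, even when first place is tied.
top_one = (
    df.sort_values(by=["group", "score"], ascending=[True, False])
      .groupby("group")
      .head(1)
)
print(top_one)
```

Which one is right depends on the question: the rank filter answers "who placed first?", while sort-then-head answers "give me one row per group".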
Handling NaNs when ranking
You can use na_option or pre-process:
# keep NaNs (rank returns NaN)
df['score'].rank(na_option='keep')
# rank NaNs last: they receive the highest rank numbers
df['score'].rank(na_option='bottom')
# or fill NaNs with a sentinel below every real score (-999 here) so they rank last
df['score'].fillna(-999).rank(ascending=False)
Refer back to 'Handling Missing Values' for patterns on imputation vs. keeping NaNs.
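To make the three na_option settings concrete, this sketch ranks the same scores with each one. It uses method='min' and the default ascending order so the numbers are easy to read; na_view is just an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series([88, np.nan, 92, 88, 92, 75], name="score")

# Same data, three NaN policies; ascending ranks with ties taking the lowest position.
na_view = pd.DataFrame({
    "score": s,
    "keep": s.rank(method="min", na_option="keep"),      # NaN stays NaN
    "top": s.rank(method="min", na_option="top"),        # NaN gets rank 1
    "bottom": s.rank(method="min", na_option="bottom"),  # NaN gets the last rank
})
print(na_view)
```

Note that 'top' shifts every real score down by one place, which matters if you later filter on a fixed rank cutoff.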
When you might use NumPy instead
If you need maximum speed for large arrays, leverage NumPy's argsort and vectorized ops (we covered this in 'Numerical Computing with NumPy'). Example: get ranking by position (fast):
order = np.argsort(-df['score'].to_numpy()) # descending; NaN sorts last
ranks = np.empty_like(order)
ranks[order] = np.arange(len(df)) + 1
# ranks now contains positions 1..N (tied scores still get distinct ranks; NaN ranks last)
Use NumPy when you are indexing large numeric arrays and want minimal Python overhead. Use pandas' rank() when you want tie-handling, grouping, or NaN-aware behavior — it's built for that.
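The NumPy approach corresponds most closely to rank(method='first'), which can be checked on a small array. This is a sketch on made-up scores without NaNs; kind='stable' is added here so that ties are broken by original position, which the plain argsort call does not guarantee.

```python
import numpy as np
import pandas as pd

scores = np.array([88.0, 75.0, 92.0, 88.0, 92.0, 60.0])  # no NaNs in this sketch

# Stable argsort of the negated values: descending order, ties broken by position.
order = np.argsort(-scores, kind="stable")
ranks = np.empty(len(scores), dtype=np.int64)
ranks[order] = np.arange(1, len(scores) + 1)

# pandas counterpart: method="first" also breaks ties by order of appearance.
pandas_ranks = pd.Series(scores).rank(ascending=False, method="first").astype(np.int64)

print(ranks)
print(pandas_ranks.to_numpy())
```

If you need shared ranks for ties ('min', 'dense', 'average'), the argsort trick needs extra work, and that is the point where rank() is usually the simpler choice.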
Small checklist / tips
- Use sort_values to reorder rows for human inspection, top-k, or stable pre-grouping.
- Use rank() to create numeric positions (with chosen tie behavior) for features and filters.
- Use key= for temporary transforms like case folding; the column itself is not modified.
- When performance matters, consider NumPy argsort for raw arrays, but prefer pandas when you need group-aware or NaN-aware behavior.
- Remember categorical dtype supports custom order and is fast for repeated sorts.
Key takeaways
- Sort to see, rank to measure. Sorting is ordering; ranking assigns each row a numeric place.
- key= is your friend for temporary transformations before sorting (case-insensitive sorts).
- Choose rank method carefully: 'average' vs 'dense' vs 'first' change how ties behave.
- Handle NaNs intentionally: decide whether to keep, treat as top/bottom, or impute.
- Use NumPy for raw speed when you only need positional rankings and no tie logic.
Remember: ordering and ranking are deceptively powerful. Many data-cleaning and feature-engineering problems collapse into a small set of sort-and-rank operations. Next up, you might combine ranking with rolling/window calculations or convert ranks into categorical bins for models — both natural continuations from this lesson.
"If you ever feel lost in a messy table, sort it. If you need meaning beyond order, rank it."