Sorting and Ranking in pandas — Make Your Data Line Up Neatly

"If your data were children at recess, sorting is lining them up by height; ranking is handing out medals. Both matter."

You're already comfortable converting types, working with categories, handling missing values, and doing fast numeric work with NumPy. This lesson plugs into that flow: we sort to see structure and rank to quantify position. These operations are tiny tools that make big differences in data cleaning, exploratory analysis, and feature engineering for ML.

What this covers (quickly)

How to sort DataFrame and Series with pandas' sort_values and sort_index
Multi-column sorts, stable sorts, and the handy key= transform
Ranking with rank() — methods, pct, grouping, and handling NaNs
When to fall back to NumPy for speed

This tutorial uses small example snippets so you can copy-paste and play.

Why sorting matters (and when ranking is better)

Sorting: visual ordering. Useful for inspection, slicing top-k, and ordering before time-series operations.
Ranking: relative position. Useful for percentiles, tie-handling in leaderboards, or model features.

Imagine you want the top 3 performers per group. Sorting gets you the rows; ranking gives each entry a numeric place so you can filter with rank <= 3.

Quick example DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [11, 22, 33, 44, 55, 66],
    'group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'score': [88, np.nan, 92, 88, 92, 75],
    'name': ['alice', 'Bob', 'ALAN', 'bob', 'zoe', 'zoe']
})

This has: NaNs in score, mixed-case names (hello key=), repeated scores (ties), and groups for grouped operations.

Sorting basics

df.sort_values(by='score', ascending=False) sorts by a column
df.sort_index() sorts by row labels
df.sort_values(['group', 'score'], ascending=[True, False]) is multi-column

Examples:

# highest scores first, NaNs go to bottom by default
df.sort_values(by='score', ascending=False)

# keep the original row order among ties: the default 'quicksort' is not stable, so request 'mergesort' (or 'stable')
df.sort_values(by='score', ascending=False, kind='mergesort')

# multi-column: group asc, score desc
df.sort_values(by=['group', 'score'], ascending=[True, False])
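
As a rough sketch of how the multi-column sort feeds a top-k selection, the snippet below rebuilds the example table (without the name column), sorts by group and descending score, and keeps the first two rows of each group. The names `ordered` and `top2` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [11, 22, 33, 44, 55, 66],
    'group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'score': [88, np.nan, 92, 88, 92, 75],
})

# group ascending, score descending; NaN scores land last within their group
ordered = df.sort_values(by=['group', 'score'], ascending=[True, False])

# head(2) keeps the first two rows of each group, in the sorted order
top2 = ordered.groupby('group').head(2)
print(top2)
```

Because the NaN score sorts last inside group B, it never makes it into the top two. The order of the two tied 92s in group A is not something to rely on here.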

Case-insensitive sorting with `key=`

If strings have mixed case and you want a case-insensitive sort, use key=. It receives the whole column as a Series and must return a Series of the same length, so vectorized pandas string methods work well:

# sort by name case-insensitively
df.sort_values(by='name', key=lambda col: col.str.lower())

key= applies the transform only while sorting; the stored column is left unchanged, so you don't need to build a lowercase helper column or convert the data first.
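
To see the difference, here is a small sketch comparing the default sort with a key= sort on the same names. The variable names are made up for illustration, and kind='mergesort' is added only so that tied names keep their original order.

```python
import pandas as pd

names = pd.DataFrame({'name': ['alice', 'Bob', 'ALAN', 'bob', 'zoe', 'zoe']})

# default sort is case-sensitive: uppercase letters come before lowercase ones
case_sensitive = names.sort_values(by='name')['name'].tolist()

# key= lowercases only for the comparison; mergesort keeps 'Bob' ahead of 'bob'
case_insensitive = names.sort_values(
    by='name', key=lambda col: col.str.lower(), kind='mergesort'
)['name'].tolist()

print(case_sensitive)    # ['ALAN', 'Bob', 'alice', 'bob', 'zoe', 'zoe']
print(case_insensitive)  # ['ALAN', 'alice', 'Bob', 'bob', 'zoe', 'zoe']
```

The original column still holds the mixed-case strings afterwards.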

Categories and sorting

If you followed the previous 'Type Conversion and Categories' lesson, you know categorical dtype can enforce a custom order. Sorting respects ordered categoricals:

levels = ['low', 'medium', 'high']  # pass the order explicitly; inferred categories would be sorted alphabetically
df['priority'] = pd.Categorical(['high', 'low', 'medium', 'low', 'high', 'medium'], categories=levels, ordered=True)

df.sort_values(by='priority')

This is cleaner than ad-hoc mapping to ints.
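
A minimal sketch of why the explicit category order matters: the same labels sort by meaning as an ordered categorical and alphabetically as plain strings. The names `levels`, `priority`, and `alphabetical` are illustrative.

```python
import pandas as pd

levels = ['low', 'medium', 'high']  # intended order, smallest first
priority = pd.Series(
    pd.Categorical(['high', 'low', 'medium', 'low', 'high', 'medium'],
                   categories=levels, ordered=True)
)

# sorting follows the category order, not the alphabet
sorted_priority = priority.sort_values().tolist()
print(sorted_priority)  # ['low', 'low', 'medium', 'medium', 'high', 'high']

# the same labels as plain strings sort alphabetically instead
alphabetical = priority.astype(str).sort_values().tolist()
print(alphabetical)     # ['high', 'high', 'low', 'low', 'medium', 'medium']
```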

Ranking: numeric positions, ties, and percentiles

Series.rank() returns the rank of each value.

Parameters to know:

method: 'average' (default), 'min', 'max', 'first', 'dense'
ascending: True/False
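pct: if True, returns each rank divided by the number of non-NaN values, so results fall in (0, 1]
na_option: 'keep' (default) leaves NaNs unranked, 'top' gives NaNs the smallest rank numbers, 'bottom' gives them the largest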

s = df['score']
print(s.rank())                # default average ranking
print(s.rank(ascending=False)) # higher score = rank 1
print(s.rank(method='dense'))  # dense: ranks don't skip numbers
print(s.rank(pct=True))        # percentile (0..1)
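
The tie-handling methods are easiest to compare side by side. The sketch below ranks the lesson's score values with every method and collects the results in one table; `methods` and `table` are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([88, np.nan, 92, 88, 92, 75])

methods = ['average', 'min', 'max', 'first', 'dense']
table = pd.DataFrame({m: s.rank(method=m) for m in methods})
table['pct'] = s.rank(pct=True)  # average rank / number of non-NaN values
table.insert(0, 'score', s)
print(table)
#    score  average  min  max  first  dense  pct
# 0   88.0      2.5  2.0  3.0    2.0    2.0  0.5
# 1    NaN      NaN  NaN  NaN    NaN    NaN  NaN
# 2   92.0      4.5  4.0  5.0    4.0    3.0  0.9
# 3   88.0      2.5  2.0  3.0    3.0    2.0  0.5
# 4   92.0      4.5  4.0  5.0    5.0    3.0  0.9
# 5   75.0      1.0  1.0  1.0    1.0    1.0  0.2
```

Only 'first' gives tied scores different ranks (by order of appearance), and only 'dense' avoids gaps after a tie.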

Ranking per group (very common)

# give top performer rank 1 within each group
df['rank_in_group'] = df.groupby('group')['score'].rank(ascending=False, method='dense')

This is your go-to for leaderboards, per-segment scoring, or feature creation in ML pipelines.
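
As a sketch of the leaderboard use case, the snippet below computes the per-group rank on the example data and then filters on it. The `winners` variable is hypothetical; change the threshold to 3 for a top-3 filter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [11, 22, 33, 44, 55, 66],
    'group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'score': [88, np.nan, 92, 88, 92, 75],
})

# dense ranking, highest score first, restarted inside each group
df['rank_in_group'] = df.groupby('group')['score'].rank(ascending=False, method='dense')

# rows with a NaN rank drop out, because NaN <= 1 evaluates to False
winners = df[df['rank_in_group'] <= 1]
print(winners[['id', 'group', 'score', 'rank_in_group']])
```

Group A returns two rows because both 92s share rank 1; with method='first' exactly one of them would be kept.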

Handling NaNs when ranking

You can use na_option or pre-process:

# keep NaNs (rank returns NaN)
df['score'].rank(na_option='keep')

# give NaNs the largest rank numbers, i.e. last place (use bottom)
df['score'].rank(na_option='bottom')

# or fill NaNs with a sentinel below every real score (here -999) so they rank last
df['score'].fillna(-999).rank(ascending=False)

Refer back to 'Handling Missing Values' for patterns on imputation vs. keeping NaNs.
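
To make the three na_option choices concrete, here is a small sketch that ranks the same scores in descending order (higher score = rank 1) under each option. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([88, np.nan, 92, 88, 92, 75])

keep = s.rank(ascending=False)                        # default: NaN stays NaN
bottom = s.rank(ascending=False, na_option='bottom')  # NaN gets the largest rank
top = s.rank(ascending=False, na_option='top')        # NaN gets rank 1

print(pd.DataFrame({'score': s, 'keep': keep, 'bottom': bottom, 'top': top}))
#    score  keep  bottom  top
# 0   88.0   3.5     3.5  4.5
# 1    NaN   NaN     6.0  1.0
# 2   92.0   1.5     1.5  2.5
# 3   88.0   3.5     3.5  4.5
# 4   92.0   1.5     1.5  2.5
# 5   75.0   5.0     5.0  6.0
```

Note that 'top' shifts every real score down by one place, which is rarely what a leaderboard wants.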

When you might use NumPy instead

If you need maximum speed for large arrays, leverage NumPy's argsort and vectorized ops (we covered this in 'Numerical Computing with NumPy'). Example: get ranking by position (fast):

order = np.argsort(-df['score'].to_numpy(), kind='stable')   # descending; NaN sorts last
ranks = np.empty_like(order)
ranks[order] = np.arange(1, len(df) + 1)
# ranks now holds positions 1..N: tied scores get distinct consecutive ranks, and the NaN row is ranked last

Use NumPy when you are indexing large numeric arrays and want minimal Python overhead. Use pandas' rank() when you want tie-handling, grouping, or NaN-aware behavior — it's built for that.
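
As a sanity check on the argsort approach, the sketch below compares it with pandas on a small NaN-free array. With a stable sort, the positional ranks should match rank(method='first'). The array values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

scores = np.array([88.0, 92.0, 88.0, 75.0])  # no NaNs, one tie

# stable argsort on the negated values: descending order, ties by position
order = np.argsort(-scores, kind='stable')
np_ranks = np.empty(len(scores), dtype=np.int64)
np_ranks[order] = np.arange(1, len(scores) + 1)

# pandas equivalent: ties broken by order of appearance
pd_ranks = pd.Series(scores).rank(ascending=False, method='first').astype(int).to_numpy()

print(np_ranks)  # [2 1 3 4]
print(pd_ranks)  # [2 1 3 4]
```

Any other tie rule ('average', 'min', 'dense') needs extra work in NumPy, which is usually the point at which pandas' rank() is the simpler choice.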

Small checklist / tips

Use sort_values to reorder rows for human inspection, top-k selection, or to fix the row order before grouped operations.
Use rank() to create numeric positions (with chosen tie behavior) for features and filters.
Use key= for temporary transforms like case folding — avoids permanent dtype changes.
When performance matters, consider NumPy argsort for raw arrays, but prefer pandas when you need group-aware or NaN-aware behavior.
Remember categorical dtype supports custom order and is fast for repeated sorts.

Key takeaways

Sort to see, rank to measure. Sorting is ordering; ranking assigns each row a numeric place.
key= is your friend for temporary transformations before sorting (case-insensitive sorts).
Choose rank method carefully: 'average' vs 'dense' vs 'first' change how ties behave.
Handle NaNs intentionally: decide whether to keep, treat as top/bottom, or impute.
Use NumPy for raw speed when you only need positional rankings and no tie logic.

Remember: ordering and ranking are deceptively powerful. Many data-cleaning and feature-engineering problems collapse into a small set of sort-and-rank operations. Next up, you might combine ranking with rolling/window calculations or convert ranks into categorical bins for models — both natural continuations from this lesson.

"If you ever feel lost in a messy table, sort it. If you need meaning beyond order, rank it."
