Courses/Python for Data Science, AI & Development/Data Structures and Iteration

Data Structures and Iteration

41530 views

Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.

Content

9 of 15

Sorting and Custom Keys

Sorting with Custom Keys in Python for Data Science

2171 views

beginner

python

data-science

sorting

gpt-5-mini

2171 views

Versions:

Sorting with Custom Keys in Python for Data Science

Watch & Learn

AI-discovered learning video

Start learning for free

Bookmark content and pick up later
AI-generated study materials
Flashcards, timelines, and more
Progress tracking and certificates

Free to join · No credit card required

Sorting and Custom Keys — Make Python Sort Like a Data Scientist

Ever noticed how a leaderboard can be beautifully ordered one moment and catastrophically wrong the next because someone used lexicographic sorting on numbers? Welcome to the low-key chaos of sorting data. We're going to tame it.

"Sorting is the quiet magician of data pipelines — it doesn't get applause, but everything breaks without it."

This lesson assumes you already know how to iterate with tools like enumerate, zip, and how to produce data lazily with generators. We'll build on that: use enumerate to keep original indices, zip to sort parallel arrays, and remember that generators need to be materialized (list(...)) before sorting.

What this page covers

How Python's sorted() and list.sort() use key functions to control ordering.
Practical patterns for data science: sorting dicts, rankings, top-k, stable multi-key sorts.
Performance tips (decorate-sort-undecorate, operator helpers, heapq for top-k).

Why this matters: sorting underpins ranking, sampling, deduplication, and feature engineering. If you sort wrong, your model gets garbage, your dashboard lies, and your product manager cries.

The basics: sorted(), list.sort(), and the key parameter

sorted(iterable, key=..., reverse=False) returns a new list.
list.sort(key=..., reverse=False) sorts in-place and is slightly faster and memory-friendlier.

The magic is the key function: Python computes key(item) for each item once and sorts by those keys. That single-call-per-item behavior is a performance superpower — use it.

Example: sort people by age

people = [
    {'name': 'Ana', 'age': 34},
    {'name': 'Ben', 'age': 28},
    {'name': 'Cara', 'age': 42}
]

# ascending by age
sorted_people = sorted(people, key=lambda p: p['age'])

# descending by age
sorted_people_desc = sorted(people, key=lambda p: p['age'], reverse=True)

Fast alternative with operator.itemgetter

Use operator.itemgetter when extracting dict keys or tuple indices; it's slightly faster and clearer.

from operator import itemgetter
sorted_people = sorted(people, key=itemgetter('age'))

Sorting parallel lists — zip to the rescue

If you have features and labels in two lists and want to reorder them together, zip + sort is your friend.

features = [3.2, 1.5, 4.8]
labels   = ['C', 'A', 'B']

paired = list(zip(features, labels))
paired_sorted = sorted(paired, key=lambda x: x[0])  # sorts by feature
features_sorted, labels_sorted = zip(*paired_sorted)

If you used this kind of pairing earlier with enumerate/zip, you already know how neat this is.

Multi-key sorts and stability — why order of operations matters

Python's sort is stable — when keys are equal, original order is preserved. You can leverage stability for multi-key sorts by sorting by the least significant key first, then the next, or by returning a tuple of keys.

rows = [
    ('NY', 2019, 10),
    ('NY', 2020, 9),
    ('CA', 2020, 12),
    ('CA', 2019, 15)
]

# Sort by state then year (both ascending)
sorted_by_state_year = sorted(rows, key=lambda r: (r[0], r[1]))

# Or stable two-pass approach: year then state
rows_copy = rows[:]  # if you care about original
rows_copy.sort(key=lambda r: r[1])   # year
rows_copy.sort(key=lambda r: r[0])   # state => final sort is state, ties resolved by year

Tuple keys are concise and common in data work.

Case-insensitive sorts, lengths, and other practical keys

Case-insensitive string sort: key=str.lower
Sort by string length: key=len
Complex key: count of missing values in a row: key=lambda r: sum(1 for v in r if v is None)

names = ['alice', 'Bob', 'Álvaro']
# case-insensitive
sorted(names, key=str.lower)

Avoid redundant work: Decorate-Sort-Undecorate (but Python already does it)

If your key is expensive, you want it computed once per item. Good news: Python's sort already computes key(item) once per element. So this is already implemented for you — no need to manually decorate unless you're doing something unusual or using a custom comparator.

If you do need advanced control (e.g., comparator-based logic), use functools.cmp_to_key.

from functools import cmp_to_key

def compare_by_weird_rule(a, b):
    # returns negative if a<b, positive if a>b, zero if equal
    return (len(a) - len(b))

sorted_strings = sorted(strings, key=cmp_to_key(compare_by_weird_rule))

But prefer key= whenever possible for clarity and speed.

Top-k: use heapq for efficiency

If you only need the top 10 items, don't sort the entire dataset (O(n log n)). Use heapq.nlargest / nsmallest (O(n log k)).

import heapq

vals = [random.random() for _ in range(1000000)]
# top 5
top5 = heapq.nlargest(5, vals)

# top 5 by a key on complex objects
top5_people = heapq.nlargest(5, people, key=lambda p: p['age'])

Sorting dictionaries: most common pattern — sort by value

Dictionaries are unsorted (py3.7+ preserves insertion order but that's different). To rank keys by values:

scores = {'a': 10, 'b': 3, 'c': 30}
# keys sorted by value desc
ranked_keys = sorted(scores, key=scores.get, reverse=True)

# or items sorted by value
ranked_items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

If you need the top-k keys, combine with heapq.nlargest: heapq.nlargest(k, scores, key=scores.get)

Generators and sorting — materialize first

Generators are lazy. sorted(generator) will consume the generator and sort the results; you cannot sort a generator without materializing it. If memory is a concern, consider streaming top-k (heapq) or external sorting.

Practical tips & gotchas

Key functions are called once per element by Python's sort — use that to your advantage.
Use operator.itemgetter/attrgetter for readability and speed.
For descending numeric sorts: prefer reverse=True instead of returning negative numbers (clearer).
Sort stability is your friend for multi-step sorts and tie-breaking.
For huge datasets where memory is limited: use heapq for top-k or external merge sorts.

Quick reference examples

Sort list of tuples by second element: sorted(lst, key=itemgetter(1))
Case-insensitive sort: sorted(names, key=str.lower)
Top-5 by score: heapq.nlargest(5, people, key=lambda p: p['score'])

Key takeaways — what to remember

Use key= to tell Python what to sort by; it’s called once per item, so it's efficient.
Prefer operator helpers and tuple-keys for clarity.
For parallel arrays, zip them, sort the pairs, then unzip.
Use heapq for top-k to avoid full sorts.
Remember sort stability for multi-key logic — it lets you build complex ordering with simple steps.

"Sorting well is like arranging your life: do the expensive decisions once, keep order stable, and don’t re-sort the whole universe when you just want the top five."

Go forth and sort responsibly — your dashboards and models will thank you.

Flashcards

Mind Map

Speed Challenge

Comments (0)

Please sign in to leave a comment.

No comments yet. Be the first to comment!

Ready to practice?

Study with flashcards, timelines, and more

Earn certificates for completed courses

Bookmark content for later reference

Track your progress across all topics