Data Structures and Iteration
Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.
Content
Sorting and Custom Keys
Versions:
Watch & Learn
AI-discovered learning video
Sign in to watch the learning video for this topic.
Sorting and Custom Keys — Make Python Sort Like a Data Scientist
Ever noticed how a leaderboard can be beautifully ordered one moment and catastrophically wrong the next because someone used lexicographic sorting on numbers? Welcome to the low-key chaos of sorting data. We're going to tame it.
"Sorting is the quiet magician of data pipelines — it doesn't get applause, but everything breaks without it."
This lesson assumes you already know how to iterate with tools like enumerate, zip, and how to produce data lazily with generators. We'll build on that: use enumerate to keep original indices, zip to sort parallel arrays, and remember that generators need to be materialized (list(...)) before sorting.
What this page covers
- How Python's sorted() and list.sort() use key functions to control ordering.
- Practical patterns for data science: sorting dicts, rankings, top-k, stable multi-key sorts.
- Performance tips (decorate-sort-undecorate, operator helpers, heapq for top-k).
Why this matters: sorting underpins ranking, sampling, deduplication, and feature engineering. If you sort wrong, your model gets garbage, your dashboard lies, and your product manager cries.
The basics: sorted(), list.sort(), and the key parameter
- sorted(iterable, key=..., reverse=False) returns a new list.
- list.sort(key=..., reverse=False) sorts in-place and is slightly faster and memory-friendlier.
The magic is the key function: Python computes key(item) for each item once and sorts by those keys. That single-call-per-item behavior is a performance superpower — use it.
Example: sort people by age
people = [
{'name': 'Ana', 'age': 34},
{'name': 'Ben', 'age': 28},
{'name': 'Cara', 'age': 42}
]
# ascending by age
sorted_people = sorted(people, key=lambda p: p['age'])
# descending by age
sorted_people_desc = sorted(people, key=lambda p: p['age'], reverse=True)
Fast alternative with operator.itemgetter
Use operator.itemgetter when extracting dict keys or tuple indices; it's slightly faster and clearer.
from operator import itemgetter
sorted_people = sorted(people, key=itemgetter('age'))
Sorting parallel lists — zip to the rescue
If you have features and labels in two lists and want to reorder them together, zip + sort is your friend.
features = [3.2, 1.5, 4.8]
labels = ['C', 'A', 'B']
paired = list(zip(features, labels))
paired_sorted = sorted(paired, key=lambda x: x[0]) # sorts by feature
features_sorted, labels_sorted = zip(*paired_sorted)
If you used this kind of pairing earlier with enumerate/zip, you already know how neat this is.
Multi-key sorts and stability — why order of operations matters
Python's sort is stable — when keys are equal, original order is preserved. You can leverage stability for multi-key sorts by sorting by the least significant key first, then the next, or by returning a tuple of keys.
rows = [
('NY', 2019, 10),
('NY', 2020, 9),
('CA', 2020, 12),
('CA', 2019, 15)
]
# Sort by state then year (both ascending)
sorted_by_state_year = sorted(rows, key=lambda r: (r[0], r[1]))
# Or stable two-pass approach: year then state
rows_copy = rows[:] # if you care about original
rows_copy.sort(key=lambda r: r[1]) # year
rows_copy.sort(key=lambda r: r[0]) # state => final sort is state, ties resolved by year
Tuple keys are concise and common in data work.
Case-insensitive sorts, lengths, and other practical keys
- Case-insensitive string sort: key=str.lower
- Sort by string length: key=len
- Complex key: count of missing values in a row: key=lambda r: sum(1 for v in r if v is None)
names = ['alice', 'Bob', 'Álvaro']
# case-insensitive
sorted(names, key=str.lower)
Avoid redundant work: Decorate-Sort-Undecorate (but Python already does it)
If your key is expensive, you want it computed once per item. Good news: Python's sort already computes key(item) once per element. So this is already implemented for you — no need to manually decorate unless you're doing something unusual or using a custom comparator.
If you do need advanced control (e.g., comparator-based logic), use functools.cmp_to_key.
from functools import cmp_to_key
def compare_by_weird_rule(a, b):
# returns negative if a<b, positive if a>b, zero if equal
return (len(a) - len(b))
sorted_strings = sorted(strings, key=cmp_to_key(compare_by_weird_rule))
But prefer key= whenever possible for clarity and speed.
Top-k: use heapq for efficiency
If you only need the top 10 items, don't sort the entire dataset (O(n log n)). Use heapq.nlargest / nsmallest (O(n log k)).
import heapq
vals = [random.random() for _ in range(1000000)]
# top 5
top5 = heapq.nlargest(5, vals)
# top 5 by a key on complex objects
top5_people = heapq.nlargest(5, people, key=lambda p: p['age'])
Sorting dictionaries: most common pattern — sort by value
Dictionaries are unsorted (py3.7+ preserves insertion order but that's different). To rank keys by values:
scores = {'a': 10, 'b': 3, 'c': 30}
# keys sorted by value desc
ranked_keys = sorted(scores, key=scores.get, reverse=True)
# or items sorted by value
ranked_items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
If you need the top-k keys, combine with heapq.nlargest: heapq.nlargest(k, scores, key=scores.get)
Generators and sorting — materialize first
Generators are lazy. sorted(generator) will consume the generator and sort the results; you cannot sort a generator without materializing it. If memory is a concern, consider streaming top-k (heapq) or external sorting.
Practical tips & gotchas
- Key functions are called once per element by Python's sort — use that to your advantage.
- Use operator.itemgetter/attrgetter for readability and speed.
- For descending numeric sorts: prefer reverse=True instead of returning negative numbers (clearer).
- Sort stability is your friend for multi-step sorts and tie-breaking.
- For huge datasets where memory is limited: use heapq for top-k or external merge sorts.
Quick reference examples
- Sort list of tuples by second element: sorted(lst, key=itemgetter(1))
- Case-insensitive sort: sorted(names, key=str.lower)
- Top-5 by score: heapq.nlargest(5, people, key=lambda p: p['score'])
Key takeaways — what to remember
- Use key= to tell Python what to sort by; it’s called once per item, so it's efficient.
- Prefer operator helpers and tuple-keys for clarity.
- For parallel arrays, zip them, sort the pairs, then unzip.
- Use heapq for top-k to avoid full sorts.
- Remember sort stability for multi-key logic — it lets you build complex ordering with simple steps.
"Sorting well is like arranging your life: do the expensive decisions once, keep order stable, and don’t re-sort the whole universe when you just want the top five."
Go forth and sort responsibly — your dashboards and models will thank you.
Comments (0)
Please sign in to leave a comment.
No comments yet. Be the first to comment!