Data Structures and Iteration
Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.
Dictionaries and Dict Comprehensions — Fast, Friendly, and Functional
"If lists are grocery bags and tuples are sealed Tupperware, dictionaries are labeled spice jars — wildly useful when you need to find the thing that matches a name."
You're coming from Python Foundations for Data Work and have already met lists (and their shiny list comprehensions) and tuples (those immutably reliable friends). Now we move to the structure that makes lookups instantaneous and your code smell less like a dumpster fire: dictionaries and dict comprehensions.
What is a dictionary and why it matters for data work
- Dictionary: a mutable mapping of keys → values. Keys must be hashable (strings, numbers, tuples... not lists).
- Where it appears in data tasks:
- Feature lookup: map category → index or one-hot vector
- Frequency tables: token → count
- Metadata: column_name → dtype / normalization factor
- Fast joins/merges when you don't want the overhead of pandas
If lists are great for ordered sequences and tuples guarantee safety (immutability), dictionaries are unbeatable for keyed access — O(1) average-time lookups. That’s why they’re everywhere in data pipelines.
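That "fast joins" point can be sketched with a plain dict standing in for a lookup table; the `users`/`orders` data here is invented for illustration:

```python
# A tiny "join" using a dict for O(1) average-time key lookups.
users = {101: 'Ada', 102: 'Grace'}                          # user_id -> name
orders = [(1, 101, 9.99), (2, 102, 4.50), (3, 101, 2.25)]   # (order_id, user_id, total)

# Attach the user name to each order with one dict lookup per row
joined = [(order_id, users.get(user_id, 'unknown'), total)
          for order_id, user_id, total in orders]
```

Each row costs a single hash lookup, so the whole join is linear in the number of orders — no nested scan over `users`.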
Quick reminders from earlier (lists & tuples)
- You used list comprehensions to transform sequences: [x**2 for x in nums]. Expect the same elegant expressiveness with dict comprehensions: {k: v for ...}.
- Tuples can be used as dictionary keys because they’re immutable; lists cannot.
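A quick demonstration of that hashability rule (`grid` is a made-up example):

```python
# Tuples are hashable, so they work as composite keys; lists raise TypeError.
grid = {}
grid[(0, 0)] = 'origin'      # fine: tuple key
grid[(2, 3)] = 'treasure'

is_hashable_error = False
try:
    grid[[2, 3]] = 'nope'    # lists are mutable, hence unhashable
except TypeError:
    is_hashable_error = True
```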
Basic dictionary usage (the easy bits)
Create from literals or two lists:
# literal
d = {'a': 1, 'b': 2}
# from two lists
cols = ['id', 'name', 'age']
values = [101, 'Ada', 29]
row = dict(zip(cols, values)) # {'id':101,'name':'Ada','age':29}
Access safely:
# may raise KeyError
x = d['c']
# safe with default
x = d.get('c', 0)
Update/merge:
d.update({'b': 3, 'c': 4})
# or Python 3.5+: new_d = {**d, **other}
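A minimal sketch of the overwrite semantics when merging — on colliding keys, the right-hand mapping wins, and unpacking leaves the originals untouched (`other` is an illustrative name):

```python
d = {'a': 1, 'b': 2}
other = {'b': 3, 'c': 4}

merged = {**d, **other}     # Python 3.5+ unpacking; 'b' comes from `other`
# merged = d | other        # Python 3.9+ union operator, same overwrite rule
```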
Iteration patterns — choose your weapon:
for k in d: ...                              # keys
for v in d.values(): ...                     # values
for k, v in d.items(): ...                   # both
for i, (k, v) in enumerate(d.items()): ...   # index + items
Sort while iterating:
for k in sorted(d):
    print(k, d[k])
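If you need order by *value* rather than by key, `sorted` with a key function works on `items()` — a small sketch:

```python
counts = {'a': 3, 'b': 7, 'c': 1}
# rank (key, value) pairs by value, descending
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```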
Dict comprehensions: list comprehension's wilder cousin
Syntax mirrors list comprehensions but builds a mapping:
# basic: feature -> normalized value
counts = {'a': 3, 'b': 7, 'c': 0}
total = sum(counts.values())
norm = {k: v/total for k, v in counts.items()}
Filter while building:
# keep only frequent features
freq_filtered = {k: v for k, v in counts.items() if v >= 2}
Conditionals inside values:
# bucketize
buckets = {k: ('high' if v > 5 else 'low') for k, v in counts.items()}
Nested comprehensions (grouping/inverting):
# invert mapping: value -> list of keys that had that value
inv = {}
for k, v in d.items():
    inv.setdefault(v, []).append(k)
# or using a nested comprehension (less efficient — rescans d per value):
inv = {v: [k for k, val in d.items() if val == v] for v in set(d.values())}
When to prefer dict comprehension: when you can build the mapping in a single, readable expression. If you need complex aggregation, a loop or collections.defaultdict/Counter is often clearer.
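As a sketch of that advice, here's a grouping task done with collections.defaultdict — the loop reads more clearly than a nested comprehension would (`rows` is invented sample data):

```python
from collections import defaultdict

rows = [('fruit', 'apple'), ('veg', 'carrot'), ('fruit', 'banana')]

# defaultdict(list) creates the empty list on first access, so no setdefault needed
groups = defaultdict(list)
for category, item in rows:
    groups[category].append(item)
```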
Data-science flavored examples (so you can flex in notebooks)
- Map categorical values to indices (useful before feeding into models):
cats = ['apple', 'banana', 'apple', 'cherry']
cat_to_idx = {cat: i for i, cat in enumerate(sorted(set(cats)))}
# {'apple': 0, 'banana': 1, 'cherry': 2}
- Frequency counts — idiomatic way (Counter) vs manual dict:
from collections import Counter
Counter(cats) # quickest
# manual (good exercise):
counts = {}
for c in cats:
    counts[c] = counts.get(c, 0) + 1
# Normalize with a dict comprehension (compute the total once,
# not inside the comprehension where it would be re-summed per key)
total = sum(counts.values())
normalized = {k: v/total for k, v in counts.items()}
- Feature engineering — rename columns
raw_cols = ['Age (yrs)', 'Salary USD']
clean = {c: c.lower().replace(' ', '_').replace('(', '').replace(')', '')
         for c in raw_cols}
# {'Age (yrs)': 'age_yrs', 'Salary USD': 'salary_usd'}
Advanced tips & gotchas
- Keys must be hashable: strings, numbers, tuples ok; lists and dicts not allowed.
- If you need multiple values per key, store lists or use defaultdict(list).
- Performance: dict lookups are O(1) on average — perfect for joins and lookups.
- Beware colliding keys when merging: later keys overwrite earlier ones.
- For frequency tasks, prefer collections.Counter or defaultdict for clarity and speed.
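For instance, Counter gives you ranked frequencies for free (`tokens` is a toy example):

```python
from collections import Counter

tokens = ['to', 'be', 'or', 'not', 'to', 'be']
freq = Counter(tokens)           # token -> count, just like a manual dict
top_two = freq.most_common(2)    # highest counts first; ties keep first-seen order
```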
Quick comparison: list vs tuple vs dict (in one glance)
- List: ordered, mutable — good for sequences
- Tuple: ordered, immutable — safe as dict keys
- Dictionary: insertion-ordered (since Python 3.7) mapping key→value — fast lookups and labeled data
Best practices for data projects
- Use dict comprehensions for readable mapping transforms and small lookups.
- Use Counter/defaultdict for aggregations; use comprehensions for final transformations.
- Keep keys simple and consistent (strings or tuples). Keys that are objects can be fragile when pickling or across sessions.
- Document what keys mean — dictionaries are flexible but can become cryptic messes if keys are used inconsistently.
Key takeaways
- Dictionaries are the go-to structure for labeled, fast-access data.
- Dict comprehensions give you declarative power like list comprehensions, letting you map and filter in one line.
- Use tuple keys when you need composite keys (they're immutable and hashable). Use defaultdict/Counter when aggregating.
"Think of a dict as the indexed index of your data — you can call things by name instead of rummaging through every row."
Go practice: convert a CSV header & row into a dict (zip), then write a dict comprehension to normalize numeric columns and filter out low-quality features. That combo bridges your Python Foundations into real, clean data work.
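One possible sketch of that exercise, with illustrative column names — here "normalize" means dividing by the maximum value, and dropping the non-numeric `id` column stands in for filtering low-quality features:

```python
# Header and row as they might come from csv.reader (all strings)
header = ['id', 'score', 'weight']
row = ['7', '0.5', '12.0']
record = dict(zip(header, row))

# Keep numeric columns, convert, then normalize by the max value
numeric = {k: float(v) for k, v in record.items() if k != 'id'}
max_val = max(numeric.values())
normalized = {k: v / max_val for k, v in numeric.items()}
```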
Happy mapping. When in doubt, enumerate + items() + a bit of comprehension will rescue 90% of your code smell.