Data Structures and Iteration
Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.
Sets and Set Operations
Sets and Set Operations — Python's Unordered, Deduplicating Superpower
"Remember dictionaries from last time? Sets are like the dictionary keys-only club: fast, unique, and a bit antisocial."
You're already comfortable with tuples (immutability vibes) and dictionaries (key-based magic & dict comprehensions). Sets slot neatly between them: they give you unique, hash-based, unordered collections and a beautiful toolbox of mathematical operations (union, intersection, difference) that make many data tasks delightfully simple.
Why sets matter in data work
- Remove duplicates quickly (de-dup lists of IDs, emails, or labels).
- Fast membership checks: testing if x in collection is typically O(1) average time — same principle as dict keys.
- Relationship math: intersections and differences map directly to questions like "which users are in A and B?" or "who's only in A but not B?" — common in feature engineering, label comparisons, and exploratory data analysis.
Quick reminder: what sets are
- A set is unordered — no index-based access.
- Elements must be hashable (so no lists, but tuples are fine).
- Mutable by default (add/remove), but there is an immutable cousin: frozenset (hello, tuple sibling).
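Here is a minimal sketch of those three properties in action. The variable names (tags, frozen) are made up for illustration.

```python
# Duplicates collapse, and order plays no part in equality.
tags = {"red", "blue", "red"}
print(len(tags))                       # 2
print({1, 2, 3} == {3, 2, 1})          # True

# No positional access: indexing a set raises TypeError.
try:
    tags[0]
    indexable = True
except TypeError:
    indexable = False
print(indexable)                       # False

# frozenset holds the same elements but has no add/remove methods.
frozen = frozenset(tags)
print(frozen == tags)                  # True
print(hasattr(frozen, "add"))          # False
```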
Basic set operations (with code you can brag about)
Set literals and the basic methods:
# create sets
s = {1, 2, 3}
empty = set() # {} makes an empty dict, not an empty set
# add / remove
s.add(4) # {1,2,3,4}
s.remove(2) # KeyError if missing
s.discard(9) # no error if missing
# membership
if 3 in s:
    print('fast check')
# set comprehension (like dict comprehensions, but sety)
squares = {x*x for x in range(6)} # {0,1,4,9,16,25}
Note the set comprehension — think of it as the extroverted cousin of dict comprehensions you met earlier.
The mathematical ops (read: your new best friends)
Assume A and B are sets.
- Union: A | B — everything in A or B
- Intersection: A & B — items in both A and B
- Difference: A - B — items in A but not in B
- Symmetric difference: A ^ B — items in A or B but not both
A = {'alice', 'bob', 'carol'}
B = {'bob', 'dave'}
A | B # {'alice','bob','carol','dave'}
A & B # {'bob'}
A - B # {'alice','carol'}
A ^ B # {'alice','carol','dave'}
Micro explanation: intersection answers "who is shared?" — perfect for comparing two label sets or user cohorts.
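Each operator also has a method form. A small sketch, reusing the A and B from above, of one practical difference: the methods accept any iterable, while the operators want sets on both sides.

```python
A = {'alice', 'bob', 'carol'}
B = {'bob', 'dave'}

# Method forms give the same results as the operators.
assert A.union(B) == A | B
assert A.intersection(B) == A & B
assert A.difference(B) == A - B
assert A.symmetric_difference(B) == A ^ B

# Methods accept any iterable, so no explicit set() conversion is needed.
shared = A.intersection(['bob', 'zed'])
print(shared)  # {'bob'}

# Operators require a set on both sides; a list raises TypeError.
try:
    A & ['bob']
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```

The method form is handy when the other side is a list or generator you would otherwise have to wrap in set() first.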
Real-world mini use cases
- Deduplicate email list fast:
emails = ['a@x.com','b@y.com','a@x.com']
unique_emails = list(set(emails))
- Find common customers between two campaigns:
campaign_A = set(df_A.customer_id)
campaign_B = set(df_B.customer_id)
common = campaign_A & campaign_B
- Find features present in one dataset but missing in another:
features_train = set(train.columns)
features_test = set(test.columns)
missing_in_test = features_train - features_test
These are exactly the kinds of practical, repetitive tasks you used to write 10-line loops for — now one set op does it.
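One caveat on the email example: list(set(...)) does not promise to keep the original order. If order matters, a possible alternative (a sketch, not the only option) leans on the dictionaries from last time, since dict keys are unique and keep insertion order in Python 3.7+. The sample addresses are made up.

```python
emails = ['a@x.com', 'b@y.com', 'a@x.com', 'c@z.com']

# Removes duplicates, but the resulting order is arbitrary.
unique_any_order = list(set(emails))

# dict.fromkeys keeps the first occurrence of each address, in order.
unique_in_order = list(dict.fromkeys(emails))
print(unique_in_order)  # ['a@x.com', 'b@y.com', 'c@z.com']
```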
Complexity & performance (you asked for this in a whisper)
- Membership (x in s): average O(1) — same reason dict lookups are fast: hashing.
- Add / remove: average O(1).
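- Set operations are roughly linear in the input sizes: union is about O(len(A) + len(B)), intersection about O(min(len(A), len(B))) on average, and difference about O(len(A)), because they iterate and hash under the hood.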
So when you need to check membership for thousands of items repeatedly, sets are dramatically faster than lists.
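To see the gap yourself, here is a rough timing sketch using the standard timeit module. The sizes are arbitrary and the exact numbers depend on your machine, so treat the output as indicative only.

```python
import timeit

n = 10_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # worst case for the list: every element gets compared

# Time 1,000 membership checks against each container.
list_time = timeit.timeit(lambda: missing in as_list, number=1_000)
set_time = timeit.timeit(lambda: missing in as_set, number=1_000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The list has to scan all 10,000 elements on every miss, while the set does a single hash lookup.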
Pitfalls & gotchas — because debugging is character building
- Unhashable elements: lists, dicts inside a set? Not allowed. Use tuples or frozenset for nested collections.
# this fails
# bad = {[1,2], [3,4]}
# this works
good = {tuple([1,2]), tuple([3,4])}
# or for nested sets
nested = {frozenset({1,2}), frozenset({3,4})}
- Order is not preserved: converting set -> list gives arbitrary order. If deterministic output is needed, sort it: sorted(set_obj).
- Empty brackets {} create dicts: remember to use set() for empty sets.
- Hashability is the real rule: the built-in mutable containers (list, dict, set) are unhashable, so putting one inside a set raises a TypeError.
Also, remember tuples are immutable — like the calm, reliable sibling. If you need an immutable set (e.g., as a dict key), use frozenset. That's where your knowledge of tuples/immutability helps: immutability enables hashing.
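A short sketch pulling the gotchas together, assuming a hypothetical list-of-lists input called rows:

```python
rows = [[1, 2], [3, 4], [1, 2]]  # hypothetical input with a duplicate row

# Lists are unhashable, so convert each row to a tuple before building the set.
unique_rows = {tuple(row) for row in rows}

# Sort to get a reproducible order before printing or saving.
ordered = sorted(unique_rows)
print(ordered)  # [(1, 2), (3, 4)]

# {} is an empty dict; set() is the empty set.
print(type({}).__name__, type(set()).__name__)  # dict set
```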
Advanced notes: frozenset & using sets as keys
- Use frozenset when you need a set-like object that is itself hashable (e.g., as a key in a dict or an element of another set).
s = frozenset([1,2,3])
mydict = {s: 'value'}
This is handy in caching set operations or memoizing results keyed by a group of items.
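As a hedged illustration of that memoizing idea: the cache dict and the group_score function below are hypothetical, and the sum of lengths is only a stand-in for some expensive computation.

```python
# Hypothetical cache keyed by a group of items, ignoring order and duplicates.
cache = {}

def group_score(items):
    """Return a score for a group of strings, computed once per distinct group."""
    key = frozenset(items)
    if key not in cache:
        # Stand-in for costly work: total length of the distinct items.
        cache[key] = sum(len(item) for item in key)
    return cache[key]

first = group_score(['a', 'bb'])
second = group_score(['bb', 'a', 'a'])  # same group, different order
print(first, second, len(cache))  # 3 3 1
```

Both calls map to the same frozenset key, so the work runs only once.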
Quick exercises (try these in a notebook)
- Given two lists of product IDs, write a one-liner to get IDs present in both lists and sorted.
- Convert a list of tuples representing edges into a set of frozensets so that edge order doesn't matter: (a, b) and (b, a) should count as the same edge.
- Use a set comprehension to create the set of lowercase unique words from a sentence.
Answers (no peeking until you've tried):
- sorted(set(list1) & set(list2))
- {frozenset(edge) for edge in edges}
- {w.lower() for w in sentence.split()}
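To check your answers, here are the three one-liners run on small made-up inputs (list1, list2, edges, and sentence are sample data, not part of the exercise statement):

```python
list1 = [3, 1, 2, 3]
list2 = [2, 3, 4]
common_sorted = sorted(set(list1) & set(list2))
print(common_sorted)  # [2, 3]

edges = [('a', 'b'), ('b', 'a'), ('b', 'c')]
unique_edges = {frozenset(edge) for edge in edges}
print(len(unique_edges))  # 2

sentence = "The cat saw the other cat"
words = {w.lower() for w in sentence.split()}
print(sorted(words))  # ['cat', 'other', 'saw', 'the']
```

Note that split() only breaks on whitespace, so punctuation attached to a word would stay attached.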
Key takeaways (the tiny chant you will whisper before coding)
- Sets = unique, unordered collections for fast membership and relation math.
- Use set operations for union/intersection/difference tasks — they replace messy loops with clear intent.
- Remember hashability: tuples and frozensets are your friends; lists are not.
- When you need immutability or dict keys from a set, use frozenset — tie-back to tuples & immutability from the previous section.
"If dictionaries are the social network of Python data structures, sets are the private chat: exclusive members only, fast to check who’s in, and great for finding overlaps."
Go forth and de-duplicate with confidence. Your future self, and your dataset, will thank you.