Data Structures and Iteration

Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.

Python Sets and Set Operations: Fast Unique Values & Ops

Sets and Set Operations — Python's Unordered, Deduplicating Superpower

"Remember dictionaries from last time? Sets are like the dictionary keys-only club: fast, unique, and a bit antisocial."

You're already comfortable with tuples (immutability vibes) and dictionaries (key-based magic & dict comprehensions). Sets slot neatly between them: they give you unique, hash-based, unordered collections and a beautiful toolbox of mathematical operations (union, intersection, difference) that make many data tasks delightfully simple.

Why sets matter in data work

Remove duplicates quickly (de-dup lists of IDs, emails, or labels).
Fast membership checks: testing x in some_set takes O(1) time on average, versus O(n) for a list, because sets hash their elements the same way dicts hash their keys.
Relationship math: intersections and differences map directly to questions like "which users are in A and B?" or "who's only in A but not B?" — common in feature engineering, label comparisons, and exploratory data analysis.

Quick reminder: what sets are

A set is unordered — no index-based access.
Elements must be hashable (so no lists, but tuples are fine).
Mutable by default (add/remove), but there is an immutable cousin: frozenset (hello, tuple sibling).
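Here is a minimal sketch of those three properties in action. The variable names are made up for illustration only.

```python
# Illustrative names only; nothing here comes from a real dataset.
tags = {"python", "sets", "python"}   # the duplicate "python" collapses
print(len(tags))                      # 2

# 1) Unordered: no index-based access.
try:
    tags[0]
    indexing_error = None
except TypeError as exc:
    indexing_error = exc              # 'set' object is not subscriptable

# 2) Elements must be hashable: a tuple works, a list does not.
points = {(0, 0), (1, 2)}
try:
    {[0, 0]}
    unhashable_error = None
except TypeError as exc:
    unhashable_error = exc            # unhashable type: 'list'

# 3) set is mutable, frozenset is not (it has no .add method).
tags.add("data")
frozen = frozenset(tags)
can_add_to_frozen = hasattr(frozen, "add")   # False
```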

Basic set operations (with code you can brag about)

Python literal:

# create sets
s = {1, 2, 3}
empty = set()  # {} makes an empty dict, not an empty set

# add / remove
s.add(4)       # {1,2,3,4}
s.remove(2)    # KeyError if missing
s.discard(9)   # no error if missing

# membership
if 3 in s:
    print('fast check')

# set comprehension (like dict comprehensions, but sety)
squares = {x*x for x in range(6)}  # {0,1,4,9,16,25}

Note the set comprehension — think of it as the extroverted cousin of dict comprehensions you met earlier.
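The difference between remove and discard is easy to forget, so here is a small sketch of both on a missing element. The value 99 is arbitrary.

```python
s = {1, 2, 3}

# remove() insists the element exists...
try:
    s.remove(99)
    removed = True
except KeyError:
    removed = False      # 99 was never in the set

# ...while discard() is a silent no-op when it does not.
s.discard(99)

# add() is idempotent: adding an existing element changes nothing.
s.add(3)
print(s == {1, 2, 3})    # True
```

A reasonable rule of thumb: reach for remove when a missing element signals a bug, and discard when you only care that the element is gone afterwards.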

The mathematical ops (read: your new best friends)

Assume A and B are sets.

Union: A | B — everything in A or B
Intersection: A & B — items in both A and B
Difference: A - B — items in A but not in B
Symmetric difference: A ^ B — items in A or B but not both

A = {'alice', 'bob', 'carol'}
B = {'bob', 'dave'}

A | B        # {'alice','bob','carol','dave'}
A & B        # {'bob'}
A - B        # {'alice','carol'}
A ^ B        # {'alice','carol','dave'}

Micro explanation: intersection answers "who is shared?" — perfect for comparing two label sets or user cohorts.
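Each operator also has a method form, and the two are not quite interchangeable. The sketch below reuses A and B from above and adds a hypothetical extra name, 'erin', to show that the methods accept any iterable while the operators insist on sets.

```python
A = {"alice", "bob", "carol"}
B = {"bob", "dave"}

# Method forms accept any iterable, not just sets.
shared = A.intersection(["bob", "dave"])      # {'bob'}
everyone = A.union(("dave", "erin"))          # adds 'dave' and 'erin'

# Operator forms require both operands to be sets (or frozensets).
try:
    A & ["bob", "dave"]
    operator_ok = True
except TypeError:
    operator_ok = False

# In-place variants mutate the left-hand set instead of building a new one.
seen = set(A)      # copy so A itself is left untouched
seen |= B          # same as seen.update(B)
seen -= {"alice"}  # same as seen.difference_update({"alice"})
```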

Real-world mini use cases

Deduplicate email list fast:

emails = ['a@x.com','b@y.com','a@x.com']
unique_emails = list(set(emails))  # duplicates gone; resulting order is arbitrary

Find common customers between two campaigns:

# assumes df_A and df_B are pandas DataFrames with a customer_id column
campaign_A = set(df_A.customer_id)
campaign_B = set(df_B.customer_id)
common = campaign_A & campaign_B

Find features present in one dataset but missing in another:

# assumes train and test are pandas DataFrames
features_train = set(train.columns)
features_test  = set(test.columns)
missing_in_test = features_train - features_test

These are exactly the kinds of practical, repetitive tasks you used to write 10-line loops for — now one set op does it.
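If you want to try the cohort comparison without a DataFrame at hand, the following sketch uses plain lists of made-up customer IDs as stand-ins for the columns above.

```python
# Hypothetical stand-ins for df_A.customer_id and df_B.customer_id.
campaign_a_ids = [101, 102, 103, 103, 104]
campaign_b_ids = [103, 104, 105]

campaign_a = set(campaign_a_ids)
campaign_b = set(campaign_b_ids)

common = campaign_a & campaign_b           # reached by both campaigns
only_a = campaign_a - campaign_b           # reached by A only
only_b = campaign_b - campaign_a           # reached by B only
either_not_both = campaign_a ^ campaign_b  # reached by exactly one campaign

print(sorted(common))            # [103, 104]
print(sorted(only_a))            # [101, 102]
print(sorted(only_b))            # [105]
print(sorted(either_not_both))   # [101, 102, 105]
```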

Complexity & performance (you asked for this in a whisper)

Membership (x in s): average O(1) — same reason dict lookups are fast: hashing.
Add / remove: average O(1).
Set operations are roughly linear on average: union is O(len(A) + len(B)), intersection is about O(min(len(A), len(B))), and difference is about O(len(A)). They all iterate under the hood.

So when you need to check membership for thousands of items repeatedly, sets are dramatically faster than lists.
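You can measure the gap yourself with the standard-library timeit module. This is a rough sketch, not a benchmark: the collection size and repeat count are arbitrary, and absolute timings depend on your machine.

```python
import timeit

n = 10_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1   # worst case for the list: it must scan every element

# Time 200 membership checks against each container.
list_time = timeit.timeit(lambda: missing in as_list, number=200)
set_time = timeit.timeit(lambda: missing in as_set, number=200)

print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

Both checks return the same answer; only the amount of work differs. The list scans element by element, while the set hashes the value once and probes for it directly.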

Pitfalls & gotchas — because debugging is character building

Unhashable elements: lists, dicts inside a set? Not allowed. Use tuples or frozenset for nested collections.

# this fails with TypeError: unhashable type: 'list'
# bad = {[1,2], [3,4]}

# this works
good = {(1, 2), (3, 4)}
# or for nested sets
nested = {frozenset({1,2}), frozenset({3,4})}

Order is not preserved: converting set -> list gives arbitrary order. If deterministic output is needed, sort it: sorted(set_obj).
Empty brackets {} create dicts: remember to use set() for empty sets.
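Mutable elements: the built-in mutable containers (list, dict, set) are unhashable, so putting one in a set raises a TypeError. Instances of your own classes are hashable by default, but if a custom __hash__ depends on fields you later mutate, the set can no longer find the object reliably.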
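Two of these gotchas are quick to verify. The sketch below uses made-up email addresses. It also shows dict.fromkeys as one way to de-duplicate while keeping first-seen order, which works because dicts preserve insertion order (Python 3.7+).

```python
# Pitfall: {} is an empty dict, set() is an empty set.
print(type({}).__name__)      # dict
print(type(set()).__name__)   # set

# Pitfall: set order is arbitrary, so pick an order on purpose.
emails = ["zoe@x.com", "amy@y.com", "zoe@x.com"]   # made-up sample data
unique_sorted = sorted(set(emails))                # alphabetical
unique_first_seen = list(dict.fromkeys(emails))    # first-seen order

print(unique_sorted)        # ['amy@y.com', 'zoe@x.com']
print(unique_first_seen)    # ['zoe@x.com', 'amy@y.com']
```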

Also, remember tuples are immutable — like the calm, reliable sibling. If you need an immutable set (e.g., as a dict key), use frozenset. That's where your knowledge of tuples/immutability helps: immutability enables hashing.

Advanced notes: frozenset & using sets as keys

Use frozenset when you need a set-like object that is itself hashable (e.g., as a key in a dict or an element of another set).

s = frozenset([1,2,3])
mydict = {s: 'value'}

This is handy in caching set operations or memoizing results keyed by a group of items.
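As a sketch of that caching idea, here is a dict keyed by groups of items. The names basket_totals and basket_key are hypothetical helpers invented for this example.

```python
# Hypothetical cache: maps a group of item names to a precomputed value.
basket_totals = {}

def basket_key(items):
    """Build an order-insensitive, hashable key from any iterable of items."""
    return frozenset(items)

basket_totals[basket_key(["apple", "bread"])] = 5.50

# The same items in a different order hit the same entry.
print(basket_totals[basket_key(["bread", "apple"])])   # 5.5

# A plain set cannot be used as a key.
try:
    {set(["apple"]): 1}
    plain_set_ok = True
except TypeError:
    plain_set_ok = False
```

One caveat with this design: a frozenset drops duplicates, so a basket with two apples gets the same key as a basket with one. If quantities matter, a different key is needed, for example a sorted tuple.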

Quick exercises (try these in a notebook)

Given two lists of product IDs, write a one-liner to get IDs present in both lists and sorted.
Convert a list of tuples representing edges into a set of frozensets so that edge direction doesn't matter: (a, b) and (b, a) should count as the same edge.
Use a set comprehension to create the set of lowercase unique words from a sentence.

Answers (no peeking until you've tried):

sorted(set(list1) & set(list2))
{frozenset(edge) for edge in edges}
{w.lower() for w in sentence.split()}
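To check the answers, you can run them on small inputs. The data below is made up. Note that sentence.split() only splits on whitespace, so real text with punctuation would need extra cleaning.

```python
# Made-up inputs for the three exercises.
list1 = [30, 10, 20, 10]
list2 = [20, 40, 10]
edges = [("a", "b"), ("b", "a"), ("b", "c")]
sentence = "The cat saw the other cat"

both_sorted = sorted(set(list1) & set(list2))
undirected = {frozenset(edge) for edge in edges}
words = {w.lower() for w in sentence.split()}

print(both_sorted)        # [10, 20]
print(len(undirected))    # 2
print(sorted(words))      # ['cat', 'other', 'saw', 'the']
```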

Key takeaways (the tiny chant you will whisper before coding)

Sets = unique, unordered collections for fast membership and relation math.
Use set operations for union/intersection/difference tasks — they replace messy loops with clear intent.
Remember hashability: tuples and frozensets are your friends; lists are not.
When you need immutability or dict keys from a set, use frozenset — tie-back to tuples & immutability from the previous section.

"If dictionaries are the social network of Python data structures, sets are the private chat: exclusive members only, fast to check who’s in, and great for finding overlaps."

Go forth and de-duplicate with confidence. Your future self, and your dataset, will thank you.
