Data Structures and Iteration
Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.
Generators and yield — the lazy engines of Python
"Remember iterables and iterators? Generators are their cooler, low-memory cousin who drinks espresso and only produces things when you ask."
You're already familiar with iterables and iterators (we covered that just before this) and you've seen how slicing and views give cheap windows into data. Now meet generators: the one-pass, lazy sequences that let you process huge data without turning your machine into a sad, swapped-out potato.
What are generators, and why they matter for data work
- Generator: a construct that yields values one at a time, pausing its state between values. A generator object is an iterator (it implements the iterator protocol: __iter__ and __next__), but it's created far more concisely.
- Why care? For data science and AI workflows we often handle streams: massive CSVs, log files, simulation outputs. Generators let you process these streams without loading everything into memory.
Think of a generator as a water tap: you get water (values) when you open it; you don't have to fill the bathtub (list) first.
Generator function vs generator expression
Generator function (with yield)
Use when you need multiple lines, state, or complex logic.
def gen_fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

# Usage
for x in gen_fibonacci(5):
    print(x)  # 0 1 1 2 3
Generator expression (like a list comprehension but lazy)
Great for simple transformations.
squares = (x*x for x in range(10))
print(next(squares)) # 0
Micro explanation: Both produce iterators. The generator expression is syntactic sugar; the function with yield gives you full control.
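To make "both produce iterators" concrete, here's a small sketch (gen_squares is a made-up example function) showing that either form gives you an object that is its own iterator and yields the same values:

```python
def gen_squares(n):
    for x in range(n):
        yield x * x

gen_func = gen_squares(3)
gen_expr = (x * x for x in range(3))

# A generator is its own iterator: iter(g) returns g itself
assert iter(gen_func) is gen_func
assert iter(gen_expr) is gen_expr

# Both forms produce the same stream of values
assert list(gen_func) == list(gen_expr) == [0, 1, 4]
```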
yield vs return — the simple but crucial difference
- return exits a function and hands back a value once.
- yield pauses the function, returns a value, and keeps the local state so the function can resume later.
"yield is 'pause and remember', return is 'bye forever'."
This is why generators can produce an unbounded series of values while keeping only a small working set in memory.
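A quick sketch of that idea: because yield pauses rather than exits, a generator can describe an infinite sequence, and you only ever pay for the values you actually request (count_up is an illustrative name, not a stdlib function):

```python
from itertools import islice

def count_up(start=0):
    """Yield start, start+1, start+2, ... forever."""
    n = start
    while True:  # the infinite loop is safe: the function only runs between yields
        yield n
        n += 1

# Only the requested values are ever computed
first_five = list(islice(count_up(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```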
Key properties & things to remember
- Generators are lazy: values are produced only when requested (via next() or a for-loop).
- They are single-pass: once exhausted, you must recreate them to iterate again.
- They don't support slicing or indexing (no len(), no __getitem__). Use itertools.islice if you need slicing-like behavior.
- They keep state between yields (local variables remain intact).
Example of exhaustion:
g = (i for i in range(3))
print(list(g)) # [0, 1, 2]
print(list(g)) # [] -- exhausted
Use-case tip: if you need to peek at or reuse the data, convert it to a list (costly) or recreate the generator from its source when you need a second pass.
Real-world data examples (because theory without coffee is sad)
- Streaming a large CSV without loading all rows:
def read_large_csv(path):
    with open(path, 'r') as f:
        for line in f:
            # naive split; for quoted fields, prefer the csv module
            yield line.rstrip('\n').split(',')

for row in read_large_csv('big.csv'):
    process(row)  # rows are handled one-by-one
- Cleaning pipeline using chained generators (memory-friendly):
def read_lines(path):
    with open(path) as f:
        yield from f

def filter_comments(lines):
    for ln in lines:
        if not ln.startswith('#'):
            yield ln

def parse_csv(lines):
    for ln in lines:
        yield ln.strip().split(',')

lines = read_lines('big.csv')
rows = parse_csv(filter_comments(lines))
for row in rows:
    analyze(row)
This pipeline uses tiny amounts of memory and composes cleanly. Contrast with reading the whole file into a list and transforming it — that eats RAM.
Helpful stdlib allies
- itertools — tools to slice, chain, group, and more (e.g., islice, chain, tee, groupby).
- heapq — can merge sorted generator outputs (merge).
- pandas.read_csv(..., chunksize=...) — returns an iterator of DataFrame chunks (under the same lazy philosophy).
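As a small sketch of the heapq ally: merge() consumes several sorted iterables lazily and yields one sorted stream, without materializing any of them (evens and odds here are illustrative helper generators):

```python
import heapq

def evens(limit):
    yield from range(0, limit, 2)

def odds(limit):
    yield from range(1, limit, 2)

# merge() pulls from each generator lazily, yielding a single sorted stream
merged = heapq.merge(evens(6), odds(6))
print(list(merged))  # [0, 1, 2, 3, 4, 5]
```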
Example: getting the first 10 items of a generator:
from itertools import islice
first10 = list(islice(my_generator(), 10))
Note: itertools.tee can create independent iterators from one generator, but it buffers data internally — so it may increase memory usage.
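A quick sketch of that buffering behavior: once one tee'd iterator runs ahead, the values it consumed are kept in memory until the other iterator catches up.

```python
from itertools import tee

source = (x * 10 for x in range(4))
a, b = tee(source)       # two independent iterators over one stream

print(next(a), next(a))  # 0 10
# b has not advanced; tee buffered the values a already consumed
print(list(b))           # [0, 10, 20, 30]
print(list(a))           # [20, 30]
```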
Advanced-ish features (briefly) — send, throw, close
Generators are more than simple value streams:
- .send(value) resumes the generator and makes value the result of the paused yield expression (useful for coroutine-style code).
- .throw(exc) raises an exception inside the generator at the paused yield.
- .close() raises GeneratorExit at the paused yield so the generator can clean up and stop.
These are powerful (and a little magical). For typical data processing you rarely need send/throw; they're more common in asynchronous coroutine patterns.
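For the curious, here's a minimal coroutine-style sketch of .send() (running_average is an illustrative example, not a stdlib tool): each .send() resumes the generator, feeds in a number, and gets back the updated mean.

```python
def running_average():
    """Coroutine: receives numbers via .send() and yields the running mean."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average  # pause; resume when .send(value) is called
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)            # prime the coroutine: advance it to the first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
avg.close()          # shut the generator down cleanly
```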
Comparing generators with lists and views (ties to previous topics)
- Lists: random access, len, memory-hungry. Use when you need to index or iterate multiple times.
- Views / slicing (like numpy / pandas views): cheap windows into existing memory. Good for subsetting without copying.
- Generators: single-pass, lazy, minimal memory. Good for streaming or pipelines.
So: if slicing/views let you avoid copying big arrays, generators avoid loading them at all.
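You can see the memory difference directly with sys.getsizeof (a rough sketch: it measures only the container object itself, not referenced values, but the contrast is still striking):

```python
import sys

big_list = [x for x in range(1_000_000)]  # all values materialized up front
big_gen = (x for x in range(1_000_000))   # nothing computed yet

print(sys.getsizeof(big_list))  # several megabytes
print(sys.getsizeof(big_gen))   # a tiny fixed size, regardless of the range
```

The generator's size doesn't grow with the range: it stores only its paused frame, not the values.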
Common pitfalls and how to avoid them
- Accidentally exhausting a generator by converting to list early.
- Fix: only convert when necessary, or recreate generator when needed again.
- Assuming len() or indexing works. It doesn't.
- Use islice or collect into list if you absolutely need slicing.
- Using itertools.tee indiscriminately — it buffers and can use lots of memory.
Quick takeaways — the bits to tattoo on your brain
- Generators = lazy, single-pass iterators. Use them to process large or infinite sequences without big memory bills.
- yield pauses and saves state — unlike return.
- Combine generator functions and generator expressions for readable, memory-efficient pipelines.
- When you need random access, length, or multiple passes, use lists or views instead.
"If your data doesn't fit in memory, generators let you keep working. If it does, they still make your code elegant — and your machine less sad."
Next steps (practice)
- Convert a list-processing script into a generator pipeline.
- Use itertools.islice to sample the first N items from a generator.
- Try reading a 1GB CSV via a generator and calculating column stats without pandas (then compare performance).
If you enjoyed this, the natural next step is exploring coroutine patterns and async generators — powerful for streaming data and networked ML pipelines.