Data Structures and Iteration
Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.
Iterables and Iterators
Iterables and Iterators — the Rhythm Section of Python Data Work
"If a list is a playlist, an iterator is the DJ who plays one song at a time." — Your friendly, slightly dramatic TA
You're coming from slicing and views and sets and set operations, so you already know how Python stores and manipulates collections. Now we zoom in on how Python walks through those collections: iterables and iterators. This is the choreography behind for-loops, comprehensions, generator expressions, and many memory-efficient data patterns used in data science and AI.
Why this matters for Data Science
- Large datasets: you often can't load everything into memory — iteration lets you process rows, batches, or streams lazily.
- Pipelines: libraries like pandas, itertools, and many ML data loaders use iterators to build memory-efficient data flows.
- Clarity & control: understanding when Python creates copies (like slicing) vs. streams (iterators) helps avoid performance surprises.
This builds on the earlier discussion of views vs copies: views reduce memory by referencing the same block; iterators reduce memory by producing items on demand.
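A quick size check makes the difference concrete. This sketch compares a fully materialized list against a generator over the same values (exact byte counts are CPython-specific and vary by version):

```python
import sys

# A list materializes every element up front; a generator object only stores
# its running state, no matter how many items it will eventually produce.
squares_list = [x * x for x in range(100_000)]
squares_gen = (x * x for x in range(100_000))

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)  # list: hundreds of kilobytes; generator: ~200 bytes
```

Note that getsizeof only measures the container itself, not the integers inside, so the real gap is even larger than these numbers suggest.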
Quick definitions (no fluff)
- Iterable: any Python object you can loop over (it defines __iter__, or supports the sequence protocol via __getitem__). Examples: list, tuple, set, dict, string, range, generator.
- Iterator: an object that produces the next value when asked (it implements __next__, plus an __iter__ that returns itself).
Micro explanation
- Iterable = the playlist (a collection of songs).
- Iterator = the DJ (keeps track of what's next, plays and moves forward).
When you call iter(my_iterable) you get an iterator (a DJ starts spinning). When the iterator runs out, it raises StopIteration — that's the DJ saying "the night is over."
How they connect (the protocol)
- iter(iterable) calls iterable.__iter__() -> returns an iterator
- next(iterator) calls iterator.__next__() -> returns the next item, or raises StopIteration when exhausted
Python's for-loop hides this: it calls iter(...) once and repeatedly calls next(...) until StopIteration appears.
# simple iterator usage
numbers = [10, 20, 30]
it = iter(numbers) # get the iterator
print(next(it)) # 10
print(next(it)) # 20
print(next(it)) # 30
# next(it) now -> StopIteration
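The same traversal can be written by hand, which is a reasonable mental model of what a for-loop does behind the scenes:

```python
numbers = [10, 20, 30]
collected = []

it = iter(numbers)           # step 1: ask the iterable for an iterator
while True:
    try:
        item = next(it)      # step 2: pull the next value
    except StopIteration:    # step 3: the iterator is exhausted
        break
    collected.append(item)   # this is the for-loop body

print(collected)  # [10, 20, 30]
```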
Built-in iterables vs explicit iterators
- Sequences like lists and tuples are iterables that create a fresh iterator each time you call iter(). That means you can loop multiple times safely.
- Generators are iterators (they implement __next__ and return themselves from __iter__). They maintain internal state and are exhausted after one pass.
Example contrast:
lst = [1, 2, 3]
for a in lst:
    print(a)
for b in lst:  # works again — list returned a new iterator
    print(b)

gen = (x*x for x in range(3))
for a in gen:
    print(a)
for b in gen:  # prints nothing — generator exhausted
    print(b)
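You can verify this distinction directly in the REPL: a list hands out a brand-new iterator on every iter() call, while a generator returns itself, so its position is shared and permanent:

```python
lst = [1, 2, 3]
gen = (x * x for x in range(3))

# Two iter() calls on a list produce two independent iterator objects.
print(iter(lst) is iter(lst))  # False

# iter() on a generator returns the generator itself: one single-use cursor.
print(iter(gen) is gen)        # True
```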
Why generators are your memory-saving friends
Generators yield one item at a time. For a CSV with millions of rows, a generator-based reader lets you stream rows instead of loading the whole file.
Example: streaming lines from a file
# memory-efficient file processing
with open('big.csv') as f:
    for line in f:  # file object is an iterator
        process(line)
Or create your own generator for batches (very useful for ML training pipelines):
def batcher(iterable, batch_size):
    it = iter(iterable)
    while True:
        batch = []
        try:
            for _ in range(batch_size):
                batch.append(next(it))
        except StopIteration:
            if batch:
                yield batch  # final partial batch
            break
        yield batch  # full batch
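The same idea can be expressed more compactly with itertools.islice. This version (an illustrative sketch, not the definition above) behaves identically, including the final partial batch:

```python
from itertools import islice

def batcher(iterable, batch_size):
    """Yield lists of up to batch_size items until the iterable runs out."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))  # pulls at most batch_size items
        if not batch:                         # empty list means exhaustion
            return
        yield batch

print(list(batcher(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Note that both versions call iter() once up front, so repeated islice/next calls keep consuming from the same cursor rather than restarting.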
Common pitfalls & gotchas (learn them so you don't cry later)
- Exhaustion: generators and many iterators are single-use. If you need multiple passes, either store results (if small) or recreate the iterator.
- Mutating while iterating: changing a list while looping over it can skip or repeat elements. Prefer iterating over a copy (or loop over indices with range).
- Multiple iter() on same object: for some objects (like file objects) calling iter() returns the same iterator; for sequences it returns a fresh one.
Remember from sets: sets are unordered — iterating a set yields elements in an arbitrary order. Don't rely on iteration order unless the type guarantees it (lists and tuples do; dicts preserve insertion order since Python 3.7).
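The fresh-vs-same-iterator pitfall is easy to demonstrate. Here io.StringIO stands in for a real file handle, since file-like objects are their own iterators:

```python
import io

lst = [1, 2, 3]
# Sequences: every iter() call gives an independent iterator.
print(iter(lst) is iter(lst))   # False

# File-like objects: iter() returns the object itself, so two loops over
# the same handle share one read position.
stream = io.StringIO("a\nb\nc\n")
print(iter(stream) is stream)   # True
first = next(stream)            # consumes the first line
print(repr(first))              # 'a\n'
```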
Handy tools in the itertools toolbox
itertools is basically the Swiss Army knife for iterables. A few favorites:
- itertools.islice — slice an iterator lazily, without materializing the underlying sequence in memory
- itertools.chain — treat multiple iterables as one
- itertools.groupby — group consecutive items with equal keys (careful: sort first if you want all equal items in one group)
- itertools.tee — duplicate an iterator (uses internal buffering; not magic)
Example: take the first 10 items from an infinite iterator:
import itertools
infinite = itertools.count()  # yields 0, 1, 2, ... forever
first10 = list(itertools.islice(infinite, 10))
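The other tools follow the same lazy pattern. A short sketch of chain and tee (tee's internal buffer grows with how far the two copies drift apart, so it is cheap only when they stay roughly in step):

```python
import itertools

# chain: treat several iterables as one continuous stream
merged = list(itertools.chain([1, 2], (3, 4), range(5, 7)))
print(merged)  # [1, 2, 3, 4, 5, 6]

# tee: two independent cursors over a single-use iterator
a, b = itertools.tee(iter([10, 20, 30]))
print(next(a), next(a))  # 10 20  (advancing a does not move b)
print(list(b))           # [10, 20, 30]
```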
Practical mini-workflow: reading, filtering, batching
Imagine a pipeline: read lines -> parse -> filter -> batch -> train. Each step should prefer iterators/generators to stay memory efficient.
def parse_lines(f):
    for line in f:
        yield line.strip().split(',')

def filter_valid(rows):
    for r in rows:
        if is_valid(r):
            yield r

with open('data.csv') as fh:
    rows = parse_lines(fh)
    good = filter_valid(rows)
    for batch in batcher(good, 128):
        train_on(batch)
This pattern avoids loading entire files and composes cleanly.
Key takeaways — what to remember
- Iterable = can be looped over. Iterator = produces values one at a time.
- Generators are iterators and are single-pass — excellent for memory savings.
- Use iterators for streaming data and pipelines; use sequences when you need random access or repeated passes.
- itertools is your friend for advanced iteration patterns.
This is the moment where the concept finally clicks: iteration is not just how you write loops — it's how you think about data flow.
Quick checklist before you code
- Do I need multiple passes? If yes, avoid single-use generators or regenerate/store results.
- Is memory a concern? Favor iterators and generators.
- Do I rely on order? Use a sequence or explicitly sort.
Final tiny brain hack: when you write for x in y, mentally translate it to:
- it = iter(y)
- call next(it) repeatedly
- handle StopIteration
Once you see the loop as a stateful DJ playing one record at a time, you'll start writing pipelines that scale instead of scripts that crash.
Happy iterating! (And remember: the DJ controls the flow.)