
Python for Data Science, AI & Development
Data Structures and Iteration


Use Python collections and iteration patterns to write expressive, efficient, and readable data-oriented code.


Generators and yield in Python: Lazy Iteration for Data


Generators and yield — the lazy engines of Python

"Remember iterables and iterators? Generators are their cooler, low-memory cousin who drinks espresso and only produces things when you ask."

You're already familiar with iterables and iterators (we covered that just before this) and you've seen how slicing and views give cheap windows into data. Now meet generators: the one-pass, lazy sequences that let you process huge data without turning your machine into a sad, swapped-out potato.


What are generators, and why they matter for data work

  • Generator: a callable that yields values one at a time, pausing its state between values. It is an iterator (it implements the iterator protocol: __iter__ and __next__), but it's created far more concisely.
  • Why care? For data science and AI workflows we often handle streams: massive CSVs, log files, simulation outputs. Generators let you process these streams without loading everything into memory.

Think of a generator as a water tap: you get water (values) when you open it; you don't have to fill the bathtub (list) first.
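To see that a generator really is an iterator, here's a minimal sketch (countdown is just an illustrative name):

```python
# A generator is itself an iterator: iter() returns the same object,
# and next() drives the iterator protocol directly.
def countdown(n):
    while n > 0:
        yield n
        n -= 1

g = countdown(3)
print(iter(g) is g)   # True: generators implement __iter__ by returning self
print(next(g))        # 3
print(next(g))        # 2
print(list(g))        # [1] -- the remaining values
```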


Generator function vs generator expression

Generator function (with yield)

Use when you need multiple lines, state, or complex logic.

def gen_fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

# Usage
for x in gen_fibonacci(5):
    print(x)  # 0 1 1 2 3

Generator expression (like a list comprehension but lazy)

Great for simple transformations.

squares = (x*x for x in range(10))
print(next(squares))  # 0

Micro explanation: Both produce iterators. The generator expression is syntactic sugar; the function with yield gives you full control.


yield vs return — the simple but crucial difference

  • return exits a function and hands back a value once.
  • yield pauses the function, returns a value, and keeps the local state so the function can resume later.

"yield is 'pause and remember', return is 'bye forever'."

This is why generators can produce an unbounded series of values while keeping only a small working set in memory.
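A tiny sketch of that "unbounded series" idea (count_from is an illustrative name, not a library function):

```python
# An unbounded counter: return could never do this, because the function
# would exit after the first value. yield pauses and resumes instead.
def count_from(start):
    n = start
    while True:        # runs forever -- values are only produced on demand
        yield n
        n += 1

c = count_from(10)
print(next(c))  # 10
print(next(c))  # 11 -- the local variable n survived between calls
```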


Key properties & things to remember

  • Generators are lazy: values are produced only when requested (via next() or a for-loop).
  • They are single-pass: once exhausted, you must recreate them to iterate again.
  • They usually don't support slicing or indexing (no len(), no __getitem__). Use itertools.islice if you need slicing-like behavior.
  • They keep state between yields (local variables remain intact).

Example of exhaustion:

g = (i for i in range(3))
print(list(g))  # [0, 1, 2]
print(list(g))  # []  -- exhausted

Use-case tip: if you need to peek at or reuse the data, convert it to a list (costly) or structure your code so the generator can be recreated from its source.
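One common way to make "recreate it" painless is a small factory function; a sketch (squares is an illustrative name):

```python
# Wrapping creation in a function gives you a fresh, unexhausted
# generator every time you need another pass over the data.
def squares():
    return (x * x for x in range(4))

print(list(squares()))  # [0, 1, 4, 9]
print(list(squares()))  # [0, 1, 4, 9] -- a brand-new generator, not the old one
```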


Real-world data examples (because theory without coffee is sad)

  1. Streaming a large CSV without loading all rows:
def read_large_csv(path):
    with open(path, 'r') as f:
        for line in f:
            yield line.rstrip('\n').split(',')

for row in read_large_csv('big.csv'):
    process(row)  # rows are handled one-by-one
  2. Cleaning pipeline using chained generators (memory-friendly):
def read_lines(path):
    with open(path) as f:
        yield from f

def filter_comments(lines):
    for ln in lines:
        if not ln.startswith('#'):
            yield ln

def parse_csv(lines):
    for ln in lines:
        yield ln.strip().split(',')

lines = read_lines('big.csv')
rows = parse_csv(filter_comments(lines))
for row in rows:
    analyze(row)

This pipeline uses tiny amounts of memory and composes cleanly. Contrast with reading the whole file into a list and transforming it — that eats RAM.


Helpful stdlib allies

  • itertools — tools to slice, chain, group, and more (e.g., islice, chain, tee, groupby).
  • heapq — heapq.merge lazily merges several sorted generator outputs into one sorted stream.
  • pandas.read_csv(..., chunksize=...) — returns an iterator of DataFrame chunks (under the same lazy philosophy).

Example: getting the first 10 items of a generator:

from itertools import islice
first10 = list(islice(my_generator(), 10))

Note: itertools.tee can create independent iterators from one generator, but it buffers data internally — so it may increase memory usage.
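As a quick sketch of the heapq ally: merging two already-sorted streams lazily, without materializing either one (the two generator expressions here are just toy inputs):

```python
import heapq

# Two already-sorted streams, merged lazily into one sorted stream.
evens = (n for n in range(0, 10, 2))   # 0 2 4 6 8
odds = (n for n in range(1, 10, 2))    # 1 3 5 7 9

merged = heapq.merge(evens, odds)      # itself lazy: nothing computed yet
print(list(merged))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```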


Advanced-ish features (briefly) — send, throw, close

Generators are more than simple value streams:

  • .send(value) can resume the generator and inject data into it (useful for coroutines).
  • .throw(exc) throws an exception at the yield point.
  • .close() stops it.

These are powerful (and a little magical). For typical data processing you rarely need send/throw; they're more common in asynchronous coroutine patterns.
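For the curious, a minimal sketch of .send() in action: a running-average coroutine (running_average is an illustrative name, not a stdlib function):

```python
# A running-average coroutine: .send() both resumes the generator
# and delivers a value into the paused `yield` expression.
def running_average():
    total, count = 0.0, 0
    avg = None
    while True:
        value = yield avg      # pause; receive the next number via .send()
        total += value
        count += 1
        avg = total / count

avg = running_average()
next(avg)              # "prime" the coroutine: advance to the first yield
print(avg.send(10))    # 10.0
print(avg.send(20))    # 15.0
avg.close()            # shut it down cleanly
```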


Comparing generators with lists and views (ties to previous topics)

  • Lists: random access, len, memory-hungry. Use when you need to index or iterate multiple times.
  • Views / slicing (like numpy / pandas views): cheap windows into existing memory. Good for subsetting without copying.
  • Generators: single-pass, lazy, minimal memory. Good for streaming or pipelines.

So: if slicing/views let you avoid copying big arrays, generators avoid loading them at all.
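You can see the memory difference directly with sys.getsizeof (exact byte counts vary by Python version and platform, so treat the comments as rough orders of magnitude):

```python
import sys

n = 1_000_000
as_list = [i for i in range(n)]   # materializes every element up front
as_gen = (i for i in range(n))    # just a tiny stateful object

print(sys.getsizeof(as_list))  # several MB for the list object alone
print(sys.getsizeof(as_gen))   # a couple hundred bytes, regardless of n
```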


Common pitfalls and how to avoid them

  • Accidentally exhausting a generator by converting to list early.
    • Fix: only convert when necessary, or recreate generator when needed again.
  • Assuming len() or indexing works. It doesn't.
    • Use itertools.islice, or collect into a list if you absolutely need slicing.
  • Using itertools.tee indiscriminately — it buffers and can use lots of memory.

Quick takeaways — the bits to tattoo on your brain

  • Generators = lazy, single-pass iterators. Use them to process large or infinite sequences without big memory bills.
  • yield pauses and saves state — unlike return.
  • Combine generator functions and generator expressions for readable, memory-efficient pipelines.
  • When you need random access, length, or multiple passes, use lists or views instead.

"If your data doesn't fit in memory, generators let you keep working. If it does, they still make your code elegant — and your machine less sad."


Next steps (practice)

  1. Convert a list-processing script into a generator pipeline.
  2. Use itertools.islice to sample the first N items from a generator.
  3. Try reading a 1GB CSV via a generator and calculating column stats without pandas (then compare performance).

If you enjoyed this, the natural next step is exploring coroutine patterns and async generators — powerful for streaming data and networked ML pipelines.
