Python Foundations for Data Work
Master core Python syntax and tooling for data tasks, from environments and notebooks to clean, reliable scripts.
Functions and Docstrings — Your Python Superpowers for Data Work
"Functions are the recipes; docstrings are the little sticky notes that save you from burning the soufflé."
You already know how to make decisions in code (remember: booleans and logic, and the glorious if/else branching from Conditionals and Control Flow). Functions are the next level: reusable bundles of behavior that stop you from copy-pasting the same logic into twelve different cells and then crying when a bug appears.
In this lesson we'll cover: what functions are, why they matter for data work, function anatomy, parameter types (including *args and **kwargs), scope and side effects, lambdas and higher-order functions, and — crucially — how to write docstrings that actually help future-you (and your teammates).
Why functions matter for data work
- Reusability: Clean, tested behavior you can call everywhere (e.g., data cleaning steps).
- Readability: A well-named function turns a block of code into a readable sentence.
- Testing: Small functions are easy to unit test.
- Composition: Combine functions like building blocks for pipelines.
Imagine you're preparing a dataset for ML. Instead of repeating the same normalization/cleaning code in multiple notebooks, wrap it in a function and call it from every experiment. Your future self will thank you; your past self who didn't write tests will get roasted by your teammates.
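As a concrete sketch of such a reusable cleaning step (the column contents and cleaning rules here are made up for illustration):

```python
def clean_prices(raw):
    """Turn raw price strings into floats, skipping missing entries.

    (Illustrative cleaning step -- write it once, call it from every notebook.)
    """
    cleaned = []
    for item in raw:
        if item is None or item == "":
            continue  # skip missing values
        # strip currency symbols and thousands separators, then convert
        cleaned.append(float(str(item).replace("$", "").replace(",", "")))
    return cleaned

clean_prices(["$1,200", None, "300", ""])  # [1200.0, 300.0]
```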
Anatomy of a function (quick tour)
Micro explanation: basic structure
```python
def greet(name):
    """Return a friendly greeting string for name."""
    return f"Hi, {name}!"

print(greet('Ada'))  # Hi, Ada!
```
- `def` introduces the function.
- `greet` is the function name (use verbs for actions: `calculate_mean`, `filter_outliers`).
- `(name)` is the parameter list.
- The triple-quoted string inside is the docstring, the built-in help for the function.
- `return` sends a value back to the caller.
Docstrings: the non-negotiable sticky note
A docstring should answer: What does this do? What are the inputs? What does it return? Any side effects or exceptions?
Good minimal docstring style (one-line + optional details):
```python
def mean(values):
    """Compute the arithmetic mean of a sequence of numbers.

    Args:
        values (Sequence[float]): Iterable of numbers.

    Returns:
        float: The mean value.
    """
    return sum(values) / len(values)
```
Use styles your team prefers — NumPy style, Google style, or reStructuredText for Sphinx. The key: be consistent.
You can access it with help(mean) or mean.__doc__.
Parameters — the flavors
Positional and keyword arguments
```python
def scale(x, factor=1.0):
    """Scale x by factor (default 1.0)."""
    return x * factor

scale(5, 2)         # positional
scale(5, factor=2)  # keyword
```
Default argument gotcha (mutable defaults!)
```python
def add_tag(record, tags=[]):
    tags.append('new')
    return tags

# The default list is created once and shared across calls:
add_tag({})  # ['new']
add_tag({})  # ['new', 'new']  <-- oops!
```
Fix with None sentinel:
```python
def add_tag(record, tags=None):
    if tags is None:
        tags = []
    tags.append('new')
    return tags
```
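To confirm the sentinel fix really gives each call its own list:

```python
def add_tag(record, tags=None):
    if tags is None:
        tags = []  # a fresh list is created on every call
    tags.append('new')
    return tags

print(add_tag({}))  # ['new']
print(add_tag({}))  # ['new'] -- no state shared between calls
```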
*args and **kwargs
Use these when you don't know how many args might be passed (common in wrappers):
```python
def concat(*arrays, axis=0):
    # arrays is a tuple of all positional arguments
    pass

def plot(series, **plot_kwargs):
    # plot_kwargs is a dict of extra keyword arguments,
    # forwarded to the plotting library
    pass
```
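Here's a minimal sketch of the wrapper pattern mentioned above: a hypothetical `logged` decorator that forwards any signature untouched via `*args`/`**kwargs`:

```python
def logged(func):
    """Wrap func, forwarding positional and keyword arguments unchanged."""
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args} {kwargs}")
        return func(*args, **kwargs)
    return wrapper

@logged
def scale(x, factor=1.0):
    return x * factor

scale(5, factor=2)  # prints the call details, then returns 10
```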
Scope, side effects, and pure functions
- Local scope: Variables inside a function don't touch the outside unless returned.
- Global variables: Can be read, but modifying them requires `global`, which is usually a code smell.
- Side effects: Printing, writing files, mutating inputs. OK when intentional, bad when hidden.
Prefer pure functions (no side effects, consistent outputs for same inputs) for testing and reasoning. But in data work you often need side effects (writing CSVs). Just keep them explicit.
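A quick sketch of the distinction (the function names here are illustrative):

```python
import csv

def center(values):
    """Pure: same input always gives the same output; nothing is mutated."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def write_rows(rows, path):
    """Explicit side effect: the name and the path parameter make the
    file write obvious to the caller."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```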
Example using previous lessons: a predicate function with booleans
```python
def is_outlier(x, lower, upper):
    """Return True if x is outside [lower, upper]. Uses boolean logic."""
    return (x < lower) or (x > upper)

# Used in a comprehension (control-flow knowledge applies when inspecting values)
values = [1, 20, 3, 100]
filtered = [v for v in values if not is_outlier(v, 0, 50)]  # [1, 20, 3]
```
Lambdas and higher-order functions
- Lambdas: tiny anonymous functions, use sparingly for simple transforms.
```python
squared = lambda x: x * x
list(map(lambda x: x * x, [1, 2, 3]))  # [1, 4, 9]
```
- Higher-order functions: functions that accept or return functions. Useful for pipelines.
```python
def make_multiplier(factor):
    def multiply(x):
        return x * factor
    return multiply

double = make_multiplier(2)
double(5)  # 10
```
Map/filter/reduce or comprehensions are your friends for readable data transformations.
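For instance, the same transformation written both ways (`reduce` lives in the stdlib `functools` module):

```python
from functools import reduce

values = [1, 20, 3, 100]

# Comprehension: usually the most readable option
doubled = [v * 2 for v in values if v < 50]  # [2, 40, 6]

# Equivalent map/filter pipeline
doubled_too = list(map(lambda v: v * 2, filter(lambda v: v < 50, values)))

# reduce folds a sequence down to a single value
total = reduce(lambda acc, v: acc + v, doubled, 0)  # 48
```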
Docstring conventions that actually help
A practical template (Google style):
```text
Short one-line summary.

Args:
    param1 (type): Description.
    param2 (type, optional): Description. Defaults to something.

Returns:
    type: What is returned.

Raises:
    ErrorType: When something goes wrong.
```
Tip: include examples. Many will copy-paste your example and expect it to work.
```python
def normalize(col):
    """Normalize a numeric column to mean 0 and sd 1.

    Args:
        col (Sequence[float]): Input values.

    Returns:
        list[float]: Normalized values.

    Example:
        >>> normalize([1, 2, 3])
        [-1.0, 0.0, 1.0]
    """
    pass  # implementation omitted
```
For libraries, follow NumPy/SciPy docstring conventions to integrate with Sphinx.
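A nice payoff of example-bearing docstrings: the stdlib `doctest` module can run those `>>>` examples as tests, so your documentation can't silently rot.

```python
import doctest

def mean(values):
    """Compute the arithmetic mean.

    >>> mean([1, 2, 3])
    2.0
    """
    return sum(values) / len(values)

# Runs every >>> example found in this module's docstrings
doctest.testmod()
```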
Quick checklist before you commit a function
- Name: clear, verb-based (e.g., `compute_rmse`).
- Docstring: one-line summary + args + returns + example.
- Side effects: explicit or none.
- Tests: small unit tests for edge cases (empty input, NaNs).
- Avoid mutable defaults.
- Keep functions short (single responsibility).
Key takeaways
- Functions package behavior for reuse, readability, and testing — essential in data work.
- Docstrings are the user manual for your function. One-liners are fine, but examples + parameter/return descriptions make your life easier.
- Watch out for mutable defaults and hidden side effects.
- Use *args/**kwargs, lambdas, and higher-order functions when they simplify your pipeline; don't overuse them.
"This is the moment where the concept finally clicks." — you, after writing one clean reusable function that saves you hours across experiments.
Want a quick exercise? Create a function clean_and_summarize(df) that (1) drops NaNs, (2) casts a date column to datetime, (3) computes column means, and (4) includes a helpful docstring and an example. Use small, testable helper functions where it makes sense.
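If you want a starting point before reaching for pandas, here is one possible stdlib-only skeleton: a list of dicts stands in for the DataFrame, and the column names `'date'` and `'x'` are assumptions for the example.

```python
from datetime import datetime

def drop_missing(rows, columns):
    """Keep only rows where every listed column is present and non-None."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]

def parse_dates(rows, date_col, fmt="%Y-%m-%d"):
    """Return new rows with date_col cast to datetime (ISO format assumed)."""
    return [{**r, date_col: datetime.strptime(r[date_col], fmt)} for r in rows]

def column_means(rows, numeric_cols):
    """Mean of each numeric column across all rows."""
    return {c: sum(r[c] for r in rows) / len(rows) for c in numeric_cols}

def clean_and_summarize(rows):
    """Drop rows with missing values, parse 'date', and return column means.

    Example:
        >>> clean_and_summarize([{'date': '2024-01-01', 'x': 1.0},
        ...                      {'date': '2024-01-02', 'x': 3.0}])
        {'x': 2.0}
    """
    rows = drop_missing(rows, ['date', 'x'])
    rows = parse_dates(rows, 'date')
    return column_means(rows, ['x'])
```

Each helper is small, pure, and testable on its own, which is exactly the point of the exercise.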
Go forth and modularize. Your notebooks (and teammates) will breathe easier.