Python for Data and AI
Practical Python skills and libraries essential for data manipulation and analysis.
Python Basics for Data & AI: The No-Chill On-Ramp
You’ve read papers, promised the privacy gods you won’t log anyone’s social security number, and even peeked at experiment tracking. Now it’s time to speak the language data and models actually understand: Python.
We’re pivoting from big-picture foundations to hands-on basics. This is the bridge between “I get the idea” and “my code did the thing.” By the end, you’ll be comfortable writing clean, reproducible Python that plays nice with datasets, models, and your future self at 2 a.m.
1) Where You’ll Actually Write Python (and Why It Matters)
You have options, and yes, they each have vibes:
- Notebooks (Jupyter/Colab): Fantastic for exploration, plotting, and storytelling with your experiments. Keep cells small. Track outputs. Great with experiment tracking.
- Scripts (.py files): Stable, reproducible, and automation-friendly. Ideal when you’ve figured things out and want to run it again (and again) with new parameters.
- REPL (python / ipython): Quick pokes and prods when you forgot the exact method name for that one pandas thing.
Set up a clean environment so your future you doesn’t scream:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install numpy pandas jupyter matplotlib
jupyter lab
Governance reminder: keep environments per project. It’s not just neat—it’s reproducibility, a.k.a. “ethically not lying to your future self.”
2) Data Types That Actually Matter
You’ll see these constantly in data and AI code. Learn them like the cast of your favorite series.
| Type | Example | Why You Care in AI/Data |
|---|---|---|
| int | `42` | Counts, indices, sizes, epochs |
| float | `3.14` | Losses, probabilities, metrics |
| bool | `True` / `False` | Filtering, masks, branching |
| str | `'cat'` | Column names, labels, text |
| None | `None` | Missing values, function defaults |
| list | `[1, 2, 3]` | Sequences, rows, batches |
| tuple | `(h, w)` | Immutable pairs, shapes |
| dict | `{'label': 'dog'}` | Records, configs, JSON-like |
| set | `{'cat', 'dog'}` | Unique values, fast membership |
Gotchas you’ll thank me for later:
- Float precision: `0.1 + 0.2 != 0.3` exactly. Use `math.isclose` for comparisons (demo below).
- None vs NaN: `None` is Python's empty chair; `NaN` is a special float from NumPy/pandas for missing numeric values. `NaN != NaN` (surprise!).
- Mutability: Lists and dicts are mutable; tuples and strings aren't. Mutability can sabotage reproducibility if you mutate function inputs mid-experiment.
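A quick demo of the first two gotchas, using only the standard library and NumPy:
import math
import numpy as np

print(0.1 + 0.2 == 0.3)              # False: binary floats are approximate
print(math.isclose(0.1 + 0.2, 0.3))  # True: compare with a tolerance

nan = float('nan')
print(nan == nan)     # False: NaN is never equal to anything, even itself
print(np.isnan(nan))  # True: use isnan (or pandas' isna) to detect it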
# Truthiness is a vibe
if []:  # empty list -> False
    print('Nope')  # never prints
if [0]:  # non-empty -> True, even if it contains 0
    print('This prints')
3) Control Flow and Comprehensions (a.k.a. Python’s Espresso Shot)
Start simple:
score = 0.83
if score > 0.9:
    verdict = "chef's kiss"
elif score > 0.75:
    verdict = 'promising'
else:
    verdict = 'back to the lab'
Loop like you mean it:
rows = [{'id': 1, 'label': 'cat'}, {'id': 2, 'label': 'dog'}]
for i, row in enumerate(rows, start=1):
print(f"Row {i}: id={row['id']} label={row['label']}")
Comprehensions for compact clarity:
labels = [row['label'] for row in rows] # ['cat', 'dog']
label_to_id = {row['label']: row['id'] for row in rows} # {'cat':1,'dog':2}
If your comprehension needs a map to understand, make it a loop. Readability > cleverness, especially when you’re debugging a midnight metric drop.
4) Functions, Purity, and Type Hints (Future-You Approved)
Functions should be small, predictable, and explicit. This helps with experiment tracking and reproducibility.
from typing import List

def clean_tokens(tokens: List[str], *, lowercase: bool = True, min_len: int = 2) -> List[str]:
    """Normalize and filter tokens.

    Args:
        tokens: Raw tokens.
        lowercase: Convert to lowercase.
        min_len: Minimum token length to keep.
    """
    result = []
    for t in tokens:
        x = t.lower() if lowercase else t
        if len(x) >= min_len:
            result.append(x)
    return result
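A quick usage check (the sample tokens are made up; note that `lowercase` and `min_len` are keyword-only because of the `*`):
tokens = ['The', 'CAT', 'sat', 'on', 'a', 'MAT']
print(clean_tokens(tokens))                              # ['the', 'cat', 'sat', 'on', 'mat']
print(clean_tokens(tokens, lowercase=False, min_len=3))  # ['The', 'CAT', 'sat', 'MAT']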
Why type hints? They don't change runtime behavior (static checkers like mypy enforce them separately), but they make your intent obvious and your IDE smarter.
Module structure that plays nice with scripts and notebooks:
# file: preprocess.py

def run(path: str) -> None:
    # do some work, maybe save artifacts
    ...

if __name__ == '__main__':
    # Only runs when executed as a script
    run('data/raw.csv')
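The payoff: importing the module from a notebook or another script does not trigger the run (a sketch, assuming preprocess.py sits on your import path):
# In a notebook or another module:
from preprocess import run  # importing does NOT execute run()
run('data/raw.csv')         # call it explicitly when you want the work done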
5) Files, Paths, and Your First Pandas Handshake
Use pathlib for OS-safe paths.
from pathlib import Path
import pandas as pd
DATA = Path('data')
df = pd.read_csv(DATA / 'train.csv')
print(df.head())
print(df.dtypes)
Large files? Don’t load the whole ocean—sip it in chunks.
total, count = 0.0, 0
for chunk in pd.read_csv(DATA / 'train.csv', chunksize=50_000):
    total += chunk['age'].sum()
    count += len(chunk)
print(total / count)  # weight by row count; averaging chunk means skews when the last chunk is smaller
Privacy ping: never casually print raw rows. Mask PII in logs and notebooks. The best data leak is the one that never happened.
6) Reproducibility Starter Pack: Seeds and Logging
You learned about experiment tracking; reproducibility starts in your Python file.
import os, random
import numpy as np

def set_seed(seed: int = 1337):
    random.seed(seed)
    np.random.seed(seed)
    # Note: PYTHONHASHSEED only affects hash randomization if it's set
    # before the interpreter starts; setting it here documents intent
    # and covers subprocesses you launch.
    os.environ['PYTHONHASHSEED'] = str(seed)

# If you use PyTorch or TensorFlow, set their seeds too.
try:
    import torch
    torch.manual_seed(1337)
    torch.cuda.manual_seed_all(1337)
    torch.use_deterministic_algorithms(True)
except ImportError:
    pass
Logging > print, especially when your experiment has phases and parameters.
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
)
user_id = 'abc123' # pretend PII
logging.info('Training started with seed=%d', 1337)
logging.info('Masking user: %s', user_id[:3] + '***') # mask it, don’t leak it
Governance isn’t an afterthought. It’s an if-statement away from avoiding an incident report.
7) Errors, Exceptions, and the Art of Not Panicking
Stack traces are love notes from Python. Read them.
def safe_divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError('b must be non-zero')
    return a / b

try:
    print(safe_divide(10, 0))
except ValueError as e:
    print('Handled:', e)
Tiny tests beat big regrets:
def test_safe_divide():
    assert safe_divide(10, 2) == 5
You can run this in a notebook or start using pytest later.
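When you do, pytest discovers and runs functions named `test_*` automatically (assuming the test lives in a file named like test_something.py):
pip install pytest
pytest -q  # runs test_safe_divide and any other test_* functions it finds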
8) Performance 101: Vectorize Before You Optimize
Loops are fine until they aren’t. NumPy and pandas vectorization uses fast C under the hood.
import numpy as np
x = np.random.randn(1_000_000)
# Loop (slow)
sum_loop = 0.0
for v in x:
    sum_loop += v * v
# Vectorized (fast)
sum_vec = np.sum(x * x)
In notebooks, you can benchmark with magic commands:
# %timeit sum([v*v for v in x])
# %timeit np.sum(x*x)
Premature optimization is chaos; premature non-optimization is pain. Profile, then act.
9) Configs and CLI Parameters (Because You’ll Run This Again)
Hard-coding is how you lose track of what you ran. Pass parameters.
# file: train.py
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=1e-3)
parser.add_argument('--epochs', type=int, default=10)
args = parser.parse_args()
print(f"Training with lr={args.lr}, epochs={args.epochs}")
This plays beautifully with experiment tracking: each run is a parameterized, logged event—not a mystery.
Quick Reality Check: Common Beginner Traps
- Shadowing built-ins: don't name a variable `list` or `dict`.
- Mutable defaults: `def f(x, cache={}):` is a booby trap. Use `None` and set inside (see the snippet after this list).
- Silent dtype issues: strings pretending to be numbers in pandas. Check `dtypes`.
- Copy vs view in pandas: `.loc[...]` is generally safer; watch for SettingWithCopy warnings.
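The mutable-default trap in action, and the fix:
def buggy(x, cache={}):    # ONE dict shared across every call
    cache[x] = True
    return cache

def fixed(x, cache=None):  # fresh dict per call unless one is passed in
    if cache is None:
        cache = {}
    cache[x] = True
    return cache

print(buggy('a'))  # {'a': True}
print(buggy('b'))  # {'a': True, 'b': True}  <- leftover state!
print(fixed('b'))  # {'b': True}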
Closing: Your First Mini Pipeline
Here's a simple, ethical, reproducible workflow you can try today (a sketch of the full script follows the list):

- Create a venv and install `numpy`, `pandas`, and `jupyter`.
- Write a script that:
  - Sets a seed and config via CLI.
  - Reads a CSV in chunks.
  - Computes a metric (mean, accuracy, whatever).
  - Logs parameters and results, masking any PII.
- Run it twice with different parameters and record both in your experiment tracker.
- Compare results like the scientist you are.
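One way that script could look, stitching together the pieces from this lesson (a minimal sketch: the path, the `age` column, and the metric are placeholders for your own data):
# file: pipeline.py -- illustrative; adapt paths and columns to your dataset
import argparse
import logging
import random

import numpy as np
import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
)

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument('--path', default='data/train.csv')
    parser.add_argument('--seed', type=int, default=1337)
    parser.add_argument('--chunksize', type=int, default=50_000)
    args = parser.parse_args()

    set_seed(args.seed)
    logging.info('Run started: seed=%d chunksize=%d', args.seed, args.chunksize)

    # Stream the CSV in chunks and accumulate a row-weighted mean
    total, count = 0.0, 0
    for chunk in pd.read_csv(args.path, chunksize=args.chunksize):
        total += chunk['age'].sum()  # aggregate only -- no raw rows in the logs
        count += len(chunk)

    logging.info('mean_age=%.3f over %d rows', total / count, count)

if __name__ == '__main__':
    main()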
The move from reading papers to writing code is where theory meets receipts. Python is how you get them.
Key Takeaways
- Python basics—types, control flow, functions—are not optional; they are the skeleton of every model you’ll train.
- Reproducibility is a habit: seeds, logging, configs, and environments.
- Pandas and NumPy will carry you far; use vectorization when speed matters.
- Privacy is a constraint and a design feature. Mask data, minimize logs, follow governance.
Next up: we’ll start wielding Python’s data libraries like a pro—cleaning, transforming, and feature engineering without crying into your CSVs.