Python for Data and AI
Practical Python skills and libraries essential for data manipulation and analysis.
Python Basics for Data & AI: The No-Chill On-Ramp
You’ve read papers, promised the privacy gods you won’t log anyone’s social security number, and even peeked at experiment tracking. Now it’s time to speak the language data and models actually understand: Python.
We’re pivoting from big-picture foundations to hands-on basics. This is the bridge between “I get the idea” and “my code did the thing.” By the end, you’ll be comfortable writing clean, reproducible Python that plays nice with datasets, models, and your future self at 2 a.m.
1) Where You’ll Actually Write Python (and Why It Matters)
You have options, and yes, they each have vibes:
- Notebooks (Jupyter/Colab): Fantastic for exploration, plotting, and storytelling with your experiments. Keep cells small. Track outputs. Great with experiment tracking.
- Scripts (.py files): Stable, reproducible, and automation-friendly. Ideal when you’ve figured things out and want to run it again (and again) with new parameters.
- REPL (python / ipython): Quick pokes and prods when you forgot the exact method name for that one pandas thing.
Set up a clean environment so your future you doesn’t scream:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install numpy pandas jupyter matplotlib
jupyter lab
Governance reminder: keep environments per project. It’s not just neat—it’s reproducibility, a.k.a. “ethically not lying to your future self.”
2) Data Types That Actually Matter
You’ll see these constantly in data and AI code. Learn them like the cast of your favorite series.
| Type | Example | Why You Care in AI/Data |
|---|---|---|
| int | `42` | Counts, indices, sizes, epochs |
| float | `3.14` | Losses, probabilities, metrics |
| bool | `True` / `False` | Filtering, masks, branching |
| str | `'cat'` | Column names, labels, text |
| None | `None` | Missing values, function defaults |
| list | `[1, 2, 3]` | Sequences, rows, batches |
| tuple | `(h, w)` | Immutable pairs, shapes |
| dict | `{'label': 'dog'}` | Records, configs, JSON-like |
| set | `{'cat', 'dog'}` | Unique values, fast membership |
Gotchas you’ll thank me for later:
- Float precision: `0.1 + 0.2 != 0.3` exactly. Use `math.isclose` for comparisons (demo below).
- None vs NaN: `None` is Python's empty chair; `NaN` is a special float from NumPy/pandas for missing numeric values. `NaN != NaN` (surprise!).
- Mutability: Lists and dicts are mutable; tuples and strings aren't. Mutability can sabotage reproducibility if you mutate function inputs mid-experiment.
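A quick demo of the first two gotchas, using only the standard library and NumPy:
import math
import numpy as np

print(0.1 + 0.2 == 0.3)              # False: binary floats are approximate
print(math.isclose(0.1 + 0.2, 0.3))  # True: compare with a tolerance

nan = float('nan')
print(nan == nan)     # False: NaN is never equal to anything, even itself
print(np.isnan(nan))  # True: use isnan (or pandas' isna) to detect it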
# Truthiness is a vibe
if []:  # empty list -> False
    print('Nope')  # never prints
if [0]:  # non-empty -> True, even if it contains 0
    print('This prints')
3) Control Flow and Comprehensions (a.k.a. Python’s Espresso Shot)
Start simple:
score = 0.83
if score > 0.9:
    verdict = "chef's kiss"
elif score > 0.75:
    verdict = 'promising'
else:
    verdict = 'back to the lab'
Loop like you mean it:
rows = [{'id': 1, 'label': 'cat'}, {'id': 2, 'label': 'dog'}]
for i, row in enumerate(rows, start=1):
print(f"Row {i}: id={row['id']} label={row['label']}")
Comprehensions for compact clarity:
labels = [row['label'] for row in rows] # ['cat', 'dog']
label_to_id = {row['label']: row['id'] for row in rows} # {'cat':1,'dog':2}
If your comprehension needs a map to understand, make it a loop. Readability > cleverness, especially when you’re debugging a midnight metric drop.
4) Functions, Purity, and Type Hints (Future-You Approved)
Functions should be small, predictable, and explicit. This helps with experiment tracking and reproducibility.
from typing import List

def clean_tokens(tokens: List[str], *, lowercase: bool = True, min_len: int = 2) -> List[str]:
    """Normalize and filter tokens.

    Args:
        tokens: Raw tokens.
        lowercase: Convert to lowercase.
        min_len: Minimum token length to keep.
    """
    result = []
    for t in tokens:
        x = t.lower() if lowercase else t
        if len(x) >= min_len:
            result.append(x)
    return result
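A quick usage check (the sample tokens are made up; note that `lowercase` and `min_len` are keyword-only because of the `*`):
tokens = ['The', 'CAT', 'sat', 'on', 'a', 'MAT']
print(clean_tokens(tokens))                              # ['the', 'cat', 'sat', 'on', 'mat']
print(clean_tokens(tokens, lowercase=False, min_len=3))  # ['The', 'CAT', 'sat', 'MAT']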
Why type hints? They don't change runtime behavior (static checkers like mypy enforce them separately), but they make your intent obvious and your IDE smarter.
Module structure that plays nice with scripts and notebooks:
# file: preprocess.py

def run(path: str) -> None:
    # do some work, maybe save artifacts
    ...

if __name__ == '__main__':
    # Only runs when executed as a script
    run('data/raw.csv')
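The payoff: importing the module from a notebook or another script does not trigger the run (a sketch, assuming preprocess.py sits on your import path):
# In a notebook or another module:
from preprocess import run  # importing does NOT execute run()
run('data/raw.csv')         # call it explicitly when you want the work done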
5) Files, Paths, and Your First Pandas Handshake
Use pathlib for OS-safe paths.
from pathlib import Path
import pandas as pd
DATA = Path('data')
df = pd.read_csv(DATA / 'train.csv')
print(df.head())
print(df.dtypes)
Large files? Don’t load the whole ocean—sip it in chunks.
total, count = 0.0, 0
for chunk in pd.read_csv(DATA / 'train.csv', chunksize=50_000):
    total += chunk['age'].sum()
    count += len(chunk)
print(total / count)  # weight by row count; averaging chunk means skews when the last chunk is smaller
Privacy ping: never casually print raw rows. Mask PII in logs and notebooks. The best data leak is the one that never happened.
6) Reproducibility Starter Pack: Seeds and Logging
You learned about experiment tracking; reproducibility starts in your Python file.
import os, random
import numpy as np

def set_seed(seed: int = 1337):
    random.seed(seed)
    np.random.seed(seed)
    # Note: PYTHONHASHSEED only affects hash randomization if it's set
    # before the interpreter starts; setting it here documents intent
    # and covers subprocesses you launch.
    os.environ['PYTHONHASHSEED'] = str(seed)

# If you use PyTorch or TensorFlow, set their seeds too.
try:
    import torch
    torch.manual_seed(1337)
    torch.cuda.manual_seed_all(1337)
    torch.use_deterministic_algorithms(True)
except ImportError:
    pass
Logging > print, especially when your experiment has phases and parameters.
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
)
user_id = 'abc123' # pretend PII
logging.info('Training started with seed=%d', 1337)
logging.info('Masking user: %s', user_id[:3] + '***') # mask it, don’t leak it
Governance isn’t an afterthought. It’s an if-statement away from avoiding an incident report.
7) Errors, Exceptions, and the Art of Not Panicking
Stack traces are love notes from Python. Read them.
def safe_divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError('b must be non-zero')
    return a / b

try:
    print(safe_divide(10, 0))
except ValueError as e:
    print('Handled:', e)
Tiny tests beat big regrets:
def test_safe_divide():
    assert safe_divide(10, 2) == 5
You can run this in a notebook or start using pytest later.
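When you do, pytest discovers and runs functions named `test_*` automatically (assuming the test lives in a file named like test_something.py):
pip install pytest
pytest -q  # runs test_safe_divide and any other test_* functions it finds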
8) Performance 101: Vectorize Before You Optimize
Loops are fine until they aren’t. NumPy and pandas vectorization uses fast C under the hood.
import numpy as np
x = np.random.randn(1_000_000)
# Loop (slow)
sum_loop = 0.0
for v in x:
    sum_loop += v * v
# Vectorized (fast)
sum_vec = np.sum(x * x)
In notebooks, you can benchmark with magic commands:
# %timeit sum([v*v for v in x])
# %timeit np.sum(x*x)
Premature optimization is chaos; premature non-optimization is pain. Profile, then act.
9) Configs and CLI Parameters (Because You’ll Run This Again)
Hard-coding is how you lose track of what you ran. Pass parameters.
# file: train.py
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=1e-3)
parser.add_argument('--epochs', type=int, default=10)
args = parser.parse_args()
print(f"Training with lr={args.lr}, epochs={args.epochs}")
This plays beautifully with experiment tracking: each run is a parameterized, logged event—not a mystery.
Quick Reality Check: Common Beginner Traps
- Shadowing built-ins: don't name a variable `list` or `dict`.
- Mutable defaults: `def f(x, cache={}):` is a booby trap. Use `None` and set inside (see the snippet after this list).
- Silent dtype issues: strings pretending to be numbers in pandas. Check `dtypes`.
- Copy vs view in pandas: `.loc[...]` is generally safer; watch for SettingWithCopy warnings.
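The mutable-default trap in action, and the fix:
def buggy(x, cache={}):    # ONE dict shared across every call
    cache[x] = True
    return cache

def fixed(x, cache=None):  # fresh dict per call unless one is passed in
    if cache is None:
        cache = {}
    cache[x] = True
    return cache

print(buggy('a'))  # {'a': True}
print(buggy('b'))  # {'a': True, 'b': True}  <- leftover state!
print(fixed('b'))  # {'b': True}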
Closing: Your First Mini Pipeline
Here's a simple, ethical, reproducible workflow you can try today (a sketch of the full script follows the list):

- Create a venv and install `numpy`, `pandas`, and `jupyter`.
- Write a script that:
  - Sets a seed and config via CLI.
  - Reads a CSV in chunks.
  - Computes a metric (mean, accuracy, whatever).
  - Logs parameters and results, masking any PII.
- Run it twice with different parameters and record both in your experiment tracker.
- Compare results like the scientist you are.
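One way that script could look, stitching together the pieces from this lesson (a minimal sketch: the path, the `age` column, and the metric are placeholders for your own data):
# file: pipeline.py -- illustrative; adapt paths and columns to your dataset
import argparse
import logging
import random

import numpy as np
import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
)

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument('--path', default='data/train.csv')
    parser.add_argument('--seed', type=int, default=1337)
    parser.add_argument('--chunksize', type=int, default=50_000)
    args = parser.parse_args()

    set_seed(args.seed)
    logging.info('Run started: seed=%d chunksize=%d', args.seed, args.chunksize)

    # Stream the CSV in chunks and accumulate a row-weighted mean
    total, count = 0.0, 0
    for chunk in pd.read_csv(args.path, chunksize=args.chunksize):
        total += chunk['age'].sum()  # aggregate only -- no raw rows in the logs
        count += len(chunk)

    logging.info('mean_age=%.3f over %d rows', total / count, count)

if __name__ == '__main__':
    main()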
The move from reading papers to writing code is where theory meets receipts. Python is how you get them.
Key Takeaways
- Python basics—types, control flow, functions—are not optional; they are the skeleton of every model you’ll train.
- Reproducibility is a habit: seeds, logging, configs, and environments.
- Pandas and NumPy will carry you far; use vectorization when speed matters.
- Privacy is a constraint and a design feature. Mask data, minimize logs, follow governance.
Next up: we’ll start wielding Python’s data libraries like a pro—cleaning, transforming, and feature engineering without crying into your CSVs.