Deep Learning Foundations
Understand neural networks and train models with PyTorch, from CNNs to transformers and deployment.
Training Loops and Optimizers — The Real Heartbeat of Deep Learning
"If tensors are the clay and models are the sculptor, the training loop is the gym membership where the sculptor goes to lift weights — literally."
You're coming in hot from PyTorch Tensors and Building Models in PyTorch, so we'll skip the "what is a tensor" scene and go straight to the action sequence: how to train the model. You already know how to define layers and move tensors to GPU — now we make those parameters stop sucking and start predicting.
Why this matters (and how it connects to scikit-learn pipelines)
- In scikit-learn land you learned reproducible pipelines: clean data → transform → model → evaluate. That pipeline pattern still matters with deep learning. You just swap the estimator for a PyTorch model and add a training loop in between.
- The training loop + optimizer is where learning actually happens. Think of it as: DataLoader feeds batches → forward pass computes predictions → loss says how wrong we are → backward pass computes gradients → optimizer updates weights.
This is the progression: PyTorch Tensors → Model architecture → Training loop + Optimizer → Evaluation & Checkpointing.
Quick mental map (so you don't get lost at 3 AM)
- Put model in train() mode.
- Loop over batches from DataLoader.
- Zero gradients (optimizer.zero_grad()).
- Forward pass to get outputs and loss.
- backward() to compute grads.
- optimizer.step() to update weights.
- Track metrics and occasionally validate / checkpoint.
Micro explanation: Why zero gradients?
Gradients accumulate by default in PyTorch. If you don't zero them, every backward call will add to previous grads — like trying to do math by continually writing over last night's homework without erasing. Not what you want.
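A minimal sketch of that accumulation behavior (assuming PyTorch is installed):

```python
import torch

# One scalar parameter with gradient tracking.
x = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(2x)/dx = 2.
(2 * x).backward()
print(x.grad)  # tensor(2.)

# Second backward WITHOUT zeroing: gradients add up, 2 + 2 = 4.
(2 * x).backward()
print(x.grad)  # tensor(4.)

# Zero first, and the next backward starts fresh.
x.grad.zero_()
(2 * x).backward()
print(x.grad)  # tensor(2.)
```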
A compact training loop example (PyTorch)
# assume: model, loss_fn, optimizer, train_loader, val_loader, device
model.to(device)
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()              # clear old gradients
        outputs = model(X_batch)           # forward pass
        loss = loss_fn(outputs, y_batch)   # compute loss
        loss.backward()                    # compute gradients
        optimizer.step()                   # apply gradients
        train_loss += loss.item() * X_batch.size(0)
    train_loss /= len(train_loader.dataset)

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X_val, y_val in val_loader:
            X_val, y_val = X_val.to(device), y_val.to(device)
            pred = model(X_val)
            loss = loss_fn(pred, y_val)
            val_loss += loss.item() * X_val.size(0)
    val_loss /= len(val_loader.dataset)

    print(f"Epoch {epoch}: train={train_loss:.4f}, val={val_loss:.4f}")
Notes on the example
- model.train() and model.eval() toggle dropout, batchnorm, etc. It's not just semantics.
- torch.no_grad() prevents gradient storage during validation, saving memory and time.
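A quick way to see the train/eval switch in action is Dropout (a sketch, assuming PyTorch):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: ~half the values zeroed, rest scaled by 1/(1-p) = 2
print(drop(x))  # random mix of 0.0 and 2.0

drop.eval()     # eval mode: dropout becomes the identity
print(drop(x))  # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```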
Optimizers: the difference between pushing and coaxing
Optimizers decide how to move parameters using gradients. Here are the main players and quick intuition:
- SGD (Stochastic Gradient Descent): Basic. Steps in the negative gradient direction. Add momentum to mimic inertia; it helps smooth out noisy updates.
- Adam: Adaptive learning-rate per parameter. Fast convergence on many problems. Uses running estimates of first and second moments (mean and uncentered variance) of gradients.
- RMSprop, AdamW: Variants dealing with adaptive steps and proper weight decay handling.
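Under the hood, SGD with momentum is just a couple of lines. Here is a plain-Python sketch of the update rule, applied to f(w) = w² (gradient 2w); the function name and `velocity` variable are illustrative:

```python
def sgd_momentum_step(w, velocity, grad, lr=0.1, momentum=0.9):
    """One SGD+momentum update (PyTorch-style: v = mu*v + g, then w -= lr*v)."""
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Minimize f(w) = w^2, starting from w = 5.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
print(round(w, 4))  # close to the minimum at 0
```

Note how the iterate overshoots and oscillates before settling: that is the "inertia" from momentum in action.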
When to use what
- For quick experiments: Adam is a safe bet.
- For final training on large-scale problems: SGD with momentum + learning-rate schedule often gives better generalization.
- Regularization: prefer decoupled weight decay (AdamW) over plain Adam's L2-style penalty, which interacts poorly with adaptive learning rates.
Example optimizer setup
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# or
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
Advanced but practical techniques
- LR Schedulers: learning rate is the single most important hyperparameter. Use schedulers (StepLR, CosineAnnealingLR, ReduceLROnPlateau) to decay the lr during training.
- Gradient clipping: prevents exploding gradients in RNNs or very deep nets: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm).
- Mixed precision (AMP): faster training using torch.cuda.amp for reduced memory and higher throughput.
- Parameter groups: set different lrs or weight decay for different parts of the model (e.g., lower lr for pretrained backbone).
Example scheduler usage:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(epochs):
    train_one_epoch(...)
    scheduler.step()
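Gradient clipping from the list above can be verified directly: clip_grad_norm_ rescales gradients in place so their total norm does not exceed max_norm (a sketch, assuming PyTorch):

```python
import torch

p = torch.nn.Parameter(torch.zeros(3))
p.grad = torch.tensor([3.0, 4.0, 0.0])   # gradient norm = 5

torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(p.grad)         # rescaled to [0.6, 0.8, 0.0]
print(p.grad.norm())  # norm is now (approximately) 1
```

In a real loop, call it between loss.backward() and optimizer.step().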
Reproducibility & checkpoints — the boring but heroic part
- Save and load model + optimizer state_dicts. This keeps your weights and optimizer momentum/history intact.
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.pth')

# load
ckpt = torch.load('checkpoint.pth')
model.load_state_dict(ckpt['model_state_dict'])
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
- For reproducibility: torch.manual_seed(42), np.random.seed(42), random.seed(42); consider torch.backends.cudnn.deterministic = True but beware of performance tradeoffs.
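Those seeding calls are commonly bundled into one helper (a sketch; `seed_everything` is just an illustrative name):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed all the RNGs that typically affect a training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

seed_everything(42)
a = torch.rand(3)
seed_everything(42)
b = torch.rand(3)
print(torch.equal(a, b))  # True: same seed, same "random" numbers
```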
Debugging tips (because training sometimes fails like a soap opera)
- Watch the loss: if it's NaN, check for a too-high learning rate, exploding gradients, or bad inputs/labels.
- If the loss doesn't decrease: adjust the lr (try both higher and lower), switch optimizers, verify the shapes of model outputs and loss inputs, or overfit a tiny dataset (sanity check).
- Use tensorboard or weights & biases for scalars, histograms, and gradient visualizations.
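The "overfit a tiny dataset" check from the list above, as a sketch (assuming PyTorch; the toy data and model are made up): if the model, loss, and optimizer are wired correctly, loss on a single fixed batch should drop sharply.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(16, 4)                               # one tiny fixed batch
y = X @ torch.tensor([[1.0], [2.0], [-1.0], [0.5]])  # perfectly learnable targets

model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

first_loss = None
for _ in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    if first_loss is None:
        first_loss = loss.item()

print(f"{first_loss:.4f} -> {loss.item():.4f}")  # should drop toward zero
```

If the loss refuses to fall even here, the bug is in your plumbing, not your data.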
Putting it into your data-science workflow
- Integrate preprocessing from your scikit-learn pipeline upstream (StandardScaler, feature transforms) — save the transforms and apply consistently in train/val/test.
- Treat the PyTorch model as a stage in a pipeline: Data ingestion → sklearn-style preprocessing → PyTorch Dataset/DataLoader → training loop → evaluation and export.
- Use grid search or optuna for hyperparameter search; wrap training into a function that returns validation metrics so it plays nice with hyperparameter optimization.
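Wrapping training into a metric-returning function might look like this (a sketch on a toy regression task; `fit_and_score` and its arguments are illustrative, and the body stands in for your real training loop):

```python
import torch
import torch.nn as nn

def fit_and_score(lr: float, epochs: int = 100) -> float:
    """Train on a toy task and return validation loss for a hyperparameter search."""
    torch.manual_seed(0)  # seed inside, so trials are comparable
    X = torch.randn(64, 4)
    y = X @ torch.tensor([[1.0], [2.0], [-1.0], [0.5]])
    X_train, y_train = X[:48], y[:48]
    X_val, y_val = X[48:], y[48:]

    model = nn.Linear(4, 1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(X_train), y_train).backward()
        optimizer.step()

    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()

# Any search tool (grid search, Optuna, ...) can now just minimize this number.
print(fit_and_score(lr=1e-2))
```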
Key takeaways
- The training loop is where forward, backward, and optimizer.step() meet to make learning happen.
- Choose optimizers depending on the problem: Adam to iterate quickly; SGD+momentum for final accuracy and better generalization in many cases.
- Remember: zero_grad(), backward(), step() — the holy trinity.
- Use schedulers, checkpointing, and reproducibility practices to make experiments reliable and comparable.
"The optimizer is your coach; the scheduler is your season plan; the training loop is the daily grind. Win the grind and you win the model."
If you want, I can:
- Give a ready-to-run training loop template with AMP and LR scheduler; or
- Show how to wrap a PyTorch model into an sklearn-friendly API for hyperparameter search; or
- Provide a checklist for debugging stalled training.
Which one should we riff on next?