Practical Verification, Debugging, and Validation Pipelines
A focused module on building reliable, end-to-end validation and debugging workflows, ensuring reproducibility and rapid incident response in real-world pipelines.
10.2 Debugging Training Instability — When Your Draconian LLM Starts Sneezing Molten NaNs
"Training instability is just the model's way of saying: I am dramatic and I will not be tamed without snacks and instrumentation." — Probably a researcher
You already learned how to wire up end-to-end validation pipelines (see 10.1) and you've read about the future tooling and safety concerns (9.15, 9.14). Now we get into the gritty, hands-on art of stabilizing training when the model behaves like an over-caffeinated dragon. This chapter is the practical, pragmatic toolkit for diagnosing and fixing instability: NaNs, exploding losses, wild oscillations, catastrophic forgetting mid-finetune, and training that looks like a roller coaster built by a politician.
Why this matters (quick, not preachy)
Instability costs you time, money, and trust. Small numerical issues turn into days of waste. For production-safe, performance-efficient fine-tuning, you need reproducible, tame training runs. This section gives you prioritized checks and interventions so you can find the culprit without ritual sacrifice.
Quick mental model: symptom → layer → root cause
Think of instability like symptoms in a hospital triage. A seizure (NaN gradients) is serious; fever (loss spikes) could be many things. Mapping symptom → layer → cause helps you triage faster.
Typical instability symptoms
- Immediate NaNs/inf in loss or gradients right after a weight update
- Exploding gradients: gradient norms grow rapidly
- Loss collapse to constant (no learning) or wild oscillations
- Plateau followed by catastrophic divergence mid-finetune
- High variance between seeds / runs (non-deterministic instability)
Quick table: symptom, likely cause, quick fix
| Symptom | Likely cause(s) | Quick mitigations to try first |
|---|---|---|
| NaNs in loss | Mixed-precision overflow/underflow, bad token indices, log(0), out-of-range labels | Run the suspect op in fp32, validate tokenization, clamp logits, check label ranges |
| Exploding gradients | LR too high, no gradient clipping, fragile optimizer numerics | Reduce LR by 3–10x, enable gradient clipping, raise the Adam/AdamW eps |
| Wild oscillations | Bad LR schedule, momentum issues | Use smaller LR, add LR warmup, use cosine/linear scheduler |
| Silent no-learning | Teacher forcing mismatch, label smoothing too large | Check loss function, remove excessive regularization |
| High run variance | RNG, unseeded data shuffles, distributed sync | Fix seeds, deterministic dataloaders, validate gradient accum steps |
Step-by-step debugging checklist (do these in order)
Reproduce the failure deterministically
- Freeze random seeds, single-GPU if possible, single worker dataloader. If failure disappears at smaller scale, you have a distributed or data race issue.
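A minimal sketch of pinning the obvious randomness sources in a PyTorch run before you chase ghosts; the exact flags you need vary by version and backend, and the seed value is just illustrative:

```python
# Pin the usual sources of randomness before trying to reproduce the failure.
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 1234) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels; warn_only avoids hard failures for ops
    # that have no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False
    # Needed by some deterministic cuBLAS paths.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

seed_everything()
# While debugging, also pass a seeded generator and num_workers=0 to your DataLoader:
# DataLoader(dataset, shuffle=True, num_workers=0, generator=torch.Generator().manual_seed(1234))
```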
Check data and tokenizer
- Unit-test your batches. Are token indices in range? Are labels aligned? Do sequence lengths explode? Use tiny synthetic batches to validate shapes and types.
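A hedged example of such a batch sanity check. It assumes batches are dicts of tensors with `input_ids` and `labels`, and that `-100` is the ignore index (a common convention; adjust to your own collator):

```python
import torch

def check_batch(batch: dict, vocab_size: int, max_len: int, ignore_index: int = -100) -> None:
    """Fail fast if a batch has out-of-range tokens, misaligned labels, or absurd lengths."""
    input_ids, labels = batch["input_ids"], batch["labels"]
    assert input_ids.dtype == torch.long, f"input_ids dtype is {input_ids.dtype}"
    assert input_ids.shape == labels.shape, "inputs and labels are not aligned"
    assert input_ids.shape[1] <= max_len, f"sequence length {input_ids.shape[1]} > {max_len}"
    assert input_ids.min() >= 0 and input_ids.max() < vocab_size, "token index out of range"
    valid = (labels == ignore_index) | ((labels >= 0) & (labels < vocab_size))
    assert valid.all(), "label outside vocab range (and not the ignore index)"

# Run it over a handful of batches before committing to a long training job:
# for _, batch in zip(range(20), train_loader):
#     check_batch(batch, vocab_size=tokenizer.vocab_size, max_len=2048)
```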
Instrument gradients and parameter stats
- Log per-layer gradient norms, parameter norms, and activations (mean/std). Watch for sudden spikes.
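One way to collect that telemetry: forward hooks for activation stats plus a post-backward sweep over parameters. A sketch assuming a standard `nn.Module`; hand the resulting dict to whatever logger you use:

```python
import torch
import torch.nn as nn

def attach_activation_stats(model: nn.Module, stats: dict):
    """Record mean/std of each module's output on every forward pass."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inp, out, name=name):
            if torch.is_tensor(out):
                stats[f"act/{name}/mean"] = out.detach().float().mean().item()
                stats[f"act/{name}/std"] = out.detach().float().std().item()
        handles.append(module.register_forward_hook(hook))
    return handles  # keep these so you can .remove() them when done

def grad_param_stats(model: nn.Module) -> dict:
    """Call after loss.backward(): per-parameter gradient and weight norms."""
    stats = {}
    for name, p in model.named_parameters():
        stats[f"param_norm/{name}"] = p.detach().norm().item()
        if p.grad is not None:
            stats[f"grad_norm/{name}"] = p.grad.detach().norm().item()
    return stats
```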
Narrow to model vs data vs optimizer
- Try a few controlled experiments: same data with a small model; same model with a tiny synthetic dataset; optimizer tweaks. This isolates the axis of failure.
Check numerics and precision
- If using AMP (mixed precision), run fp32 to see if problem disappears. Tweak the AMP loss scale, or use dynamic loss-scaling.
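The cheapest precision experiment is a toggle: run the identical step with AMP off and see whether the problem vanishes. A sketch using `torch.cuda.amp` (newer PyTorch prefers `torch.amp`), with `GradScaler` providing dynamic loss scaling; `model`, `optimizer`, and `train_loader` are assumed to exist:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

use_amp = True                         # flip to False to test whether fp32 fixes the instability
scaler = GradScaler(enabled=use_amp)   # dynamic loss scaling: shrinks the scale after overflows

for step, batch in enumerate(train_loader):
    optimizer.zero_grad(set_to_none=True)
    with autocast(enabled=use_amp):
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                      # so clipping sees the true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                          # skips the step if grads overflowed
    scaler.update()
    if step % 50 == 0:
        print(f"step {step} loss {loss.item():.4f} loss_scale {scaler.get_scale():.0f}")
```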
Tune learning rate & warmup
- Reduce LR by factors of 3–10. Add/extend warmup. Replace abrupt LR schedule with a gentler one.
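A sketch of linear warmup into cosine decay built on a plain `LambdaLR`, so there is no extra dependency; `warmup_steps` and `total_steps` are placeholder values:

```python
import math
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps = 500, 50_000   # illustrative values

def warmup_cosine(step: int) -> float:
    """Multiplier on the base LR: linear ramp, then cosine decay toward zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)
# Call scheduler.step() once per optimizer step; log scheduler.get_last_lr() alongside loss.
```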
Apply safety nets
- Gradient clipping (global norm), weight decay sanity checks, stable optimizer hyperparameters (eps for Adam family), small batch sizes to keep updates stable.
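The safety nets bundled into one hedged sketch: a slightly larger Adam eps, decoupled weight decay, and global-norm clipping whose return value tells you how often you actually clip. The numbers are starting points, not recommendations, and `model` and `loss` are assumed from your training step:

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,            # conservative starting LR for fine-tuning
    betas=(0.9, 0.95),
    eps=1e-6,           # larger than the default 1e-8; damps noisy updates in fp16/bf16
    weight_decay=0.1,
)

max_norm, clip_count = 1.0, 0
loss.backward()
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
if total_norm > max_norm:
    clip_count += 1     # log this: clipping on most steps usually means the LR is too high
optimizer.step()
```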
Revisit architecture components
- LayerNorm vs BatchNorm issues, attention masking bugs, rotary or relative position encodings incorrectly applied. Replace or isolate suspect layers.
Check distributed training
- Are gradients being averaged correctly? Are fp16 overflow settings consistent across workers? Ensure consistent loss scaling and gradient synchronization.
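One way to verify gradients are actually synchronized: all-reduce a per-rank gradient fingerprint and compare extremes. A sketch that assumes `torch.distributed` is initialized and that you call it after backward (i.e., after DDP's all-reduce has run):

```python
import torch
import torch.distributed as dist

def assert_grads_in_sync(model, atol: float = 1e-6) -> None:
    """After backward under DDP, every rank should hold identical gradients."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        local = p.grad.detach().norm()
        global_max, global_min = local.clone(), local.clone()
        dist.all_reduce(global_max, op=dist.ReduceOp.MAX)
        dist.all_reduce(global_min, op=dist.ReduceOp.MIN)
        if (global_max - global_min).abs().item() > atol:
            raise RuntimeError(f"gradient for {name} differs across ranks")
```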
Make a minimal repro and file a bug
- If unresolved, reduce to the smallest script that reproduces the instability and open an issue with precise logs.
Instrumentation: what to record and how
- Scalars: loss, per-layer grad_norm, param_norm, lr, amp_loss_scale, clip_counts
- Histograms: activations, gradients per layer, weight distribution snapshots
- GPU telemetry: memory, temp, OOM logs, NCCL errors
Use existing tools (TensorBoard, Weights & Biases, MLflow) and integrate them into your end-to-end validation pipeline from 10.1 so intermittent instabilities trigger alerts and artifact collection.
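For example, a minimal wiring into TensorBoard's `SummaryWriter`; swap in `wandb.log` or `mlflow.log_metrics` if that is what your 10.1 pipeline already uses (the `grad_param_stats` helper is the one sketched earlier):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/finetune-debug")

def log_step(step: int, scalars: dict) -> None:
    """Push loss, grad norms, LR, loss scale, clip counts, ... to TensorBoard."""
    for key, value in scalars.items():
        writer.add_scalar(key, value, global_step=step)

# e.g. log_step(step, {"loss": loss.item(), "lr": scheduler.get_last_lr()[0], **grad_param_stats(model)})
```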
Code sketch for gradient checks (PyTorch):

```python
# Log per-parameter gradient norms and fail fast on NaN/inf before the optimizer step.
import math

for step, batch in enumerate(train_loader):
    loss = model(batch)
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        gnorm = p.grad.norm().item()
        if math.isnan(gnorm) or math.isinf(gnorm):
            # save_debug_artifacts is your own hook: dump the batch plus model/optimizer state.
            save_debug_artifacts(step, batch, model.state_dict(), optimizer.state_dict())
            raise RuntimeError(f"NaN/inf gradient in {name} at step {step}")
    optimizer.step()
    optimizer.zero_grad()
```
Specific gotchas (real-life gremlins)
- Label leakage: your validation loss drops because labels leaked into inputs. Suddenly training looks fine — until production explodes.
- Tokenization mismatch: training uses tokenizer v1, inference uses v2 (or padding changes). Off-by-one token indices produce garbage.
- Loss function cliffs: log/softmax misuse, dividing by zero in custom loss, numerical underflow in log-sum-exp (see the sketch after this list).
- Optimizer state mismatch in checkpoint reload: stale momentum makes resumed training explode.
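On those loss-function cliffs: a naive `log(softmax(x))` underflows to `-inf` for extreme logits, while the built-in stabilized ops do not. A small illustrative sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, -1000.0]])        # extreme, but not unrealistic after a bad update

naive = torch.log(torch.softmax(logits, dim=-1))   # -> [0., -inf]: log(0) underflows
stable = F.log_softmax(logits, dim=-1)             # subtracts the max internally: finite values

# Same idea for custom losses: prefer torch.logsumexp(x, dim=-1)
# over torch.log(torch.exp(x).sum(dim=-1)), which overflows for large x.
print(naive, stable, torch.logsumexp(logits, dim=-1))
```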
Interventions prioritized (fast → invasive)
- Lower LR / more warmup
- Switch off AMP or adjust loss scale
- Gradient clipping (global norm)
- Validate data pipeline and deterministic seeds
- Replace suspicious custom ops with safe PyTorch alternatives
- Reduce batch size or accumulate gradients more carefully (see the accumulation sketch after this list)
- If all else fails: instrument minimal reproducible example and bisect code/changes
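And for careful gradient accumulation, a sketch where the loss is divided by the number of accumulation steps so accumulated gradients average rather than sum, and the optimizer only steps on the boundary (`accum_steps` is illustrative; `model`, `optimizer`, and `train_loader` are assumed):

```python
import torch

accum_steps = 8   # effective batch = per-step batch * accum_steps

optimizer.zero_grad(set_to_none=True)
for step, batch in enumerate(train_loader):
    loss = model(batch) / accum_steps        # scale so accumulated grads average, not sum
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```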
Closing — The cha-cha of stability and robustness
Training stability is detective work. Start with reproducibility, instrument heavily, run quick hypothesis tests, and escalate interventions from hyperparameter hygiene to architectural surgery. Tie everything back into your E2E validation pipeline (10.1) so future runs fail loudly and with artifacts, not silently in production. Remember the broader context from chapter 9: as you adopt MoE, RAG, or continual learning, these instability sources multiply. Build tight telemetry and small repro scripts now, and you will save compute credits, time, and sanity later.
Final one-liner to remember: numerical issues are honest — they tell you exactly where your assumptions break. Listen, instrument, and fix.
Checklist to carry away
- Reproducible minimal repro? ✅
- Data unit tests? ✅
- Gradient/activation telemetry in place? ✅
- AMP vs fp32 experiment done? ✅
- LR/warmup/clip tried? ✅
If you can confidently answer yes to those, you have the bones of a stable training pipeline. Now go tame that draconian language model. Snack optional, persistence required.