Performance and Resource Optimization
Techniques to maximize throughput and accuracy while minimizing GPU, memory, and energy costs through profiling, memory management, data pipelines, and scheduling strategies.
2.3 Throughput and Latency Trade-offs — The Art of Choosing Between "Fast and Furious" vs "Responsive and Polite"
"You can have throughput, or you can have latency. You can't have both at scale — unless you pay in memory, network magic, or dark CPU rituals." — Probably me, 3am, profiling a cluster
Quick reminder: we've already profiled where your GPUs/CPUs/IO choke (Section 2.1) and surgically reduced memory usage with tricks like mixed precision, activation checkpointing, and ZeRO (Section 2.2). Now we decide how to spend those wins: pump up throughput or shave off latency? This section tells you how to think about the trade-offs and gives practical knobs to tune.
Why this matters (and why it's painful)
- Throughput = how many tokens / samples you process per second (bulk efficiency). Think: fine-tuning terabytes of data overnight.
- Latency = time to respond to one sample or token (responsiveness). Think: interactive alignment runs, on-the-fly validation, or low-latency inference loops.
They pull in opposite directions: bigger batches and more pipelining → more throughput but often worse latency. Smaller batches, micro-batching, and synchronous steps → better latency but worse throughput and GPU utilization.
Ask yourself: "Is this batch job trying to eat the dataset, or is this a demo that must feel snappy?" The answer drives architecture.
Core metrics and tiny formulas (so you stop guessing)
- Throughput (tokens/sec) = total_tokens_processed / total_time
- Latency (ms/sample) = step_time / samples_per_step (for token-level serving, think time per generated token)
- GPU Utilization (%) = busy_time / wall_time
Pseudocode to measure during runs (PyTorch-style; assumes `model`, `criterion`, `optimizer`, and a fixed batch of `tokens_per_step` tokens):

    step_times = []
    start = now()
    for step in range(N):
        t0 = now()
        optimizer.zero_grad()
        out = model(batch)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        step_times.append(now() - t0)
    throughput = N * tokens_per_step / (now() - start)
    p99_latency = percentile(step_times, 99)

On GPUs, synchronize (e.g. `torch.cuda.synchronize()`) before reading timers, or your step times will lie.
Measure p50, p95, p99 latencies and tokens/sec — averages lie, percentiles sing.
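A minimal sketch of the percentile math using only the standard library; the timings below are made up to show how a mean hides the tail:

```python
import statistics

def latency_percentiles(step_times_ms):
    """Return p50/p95/p99 from a list of per-step latencies (ms).

    statistics.quantiles with n=100 gives the 99 percentile boundaries;
    fine for reporting, though long runs deserve a proper histogram.
    """
    qs = statistics.quantiles(step_times_ms, n=100, method="inclusive")
    # qs[i] is the (i+1)-th percentile
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

times = [10.0] * 95 + [50.0] * 4 + [500.0]   # mostly fast, a few stragglers
stats = latency_percentiles(times)
mean = sum(times) / len(times)
# The mean (~16.5 ms) looks healthy; p99 exposes the 500 ms straggler.
```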
The trade-off menu: knobs and their effects
1) Batch Size & Gradient Accumulation
- Bigger batch → higher throughput (better GPU amortization), worse per-sample latency. Also affects convergence dynamics (effective batch size matters).
- Gradient accumulation keeps the per-step microbatch small (lower activation memory, shorter micro-steps) while achieving a large effective batch size (good throughput). The cost: each optimizer step now spans K micro-steps, so wall-clock time per update grows, and gradient buffers stay live across the whole accumulation window.
When to use: Use large effective batches for long batch jobs; for interactive tuning, cap microbatch size to hit latency targets.
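To see why accumulation reproduces the large-batch update, here is a toy sketch on a one-parameter model; the data, learning rate, and the divide-by-K scaling convention are all illustrative assumptions:

```python
# Toy gradient accumulation on y = w * x with squared loss: K small
# micro-steps produce exactly the same update as one large batch.

def grad(w, xs, ys):
    # d/dw mean((w*x - y)^2) = mean(2*(w*x - y)*x)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, lr, K = 0.0, 0.1, 2              # accumulate over K = 2 microbatches

# Large-batch reference update
w_big = w - lr * grad(w, xs, ys)

# Accumulated microbatch update: sum scaled microbatch grads, step once
mb = len(xs) // K
acc = 0.0
for i in range(0, len(xs), mb):
    # dividing by K here mirrors scaling the loss by 1/K before backward()
    acc += grad(w, xs[i:i + mb], ys[i:i + mb]) / K
w_acc = w - lr * acc
```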
2) Data Parallelism vs Tensor/Pipeline Parallelism
- Data parallelism: simple, good throughput as GPU count rises; synchronization (all-reduce) can add latency but is commonly optimized.
- Tensor parallelism: splits operations across devices — can reduce per-step latency for huge models but introduces inter-GPU communication at operator granularity.
- Pipeline parallelism: increases throughput by staging microbatches across ranks, but adds pipeline bubble latency that hurts single-sample latency.
Real-world pattern: combine tensor + pipeline + data parallelism (a.k.a. 3D parallelism). That's great if you're okay with higher single-sample latency and want maximum throughput.
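The pipeline bubble can be quantified with the standard GPipe-style model: with p stages and m microbatches, at best a fraction m / (m + p - 1) of stage time is useful work. A sketch (idealized; ignores communication cost and uneven stage splits):

```python
def pipeline_efficiency(num_stages: int, num_microbatches: int) -> float:
    """Ideal fraction of time pipeline stages do useful work.

    Classic fill/drain bubble model: efficiency = m / (m + p - 1)
    for p stages and m microbatches in flight per step.
    """
    p, m = num_stages, num_microbatches
    return m / (m + p - 1)

# More microbatches amortize the fill/drain bubble:
low = pipeline_efficiency(num_stages=8, num_microbatches=1)    # 1/8
high = pipeline_efficiency(num_stages=8, num_microbatches=64)  # ~0.90
```

This is why pipeline parallelism loves big batches split into many microbatches, and why a single interactive sample (m = 1) gets the worst of it.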
3) Activation Checkpointing (Recompute)
- Saves memory (helps you increase batch size/scale), but recomputes activations during backward pass → higher compute cost and increased latency per step. Use selective checkpointing to balance.
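A back-of-envelope cost model for uniform checkpointing; the abstract "units" and the k + L/k memory approximation are simplifications for intuition, not any framework's actual accounting:

```python
def checkpoint_costs(num_layers: int, num_segments: int):
    """Toy cost model for uniform activation checkpointing.

    Keep one activation per segment boundary plus one segment's worth
    while recomputing: memory ~ k + L/k units, minimized near k = sqrt(L).
    Extra compute is roughly one additional forward pass.
    """
    L, k = num_layers, num_segments
    memory_units = k + L / k
    extra_forward_fraction = 1.0        # ~+100% forward compute in backward
    return memory_units, extra_forward_fraction

no_ckpt_memory = 64                     # storing all 64 layers' activations
mem_sqrt, extra = checkpoint_costs(num_layers=64, num_segments=8)  # 8 + 8
```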
4) ZeRO (Optimizer State Sharding)
- ZeRO Stages 1/2/3 progressively shard more state (optimizer states, then gradients, then parameters) → lowers per-GPU memory, enabling larger batches (better throughput potential), but adds collective communication (extra all-gathers and reduce-scatters, especially at Stage 3) → higher per-step latency. Choose the stage based on whether you prioritize memory + throughput or low-latency steps.
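The per-GPU memory math makes the stages concrete. A sketch assuming mixed-precision Adam with the usual accounting of 2 + 2 + 12 bytes per parameter (fp16 weights, fp16 gradients, fp32 master weights + momentum + variance); activations and fragmentation excluded:

```python
def zero_memory_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory (GB) under ZeRO sharding.

    Stage 1 shards optimizer states, Stage 2 adds gradients,
    Stage 3 adds the parameters themselves.
    """
    psi = params_billion * 1e9
    N = num_gpus
    p, g, o = 2 * psi, 2 * psi, 12 * psi   # bytes for params/grads/opt state
    if stage == 0:
        total = p + g + o
    elif stage == 1:
        total = p + g + o / N
    elif stage == 2:
        total = p + (g + o) / N
    else:  # stage 3
        total = (p + g + o) / N
    return total / 1e9

# A hypothetical 7B model on 8 GPUs:
unsharded = zero_memory_gb(7, 8, stage=0)   # 112 GB -- doesn't fit one GPU
zero3 = zero_memory_gb(7, 8, stage=3)       # 14 GB per GPU
```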
5) Mixed Precision (FP16/BF16)
- In training, mixed precision usually improves throughput and lowers latency by halving memory and increasing compute throughput. Watch for numerical instability; BF16 is usually safer on Ampere+ hardware.
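You can see the FP16 underflow problem (and why loss scaling exists) with nothing but the standard library, since `struct` supports IEEE half precision; the gradient magnitude and scale factor below are illustrative:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE half precision ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

g = 1e-8                                # a plausibly tiny gradient value
# Unscaled: 1e-8 is below fp16's smallest subnormal (~6e-8) -> flushes to 0.
lost = to_fp16(g)
# Loss scaling: multiply up before the fp16 cast, divide back in fp32.
scaled = to_fp16(g * 1024) / 1024
```

BF16 sidesteps this particular failure because it keeps FP32's exponent range (at the cost of mantissa precision), which is why it rarely needs loss scaling.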
6) IO, Prefetching, and DataLoader
- Slow IO ruins both throughput and latency. Use async prefetch, pinned memory, and multiple workers. For low-latency interactive steps, cache smaller sampled subsets in RAM.
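A minimal sketch of background prefetching with a worker thread and a bounded queue (illustrative only; in real training use your framework's DataLoader with `num_workers` and `pin_memory`):

```python
import queue
import threading
import time

def prefetching_loader(load_fn, indices, depth=2):
    """Yield batches while a background thread keeps up to `depth`
    batches ready, so compute doesn't stall waiting on IO."""
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def worker():
        for i in indices:
            q.put(load_fn(i))           # blocks once `depth` batches queued
        q.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            return
        yield item

def slow_load(i):
    time.sleep(0.01)                    # stand-in for disk/network IO
    return [i] * 4                      # a fake "batch"

batches = list(prefetching_loader(slow_load, range(5)))
```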
7) Kernel Fusion & Operator Optimizations
- Fused kernels reduce kernel launch overhead and memory traffic → improves both throughput and latency, especially for smaller microbatches.
8) Quantization & QAT
- Quantization can drastically improve inference latency and throughput. For training, quantization-aware training (QAT) is heavy; usually you quantize after fine-tuning for low-latency serving.
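A sketch of symmetric per-tensor int8 quantization, the simplest post-training scheme; the weights are made up, and real toolchains use per-channel scales and calibration data:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: scale by max |w|, round, clamp to
    [-127, 127]. Returns int codes plus the scale for dequantizing."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.001, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Worst-case error of symmetric rounding is half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Note how the tiny weight (0.001) rounds to code 0: outliers stretch the scale and crush small values, which is exactly why per-channel quantization and calibration exist.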
A compact comparison table
| Strategy | Improves Throughput | Improves Latency | Memory Impact | Notes |
|---|---|---|---|---|
| Larger batch size | ✅ | ❌ | ↑ | Best for batch jobs; watch convergence |
| Gradient accumulation | ✅ | ⚠️ | ↔ | Good trade when microbatch is small |
| Pipeline parallelism | ✅ | ❌ | ↔ | Pipeline bubbles hurt single-sample latency |
| Tensor parallelism | ✅ | ✅/⚠️ | ↓ (per GPU) | Comm-heavy but often reduces per-step time |
| Activation checkpointing | ✅ (indirect) | ❌ | ↓ | Saves memory, costs recompute |
| ZeRO Stage 3 | ✅ | ❌ | ↓↓ | Enables huge batch sizes at comm cost |
| Mixed precision (FP16/BF16) | ✅ | ✅ | ↓ | Usually a win |
| Kernel fusion | ✅ | ✅ | ↔ | Particularly helpful for small batches |
Real-world mini-case: Fine-tuning a 70B model across 8 A100s
Options & trade-offs:
- Use ZeRO-3 + activation checkpointing → you can use huge effective batch sizes (massive throughput) but per-step latency rises due to communication; interactive validation will feel laggy.
- Use tensor parallelism + shallower pipeline depth + BF16 → reduces latency per microbatch and keeps throughput decent, but memory may restrict batch size.
Rule of thumb: if you need interactivity (fast eval loops, human-in-the-loop), prefer smaller microbatches, fewer pipeline stages, BF16, and local caching of eval data. If you run overnight experiments, favor maximum sharding and checkpointing to maximize throughput.
Practical tuning checklist (playbook)
- Start with profiling (Section 2.1) to see whether compute, memory, or comms dominate.
- If memory-bound, apply Section 2.2 techniques (mixed precision, checkpointing, ZeRO) — but reassess latency after each change.
- Set latency targets: pick p95/p99 budgets for interactive tasks.
- Increase batch size gradually until throughput stops improving or latency budget breaks.
- Experiment with microbatch size + gradient accumulation to hit effective batch size while observing step latency.
- Test comm patterns: try reducing all-reduce frequency, enabling NCCL tuning flags, and increasing compute-comm overlap.
- Use fused kernels and optimized libraries (cuDNN, FlashAttention, Triton kernels) to reduce small-batch overhead.
- For inference, prioritize quantization and distilled models — retrain only if necessary.
Closing thoughts — the decision tree you’ll memorize painfully well
- Bulk jobs (overnight): squeeze every ounce of throughput. Memory->ZeRO->checkpoint->max batch. Accept higher per-sample latency.
- Interactive/online: keep microbatches small, reduce pipeline depth, prefer BF16 (or FP32) for numerical stability, enable kernel fusion, and cache hot data.
- Mixed needs: go hybrid. Do coarse-grained tuning in high-throughput mode, then run short interactive experiments on a smaller model or a warmed-up subset.
Final dramatic insight: optimization is negotiation. You’re bargaining between compute, memory, and network. Each time you win memory (ZeRO, checkpointing) you pay in comms or recompute; each time you chase latency with smaller batches, you anger GPU utilization gods. Use profiling often, measure the right percentiles, and let the task (and your SLA) decide the compromise.
TL;DR: Measure first, pick your villain (latency or throughput), and tune the knobs deliberately. There’s no free lunch — but there are many enjoyable little cheat codes.
Happy tuning. Burn one more epoch and then get some sleep.