Performance and Resource Optimization
Techniques to maximize throughput and accuracy while minimizing GPU, memory, and energy costs through profiling, memory management, data pipelines, and scheduling strategies.
2.1 Profiling CPU, GPU, and I/O Bottlenecks — The Sleuth's Guide to Finding the Slowpoke
"If your model trains slower than molasses uphill in January, something is choking — and it's probably not the model's ego." — Your slightly theatrical TA
You're past the basics (remember Foundations of Fine-Tuning? The part about hardware tradeoffs and reproducibility?), so now we get practical: how to find exactly where time or bytes are being wasted when fine-tuning large language models. This section is a hands-on detective kit: measure, isolate, interpret, and fix—repeat until performance bows before you.
Why profiling matters (without repeating the lecture)
You know what a GPU is and why mixed precision helps (we covered “Hardware Considerations for Foundations”). But a fast GPU doesn't magically accelerate training if it's waiting on the CPU, the disk, or a reluctant PCIe bus. Profiling exposes the chokepoints so you can stop guessing and start optimizing by cause.
The profiling workflow (short, sharp, repeatable)
- Baseline: Record a representative run (model + batch size + dataloader). Keep seeds/logs for later reproducibility.
- Observe system-wide metrics: GPU utilization, CPU load, disk throughput, network.
- Instrument specific layers: framework profiler (PyTorch/TensorFlow).
- Isolate: synthetic data, single-process runs, CPU-only tokenization tests.
- Interpret & act: map symptoms → remedies.
- Repeat: confirm improvements; keep the logs.
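The baseline step can be sketched as a tiny timing harness that splits data-wait time from compute time per step. This is a minimal pure-Python sketch; `load_batch` and `train_step` are hypothetical stand-ins for your real dataloader iterator and training step:

```python
import time

def profile_steps(load_batch, train_step, num_steps=10):
    """Time each step, separating data-loading wait from compute.

    load_batch() -> batch and train_step(batch) are stand-ins for
    your real dataloader and training step.
    """
    data_time, compute_time = 0.0, 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = load_batch()           # time spent waiting on data
        t1 = time.perf_counter()
        train_step(batch)              # time spent in compute
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    total = data_time + compute_time
    return {
        "data_s": data_time,
        "compute_s": compute_time,
        "data_fraction": data_time / total if total else 0.0,
    }
```

A high `data_fraction` (say, above 0.3) suggests the GPU is starved by the input pipeline rather than limited by the model itself.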
What to measure and which tools to use
| What you're seeing | Key metric(s) | Tools / commands | What it usually means |
|---|---|---|---|
| GPU is idle or low | GPU utilization (%) | nvidia-smi, nsys, NVIDIA Nsight | Data starvation, CPU preprocessing bottleneck, or synchronization stalls |
| GPU memory contention | used/total memory, OOMs | nvidia-smi, torch.cuda.memory_summary() | Batch too large, memory fragmentation, no gradient checkpointing |
| High kernel time but low occupancy | SM utilization, memory bandwidth | nsys / Nsight Compute | Inefficient kernels / small batches / wrong datatype |
| CPU pegged near 100% | %CPU, load average, per-process CPU | top/htop, pidstat, perf | Heavy tokenization, too many Python callbacks, GIL-bound code |
| Slow data reads | read throughput (MB/s), iops, queue length | iostat, iotop, fio, dstat | Disk slow (NFS), unoptimized formats, lack of caching |
| Slow dataloader | Dataloader worker idle/queue metrics | PyTorch profiler, custom timing | Bad collate_fn, too few workers, slow transforms |
Quick checklist: "Is it the GPU or the rest of the circus?"
- Run: nvidia-smi -l 1 --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv
- If GPU utilization < 50% for a heavy model: suspect CPU/I/O starvation.
- If GPU memory used is low but utilization is low: maybe batch size too small or synchronization overhead.
Code block — a good monitoring one-liner:
watch -n 1 "nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used --format=csv"
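The CSV that the query above emits can also be checked programmatically, e.g. in a monitoring script. A small sketch (the column order is assumed to match the query above, and the 50% threshold is the same rough heuristic as in the checklist):

```python
def diagnose_gpu_csv(csv_text, low_util_pct=50):
    """Flag likely starvation from nvidia-smi CSV output with columns:
    utilization.gpu, utilization.memory, memory.used."""
    lines = [ln.strip() for ln in csv_text.strip().splitlines()]
    samples = []
    for line in lines[1:]:  # skip the header row
        gpu_util, _mem_util, _mem_used = [f.strip() for f in line.split(",")]
        samples.append(int(gpu_util.split()[0]))  # "37 %" -> 37
    avg = sum(samples) / len(samples)
    if avg < low_util_pct:
        return f"avg GPU util {avg:.0f}% -> suspect CPU/I/O starvation"
    return f"avg GPU util {avg:.0f}% -> GPU is reasonably busy"
```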
Profiling recipes (PyTorch-focused, but principles are universal)
1) Quick PyTorch profiler run
Use the profiler to see where time goes at the operator level.
from torch.profiler import profile, record_function, ProfilerActivity

# model and input are assumed to be defined and on the right device
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("model_infer"):
        model(input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
This tells you which ops dominate CUDA time and where CPU time is spent (e.g., tokenization, data transforms).
2) System-level GPU + CPU trace: NVIDIA Nsight Systems (nsys)
nsys profile -o profile_demo --trace=cuda,osrt,nvtx python train.py
nsys-ui profile_demo.nsys-rep   # .qdrep on older nsys versions
This shows GPU kernel launches, CPU threads, copy transfers (H2D/D2H), and where synchronizations occur.
3) Disk and I/O probing
- iostat -x 1 10   # watch r/s, rkB/s, and await per device
- iotop -o         # active I/O per process
- fio for synthetic disk benchmarking: measure read/write throughput and latency.
If your dataset is on NFS, expect higher latencies — benchmark it.
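As a complement to iostat/fio, a quick stdlib-only sanity check of sequential read speed can be sketched like this. Note the big caveat in the docstring: the OS page cache will inflate the number unless the file is much larger than RAM, so treat this as a smoke test, not a benchmark:

```python
import os
import tempfile
import time

def measure_read_throughput(size_mb=64, chunk_mb=4):
    """Write a scratch file, then time sequential reads of it.

    Rough sanity check only: the OS page cache will inflate the
    result for small files; use fio for rigorous benchmarks.
    Returns approximate throughput in MB/s.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        path = f.name
    try:
        t0 = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_mb * 1024 * 1024):
                pass  # discard; we only care about read time
        elapsed = time.perf_counter() - t0
        return size_mb / elapsed
    finally:
        os.unlink(path)
```

Running this once against local disk and once against the NFS mount makes the latency gap concrete.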
Common symptoms and pragmatic fixes
Symptom: GPU utilization low, CPU 100%
Fixes: Increase DataLoader num_workers, move CPU transforms into a pre-processing pipeline, switch heavy transforms to vectorized NumPy/C++ code, use shared memory or memory-mapped files.

Symptom: GPU waiting on transfers (many small cudaMemcpy calls)
Fixes: Use pinned (page-locked) memory, batch transfers, overlap H2D copies with compute via streams, use asynchronous dataloading, and tune prefetch_factor.

Symptom: Dataloader workers starve the GPU
Fixes: Increase num_workers, use persistent_workers=True (PyTorch), pre-serialize transforms (cache pre-tokenized data), adopt WebDataset or LMDB/Arrow for sharded reads.

Symptom: Kernel launch overhead / low occupancy
Fixes: Increase batch size, use mixed precision, fuse kernels, update CUDA/cuDNN, or use library optimizations (torch.compile, Triton kernels, apex).

Symptom: Storage-limited (low MB/s)
Fixes: Move the dataset to NVMe/SSD, use a local cache on the node, use parallel read formats (WebDataset tar shards), prefetch reads.
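The prefetching idea behind several of these fixes (async dataloading, DataLoader workers, prefetch_factor) can be illustrated with stdlib threads and a bounded queue. This is a conceptual sketch, not a replacement for PyTorch's DataLoader; `load_fn` is a hypothetical per-batch loading function:

```python
import queue
import threading

def prefetching_loader(load_fn, num_batches, prefetch=2):
    """Yield batches while a background thread loads the next ones,
    overlapping 'I/O' with 'compute' -- the same idea behind
    DataLoader workers and prefetch_factor."""
    q = queue.Queue(maxsize=prefetch)   # bounded buffer of ready batches
    sentinel = object()                 # marks end of the stream

    def producer():
        for i in range(num_batches):
            q.put(load_fn(i))           # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch                     # consumer 'computes' while producer loads
```

While the consumer is busy with one batch, the producer thread is already filling the buffer with the next, so slow loads and compute overlap instead of serializing.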
Tiny advanced notes (so you can sound intimidating in meetings)
- PCIe vs NVLink: cross-GPU gradients or tensor shards transferred over PCIe are expensive — prefer NCCL and ensure NCCL uses NVLink if available.
- eBPF/profilers like BCC help when you suspect kernel-level bottlenecks.
- For multi-node, network saturation shows as low GPU utilization with low disk/CPU — profile NICs (ethtool, iftop) and check RDMA performance.
A minimal reproducible profiling checklist (copy-paste ready)
- Re-run with deterministic seed and small walltime to reproduce the slowdown.
- Collect system snapshot: nvidia-smi, top -b -n1, iostat -x 1 5.
- Run PyTorch profiler for 10–50 steps and inspect operator table.
- Run nsys to see H2D/D2H and kernel timelines.
- Replace dataset with synthetic random tensors — if speed jumps, it's I/O/CPU.
- Tweak one knob at a time (num_workers, batch_size, pin_memory) and re-measure.
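The synthetic-data swap in the checklist can be made into a small A/B helper. A pure-Python sketch, with `real_iter_fn`, `synth_iter_fn`, and `step_fn` as hypothetical stand-ins for your real loader, a random-tensor generator, and the training step:

```python
import time

def compare_pipelines(real_iter_fn, synth_iter_fn, step_fn, steps=20):
    """Time the same step_fn fed by real vs synthetic batches.

    If the synthetic run is much faster, the bottleneck is the input
    pipeline (I/O or CPU preprocessing), not the model.
    """
    def run(iter_fn):
        it = iter_fn()                  # fresh iterator per run
        t0 = time.perf_counter()
        for _ in range(steps):
            step_fn(next(it))
        return time.perf_counter() - t0

    real_t = run(real_iter_fn)
    synth_t = run(synth_iter_fn)
    return {
        "real_s": real_t,
        "synth_s": synth_t,
        "speedup_if_synthetic": real_t / synth_t,
    }
```

A `speedup_if_synthetic` well above 1 is the evidence you need before spending time on dataloader or storage fixes.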
Closing — a dramatic but useful mic-drop
Performance optimization is not guesswork; it's measurement followed by targeted change. Treat profiling like diagnostics: collect evidence, form a hypothesis, change exactly one variable, and measure again. Be patient — small wins (pinned memory, better dataloader design, or tiny batch-size changes) compound into large savings on long training runs.
Final thought: the GPU is a sprinter with expensive shoes. If you want it to run, keep the track (I/O), starting blocks (CPU prep), and relay passes (PCIe/NVLink transfers) in good order.
Now go profile something and make your cluster sing. Or at least stop it from whispering.