
Performance-Efficient Fine-Tuning: Mastering Scalable and Cost-Effective LLM Training (How to Tame and Train Your Draconian Language Model)
Chapters

1. Foundations of Fine-Tuning
2. Performance and Resource Optimization
   2.1 Profiling CPU, GPU, and I/O Bottlenecks
   2.2 Memory Footprint Reduction Techniques
   2.3 Throughput and Latency Trade-offs
   2.4 Batch Sizing and Gradient Accumulation
   2.5 Mixed-Precision Training and Numerical Stability
   2.6 Activation Sparsity and Operator Fusion
   2.7 Data Pipeline Optimization and Prefetching
   2.8 Storage Layouts and Data Caching
   2.9 Offloading and CPU-GPU Overlap
   2.10 Model Sharding vs Data Parallelism
   2.11 Asynchronous vs Synchronous Gradient Updates
   2.12 Checkpointing, Resume, and Fault Tolerance
   2.13 Energy Efficiency and Cooling Considerations
   2.14 Hot-Cold Memory Management
   2.15 Auto-Scaling Strategies for Training Slots
3. Parameter-Efficient Fine-Tuning Methods
4. Data Efficiency and Curation
5. Quantization, Pruning, and Compression
6. Scaling and Distributed Fine-Tuning (DeepSpeed, FSDP, ZeRO)
7. Evaluation, Validation, and Monitoring
8. Real-World Applications and Deployment
9. Future of Fine-Tuning (Mixture of Experts, Retrieval-Augmented Fine-Tuning, Continual Learning)
10. Practical Verification, Debugging, and Validation Pipelines
11. Cost Modeling, Budgeting, and Operational Efficiency
12. Bonus Labs: Hands-on with Hugging Face PEFT and QLoRA on Llama/Mistral


Performance and Resource Optimization


Techniques to maximize throughput and accuracy while minimizing GPU, memory, and energy costs through profiling, memory management, data pipelines, and scheduling strategies.


2.1 Profiling CPU, GPU, and I/O Bottlenecks — The Sleuth's Guide to Finding the Slowpoke

"If your model trains slower than molasses uphill in January, something is choking — and it's probably not the model's ego." — Your slightly theatrical TA

You're past the basics (remember Foundations of Fine-Tuning? The part about hardware tradeoffs and reproducibility?), so now we get practical: how to find exactly where time or bytes are being wasted when fine-tuning large language models. This section is a hands-on detective kit: measure, isolate, interpret, and fix—repeat until performance bows before you.


Why profiling matters (without repeating the lecture)

You know what a GPU is and why mixed precision helps (we covered “Hardware Considerations for Foundations”). But a fast GPU doesn't magically accelerate training if it's waiting on the CPU, the disk, or a reluctant PCIe bus. Profiling exposes the chokepoints so you can stop guessing and start optimizing by cause.


The profiling workflow (short, sharp, repeatable)

  1. Baseline: Record a representative run (model + batch size + dataloader). Keep seeds/logs for later reproducibility.
  2. Observe system-wide metrics: GPU utilization, CPU load, disk throughput, network.
  3. Instrument specific layers: framework profiler (PyTorch/TensorFlow).
  4. Isolate: synthetic data, single-process runs, CPU-only tokenization tests.
  5. Interpret & act: map symptoms → remedies.
  6. Repeat: confirm improvements; keep the logs.
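The baseline in step 1 can be as simple as timing the two halves of the training loop separately: how long each batch takes to arrive versus how long each step takes to compute. A minimal stdlib sketch (the `loader` and `step_fn` arguments stand in for your real dataloader and train step; `time_phases` is our name, not a library function):

```python
import time

def time_phases(loader, step_fn, n_steps=50):
    """Split wall time per step into data-fetch time vs compute time."""
    fetch_s, compute_s = 0.0, 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)   # time spent waiting on the data pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)         # time spent in forward/backward/optimizer
        t2 = time.perf_counter()
        fetch_s += t1 - t0
        compute_s += t2 - t1
    return {"fetch_s": fetch_s, "compute_s": compute_s}

# If fetch_s dominates, the GPU is starving: profile the CPU/dataloader first.
```

If the fetch share is large, steps 2–4 below will almost certainly point at the CPU or I/O rather than the GPU.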

What to measure and which tools to use

| What you're seeing | Key metric(s) | Tools / commands | What it usually means |
| --- | --- | --- | --- |
| GPU is idle or low | GPU utilization (%) | nvidia-smi, nsys, NVIDIA Nsight | Data starvation, CPU preprocessing bottleneck, or synchronization stalls |
| GPU memory contention | used/total memory, OOMs | nvidia-smi, torch.cuda.memory_summary() | Batch too large, memory fragmentation, no gradient checkpointing |
| High kernel time but low occupancy | SM utilization, memory bandwidth | nsys / Nsight Compute | Inefficient kernels, small batches, wrong datatype |
| CPU pegged near 100% | %CPU, load average, per-process CPU | top/htop, pidstat, perf | Heavy tokenization, too many Python callbacks, GIL-bound code |
| Slow data reads | read throughput (MB/s), IOPS, queue length | iostat, iotop, fio, dstat | Slow disk (NFS), unoptimized formats, lack of caching |
| Slow dataloader | worker idle/queue metrics | PyTorch profiler, custom timing | Bad collate_fn, too few workers, slow transforms |

Quick checklist: "Is it the GPU or the rest of the circus?"

  • Run: nvidia-smi -l 1 --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv
  • If GPU utilization < 50% for a heavy model: suspect CPU/I/O starvation.
  • If GPU memory use and utilization are both low: the batch size may be too small, or synchronization overhead is dominating.

Code block — a good monitoring one-liner:

watch -n 1 "nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used --format=csv"
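To turn that CSV stream into a starvation verdict automatically, a small parser helps. A sketch that assumes the three-column output of the `--query-gpu=utilization.gpu,utilization.memory,memory.used` query above (the `gpu_is_starved` helper and the 50% threshold are ours, matching the rule of thumb in the checklist):

```python
import csv, io

def gpu_is_starved(smi_csv: str, util_threshold: float = 50.0) -> bool:
    """True if average GPU utilization over the sampled lines is below the
    threshold; data starvation or synchronization stalls are then likely."""
    rows = list(csv.reader(io.StringIO(smi_csv)))
    # First row is the header; utilization.gpu values look like "37 %"
    utils = [float(r[0].strip().rstrip(" %")) for r in rows[1:] if r]
    return bool(utils) and sum(utils) / len(utils) < util_threshold

sample = (
    "utilization.gpu [%], utilization.memory [%], memory.used [MiB]\n"
    "37 %, 12 %, 10240 MiB\n"
    "29 %, 10 %, 10240 MiB\n"
)
print(gpu_is_starved(sample))  # averages ~33% utilization, so True
```

Pipe a minute of `nvidia-smi --query-gpu=... --format=csv -l 1` output through this before and after each fix to confirm the change moved the needle.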

Profiling recipes (PyTorch-focused, but principles are universal)

1) Quick PyTorch profiler run

Use the profiler to see where time goes at the operator level.

from torch.profiler import profile, record_function, ProfilerActivity

# `model` and `inputs` are your fine-tuning model and a representative batch
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_infer"):
        model(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

This tells you which ops dominate CUDA time and where CPU time is spent (e.g., tokenization, data transforms).

2) System-level GPU + CPU trace: NVIDIA Nsight Systems (nsys)

nsys profile -o profile_demo --trace=cuda,osrt,nvtx python train.py
nsys-ui profile_demo.qdrep  # open the timeline in the Nsight Systems GUI (newer versions emit .nsys-rep)

This shows GPU kernel launches, CPU threads, copy transfers (H2D/D2H), and where synchronizations occur.

3) Disk and I/O probing

  • iostat -x 1 10 # watch tps, MB_read/s, and await
  • iotop -o # active I/O per process
  • fio for synthetic benchmarking of disk: measure read/write throughput and latency.

If your dataset is on NFS, expect higher latencies — benchmark it.
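Before reaching for fio, a quick stdlib sanity check of sequential read throughput can tell you whether a mount is in the right ballpark. A rough sketch (the `read_throughput_mb_s` helper is ours; it writes and reads a scratch file, so point `dir` at the filesystem you actually train from):

```python
import os, time, tempfile

def read_throughput_mb_s(size_mb: int = 64, block_kb: int = 1024, dir: str = ".") -> float:
    """Write a scratch file, then time a sequential read of it; returns MB/s.
    Note: the OS page cache will flatter this number; fio with --direct=1 avoids that."""
    block = os.urandom(block_kb * 1024)
    with tempfile.NamedTemporaryFile(dir=dir, delete=False) as f:
        for _ in range(size_mb * 1024 // block_kb):
            f.write(block)
        path = f.name
    try:
        t0 = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_kb * 1024):
                pass  # sequential read, discard the bytes
        elapsed = time.perf_counter() - t0
        return size_mb / max(elapsed, 1e-9)
    finally:
        os.remove(path)
```

Run it once against local NVMe and once against the NFS mount; an order-of-magnitude gap is your cue to cache shards locally.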


Common symptoms and pragmatic fixes

  • Symptom: GPU utilization low, CPU 100%
    Fixes: Increase Dataloader num_workers, move CPU transforms to pre-processing pipeline, switch heavy transforms to vectorized NumPy/C++ code, use shared memory or memory-mapped files.

  • Symptom: GPU waiting on transfers (many small cudaMemcpy)
    Fixes: Use pinned (page-locked) memory, batch transfers, overlap H2D with compute via streams, use asynchronous dataloading, ensure prefetch_factor tuned.

  • Symptom: Dataloader workers starve GPU
    Fixes: Increase num_workers, use persistent_workers=True (PyTorch), pre-serialize transforms (cache pre-tokenized data), adopt WebDataset or LMDB/Arrow for sharded reads.

  • Symptom: Kernel launch overhead / low occupancy
    Fixes: Increase batch size, use mixed precision, fuse kernels, update CUDA/cuDNN, or use library optimizations (apex, torch.compile, custom Triton kernels).

  • Symptom: Storage-limited (low MB/s)
    Fixes: Move dataset to NVMe/SSD, use local cache on node, use parallel read formats (WebDataset tar shards), prefetch reads.
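Several of the fixes above (more workers, pinned memory, prefetching, async H2D copies) come together in one DataLoader configuration. A sketch assuming PyTorch; the worker and prefetch counts are starting points to tune, not gospel:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size=32):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=2,            # parallel CPU-side decode/tokenize; raise until the GPU stops starving
        persistent_workers=True,  # keep workers alive across epochs (avoids respawn cost)
        prefetch_factor=2,        # batches each worker keeps queued ahead of the GPU
        pin_memory=torch.cuda.is_available(),  # page-locked buffers enable fast async H2D copies
    )

# Toy dataset standing in for your tokenized corpus
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = make_loader(dataset)
device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    # non_blocking=True overlaps the H2D copy with compute when memory is pinned
    x = x.to(device, non_blocking=True)
    break
```

Change one of these knobs at a time and re-measure, as the checklist below insists.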


Tiny advanced notes (so you can sound intimidating in meetings)

  • PCIe vs NVLink: cross-GPU gradients or tensor shards transferred over PCIe are expensive — prefer NCCL and ensure NCCL uses NVLink if available.
  • eBPF-based tools like BCC help when you suspect kernel-level bottlenecks.
  • For multi-node, network saturation shows as low GPU utilization with low disk/CPU — profile NICs (ethtool, iftop) and check RDMA performance.

A minimal reproducible profiling checklist (copy-paste ready)

  1. Re-run with deterministic seed and small walltime to reproduce the slowdown.
  2. Collect system snapshot: nvidia-smi, top -b -n1, iostat -x 1 5.
  3. Run PyTorch profiler for 10–50 steps and inspect operator table.
  4. Run nsys to see H2D/D2H and kernel timelines.
  5. Replace dataset with synthetic random tensors — if speed jumps, it's I/O/CPU.
  6. Tweak one knob at a time (num_workers, batch_size, pin_memory) and re-measure.
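Step 5 of the checklist, swapping in synthetic data, is nearly a one-liner in PyTorch: generate random tensors with the same shapes and dtypes as your real batches. A sketch, with the shapes and vocab size chosen arbitrarily for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Mimic a real LM batch: (input_ids, attention_mask, labels) at seq length 512
n, seq_len, vocab = 1024, 512, 32000
synthetic = TensorDataset(
    torch.randint(0, vocab, (n, seq_len)),      # fake input_ids
    torch.ones(n, seq_len, dtype=torch.long),   # fake attention_mask
    torch.randint(0, vocab, (n, seq_len)),      # fake labels
)
loader = DataLoader(synthetic, batch_size=8)
# If throughput jumps with this loader, the real pipeline (I/O or CPU
# transforms) was the bottleneck, not the model.
```

Because the tensors live in RAM and need no decoding, any speedup you see is exactly the cost of your real data path.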

Closing — a dramatic but useful mic-drop

Performance optimization is not guesswork; it's measurement followed by targeted change. Treat profiling like diagnostics: collect evidence, form a hypothesis, change exactly one variable, and measure again. Be patient — small wins (pinned memory, better dataloader design, or tiny batch-size changes) compound into large savings on long training runs.

Final thought: the GPU is a sprinter with expensive shoes. If you want it to run, keep the track (I/O), starting blocks (CPU prep), and relay passes (PCIe/NVLink transfers) in good order.

Now go profile something and make your cluster sing. Or at least stop it from whispering.
