Scaling and Distributed Fine-Tuning (DeepSpeed, FSDP, ZeRO)
Advanced distributed training strategies to scale fine-tuning across multiple GPUs and nodes while managing memory, communication, and fault tolerance.
6.2 Data Parallelism vs Model Parallelism — The Cage Match You Actually Need
"If your GPU memory is full and your wallet is empty, it's time to get distributed... but not indiscriminately."
You just read 6.1's architecture tour, and earlier we squashed model sizes with quantization/pruning. Great! But remember: quantization shrinks parameter bytes, not the entire memory picture; optimizer states, activations, and the need to compute gradients still bite. This is where your distribution strategy (how you spread work across devices) decides whether training is feasible, fast, or just a glorified paperweight.
What this section answers
- What's the conceptual difference between Data Parallelism (DP) and Model Parallelism (MP)?
- When do you pick one, the other, or both?
- How do ZeRO, FSDP, DeepSpeed and hardware topology change the game?
Spoiler: the right answer is often 'both' — but with nuance.
Quick elevator definitions (no fluff)
- Data Parallelism (DP): each device holds a full copy of the model; different devices train on different mini-batches and you sync gradients.
- Model Parallelism (MP): the model's parameters are split across devices; each device computes parts of the forward/backward pass for the same batch.
DP is 'same model × different data'. MP is 'different model parts × same data'.
The practical trade-offs (aka why this matters)
- Memory footprint: DP requires each device to store a full model (plus optimizer/activations), while MP spreads parameters across devices.
- Communication pattern: DP typically uses AllReduce to sync gradients; MP requires fine-grained passes (AllGather, Send/Recv) during forward/backward. Different overheads, different latencies.
- Scalability: DP scales well with more data and GPUs until memory becomes the bottleneck. MP scales models beyond single-GPU memory but introduces pipeline or tensor sync complexities.
- Implementation complexity: DP is easiest to implement. MP — especially tensor- and pipeline-parallel hybrids — is harder and more fragile.
Table: Head-to-head summary
| Aspect | Data Parallelism | Model Parallelism (TP/PP) | Hybrid (DP + MP, ZeRO, FSDP) |
|---|---|---|---|
| Memory per device | High (full model) | Lower (sharded model) | Lowest (shards + optimizer sharding) |
| Communication | Gradient AllReduce each step | Activations/grad slices between devices | Mix: AllReduce + AllGather/ReduceScatter |
| Latency sensitivity | Medium (depends on sync) | High (pipeline bubbles/tensor waits) | Complex: needs topology-aware orchestration |
| Best for | Many GPUs, smaller models, big batch sizes | Huge models that don't fit on single GPU | Training gigantic models with optimizer state reduction |
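To make the table's memory column concrete, here is a back-of-envelope calculator (a rough sketch: the 16-bytes-per-parameter figure assumes mixed-precision Adam, and activations are deliberately ignored since they depend on batch size and sequence length):

```python
def per_gpu_memory_gb(n_params, n_gpus, sharded=False):
    """Rough per-GPU memory for params + grads + Adam optimizer states.

    Mixed-precision Adam costs ~16 bytes/param:
      2 (fp16 params) + 2 (fp16 grads) + 4 (fp32 master copy)
      + 4 (fp32 momentum) + 4 (fp32 variance).
    Plain DP replicates all of it on every GPU; ZeRO-3/FSDP shards it.
    Activations are workload-dependent and not counted here.
    """
    bytes_per_param = 16
    total_bytes = n_params * bytes_per_param
    if sharded:
        total_bytes /= n_gpus
    return total_bytes / 1e9

# A 7B-parameter model on 8 GPUs:
print(per_gpu_memory_gb(7e9, 8))                # plain DP: 112.0 GB per GPU -> doesn't fit
print(per_gpu_memory_gb(7e9, 8, sharded=True))  # ZeRO-3/FSDP: 14.0 GB per GPU
```

The same arithmetic explains the "Lowest" cell for hybrids: sharding divides the replicated 16 bytes/param by the number of GPUs.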
Real-world analogies (because metaphors stick)
- DP is like having 8 bakers each making a different batch of cookies with the same recipe; at the end they swap notes on mistakes (gradients) and update the recipe.
- TP (tensor-parallel) is like splitting a giant cake recipe step across chefs — one mixes flour, another sugar — they have to pass the batter between them fast or the cake sits.
- PP (pipeline-parallel) is an assembly line: chef 1 preps and hands off to chef 2; unless you keep feeding small micro-batches, chefs sit idle waiting for work, and those idle gaps are the pipeline "bubbles".
DeepSpeed / ZeRO / FSDP: where the magic plugs in
- ZeRO (Zero Redundancy Optimizer): shards optimizer states, gradients, and parameters to reduce per-GPU memory. ZeRO Stage 3 + DeepSpeed basically lets DP operate on models that otherwise wouldn't fit.
- FSDP (Fully Sharded Data Parallel): PyTorch-native approach that shards parameters and optimizer states across GPUs while preserving DP-style semantics.
- DeepSpeed: offers ZeRO plus offloading and other optimizations. Pairing with MP strategies is common for 100B+ models.
Why mention these? Because ZeRO and FSDP make DP viable for massive models by removing redundant memory, thus shifting your trade-offs: you can get DP-like simplicity with memory that used to force MP.
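For flavor, here is a representative DeepSpeed config sketch enabling ZeRO Stage 3 with CPU optimizer offload. The field names follow DeepSpeed's JSON config schema; the batch sizes and toggles are illustrative placeholders, not tuned recommendations:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "offload_optimizer": { "device": "cpu" }
  }
}
```

The appeal is exactly the point above: your training loop keeps DP semantics, and the sharding/offloading happens inside the engine.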
When to choose what — pragmatic checklist
- If your model fits comfortably (with activations/optimizer) on each GPU: start with Data Parallelism + mixed precision + gradient accumulation. Simple wins.
- If model parameters fit but optimizer states + activations overflow: use ZeRO Stage 2/3 or FSDP to shard optimizer & params, keep DP semantics.
- If even with ZeRO/FSDP you can't fit a single replica: consider Model Parallelism (tensor parallelism for individual layers, pipeline parallelism for layer stacks), or a hybrid: TP/PP for model partition + ZeRO for optimizer sharding.
- If you have limited interconnect (slow network): favor approaches that minimize cross-node sync (e.g., reduce allreduce frequency, gradient accumulation, or bigger micro-batches). Hint: MP hurts more on slow links.
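The gradient-accumulation item in the checklist is easy to verify numerically: summing per-micro-batch gradients, weighted by each micro-batch's share of the full batch, reproduces the full-batch gradient for a linear model. This is why accumulation lets you trade memory for extra steps. A toy NumPy sketch (not framework code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))   # full batch of 64 examples, 4 features
y = rng.normal(size=64)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # gradient of mean squared error 0.5*mean((Xb@w - yb)^2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# full-batch gradient in one go
g_full = grad(X, y, w)

# the same gradient accumulated over 4 micro-batches of 16
g_acc = np.zeros_like(w)
for i in range(0, 64, 16):
    g_acc += grad(X[i:i+16], y[i:i+16], w) * (16 / 64)  # weight by micro-batch fraction

print(np.allclose(g_full, g_acc))  # True: accumulation matches the big batch
```

In a real framework the weighting is usually done by scaling the loss before `backward()`, but the arithmetic is the same.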
Performance tips & gotchas
- Use gradient accumulation to increase effective batch size when you must reduce per-step batch size on limited GPU RAM.
- Watch communication patterns: optimized AllReduce algorithms (ring, tree) keep per-GPU bandwidth cost roughly flat as worker count grows, but latency terms still add up each step. Tensor and pipeline parallelism need lower-latency links (NVLink / NVSwitch / RDMA).
- Micro-batch & pipeline bubble: pipeline parallelism needs careful micro-batch tuning and activation checkpointing to hide latency.
- Quantization helps, but not magic: quantizing parameters reduces model bytes, but optimizer states/activations can still overflow memory. ZeRO + quantization = better combined effect.
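The pipeline-bubble point above can be quantified: with p stages and m micro-batches, the idle ("bubble") fraction of a GPipe-style schedule is (p - 1) / (m + p - 1), which is why adding micro-batches hides the bubble. A small sketch using that standard formula:

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of a GPipe-style pipeline schedule: (p-1)/(m+p-1)."""
    return (stages - 1) / (micro_batches + stages - 1)

# 4 pipeline stages:
print(round(bubble_fraction(4, 1), 3))   # 0.75  -> one micro-batch wastes most of the pipe
print(round(bubble_fraction(4, 16), 3))  # 0.158 -> 16 micro-batches shrink the bubble
print(round(bubble_fraction(4, 64), 3))  # 0.045
```

Interleaved schedules do better than this formula, but the qualitative lesson (more micro-batches, smaller bubble) holds.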
Mini pseudocode: DP vs MP
Data-parallel step (very simplified):

```python
# each worker holds a full model copy and sees a different mini-batch
optimizer.zero_grad()
pred = model(batch)
loss = loss_fn(pred, labels)
loss.backward()
allreduce_gradients(model)  # average grads across workers so replicas stay identical
optimizer.step()
```
Model-parallel tensor split (toy, column-parallel):

```python
# layer weight W is split column-wise across devices; each holds W_part
x_part = matmul(x, W_part)  # every device computes its slice of the output
# gather the slices before the next layer (or keep them sharded if the next op allows)
x = all_gather(x_part)
```
Note: actual MP often uses fused ops and careful comm scheduling to reduce synchronization overhead.
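The toy above can be checked end-to-end on one machine: splitting a weight matrix column-wise across pretend devices, computing partial matmuls, and concatenating the slices (a stand-in for all-gather) reproduces the unsharded result. A NumPy sketch; real tensor parallelism uses NCCL collectives and fused kernels:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(8, 32))    # batch of activations
W = rng.normal(size=(32, 64))   # full layer weight

# "shard" W column-wise across 4 pretend devices
shards = np.split(W, 4, axis=1)

# each device computes its slice of the output using the full input
parts = [x @ W_part for W_part in shards]

# all_gather stand-in: concatenate the output slices along the hidden dim
x_out = np.concatenate(parts, axis=1)

print(np.allclose(x_out, x @ W))  # True: sharded compute matches the full matmul
```

Row-wise splits work analogously, except the input is sharded and the combine step is a reduce (sum) rather than a gather.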
Quick hardware note
If you're on a single machine with NVLink/NVSwitch, model parallelism performs far better than it does over slow Ethernet. For multi-node clusters, invest in RDMA/InfiniBand or pick ZeRO/FSDP to stay in the DP-friendly lane.
Final TL;DR (so you don’t miss the point)
- Start simple: DP + mixed precision + gradient accumulation. If memory fits, you win on simplicity.
- When memory blows up, reach for ZeRO / FSDP — they give you the best of DP semantics and sharded memory.
- Use Model Parallelism when a single replica literally cannot hold the model even with sharding, and be prepared for nontrivial engineering (tensor/pipeline parallelism, comm scheduling, and network dependence).
- Combine techniques: quantization reduces parameter sizes, ZeRO/FSDP remove optimizer and parameter redundancy, and model parallelism stretches a model across devices. They are friends, not enemies.
"Distributed training is a toolbox, not a religion. Pick the hammer, saw, and nail gun that actually build your model — and stop trying to force a screwdriver."
Key next steps: test with a small cluster, profile memory/comm hotspots, then gradually move from DP -> ZeRO/FSDP -> hybrid MP as needed. Happy tuning — may your gradients converge and your GPUs remain cool.