3D Parallelism
A 175B model does not fit on one GPU. It does not fit on one node either. It does not even fit cleanly on one rack. The way it does fit is by splitting the model along three axes at once: tensor parallel inside the node, pipeline parallel across nodes within a pod, and data parallel across pods. That is 3D parallelism.
What "3D" actually means
3D parallelism is the composition of three orthogonal parallelism axes:
- Tensor parallelism (TP): shard each matmul. See tensor parallelism.
- Pipeline parallelism (PP): shard the layer stack. See pipeline parallelism.
- Data parallelism (DP): replicate the model and process different batches. Often combined with FSDP/ZeRO sharding to keep memory tractable.
The world size of a 3D parallel training run is TP * PP * DP. Each GPU is uniquely identified by its position on these three axes: which TP rank within its TP group, which PP stage, and which DP replica.
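To make the rank bookkeeping concrete, here is a minimal sketch of the mapping from a global rank to its three coordinates. The ordering (TP varies fastest, then DP, then PP) is an assumption, not a rule: real frameworks such as Megatron-LM let you choose the order, and the choice determines which ranks end up sharing a node.

```python
def rank_to_coords(rank: int, tp: int, dp: int, pp: int):
    """Map a global rank to (tp_rank, dp_rank, pp_rank).

    Assumes TP varies fastest, then DP, then PP -- one common
    convention; the ordering is configurable in practice and it
    changes which ranks share an NVLink domain.
    """
    assert 0 <= rank < tp * dp * pp, "rank outside world size"
    tp_rank = rank % tp
    dp_rank = (rank // tp) % dp
    pp_rank = rank // (tp * dp)
    return tp_rank, dp_rank, pp_rank

# The GPT-3-style factorization cited below: TP=8, PP=8, DP=24 -> 1536 ranks
print(rank_to_coords(1535, tp=8, dp=24, pp=8))  # (7, 23, 7)
```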
For GPT-3 (175B) on 1536 V100s, the configuration was TP=8 PP=8 DP=24. For Megatron-Turing NLG (530B), it was TP=8 PP=35 DP=8 on 2240 A100s. For Llama 2 70B, TP=8 PP=4 DP=64 on 2048 A100s. The exact factorization depends on the model size, the cluster topology, and the global batch size, but the pattern of "factor world_size into three" holds.
How the axes match the fabric tiers
Each axis has different communication characteristics, and matching axis to fabric tier is the placement rule that makes 3D parallelism practical:
- TP collectives are per-layer all-reduces on the forward and backward critical path: frequent, latency-sensitive, and bandwidth-hungry in aggregate. They go on NVLink. TP groups must fit inside one NVLink domain.
- PP sends are point-to-point, small to medium size, latency-tolerant. They go on InfiniBand. PP stages can span nodes within a fabric pod.
- DP all-reduces (or FSDP all-gather + reduce-scatter) are large, less frequent, latency-tolerant. They go on InfiniBand. DP replicas can span the entire cluster.
This three-tier matching is exactly the topology-aware placement rule generalized. The TP * PP * DP = world_size factorization is bounded above by the fabric: TP cannot exceed the NVLink domain (8 on HGX H100, 72 on NVL72), PP cannot reasonably exceed one rail-optimized pod (because the per-stage activation traffic crosses the spine), and DP fills the rest.
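As a sketch of how these bounds prune the search space, the function below enumerates factorizations of a world size that respect the fabric tiers. The `nvlink_domain` and `max_pp` bounds are illustrative assumptions, not universal constants:

```python
def feasible_configs(world_size: int, nvlink_domain: int = 8,
                     max_pp: int = 16):
    """Enumerate (TP, PP, DP) with TP * PP * DP == world_size, where
    TP fits in one NVLink domain and PP stays under an assumed
    pod-size bound. Whatever is left over becomes DP."""
    out = []
    tp = 1
    while tp <= nvlink_domain:
        if world_size % tp == 0:
            rest = world_size // tp
            for pp in range(1, max_pp + 1):
                if rest % pp == 0:
                    out.append((tp, pp, rest // pp))
        tp *= 2
    return out

# 2048 GPUs: includes (8, 4, 64), the Llama 2 70B layout cited above
print((8, 4, 64) in feasible_configs(2048))  # True
```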
How the math constraints stack up
For a 70B model in BF16 at TP=8 PP=4 DP=64:
- TP=8 splits each matmul. 70B parameters in BF16 is ~140 GB of weights; per GPU: 140 / 8 = 17.5 GB before PP.
- PP=4 splits the layer stack. Per GPU weight memory: 17.5 / 4 = ~4.4 GB. With BF16 gradients, FP32 master weights, and Adam moments (~16 bytes/param in total): ~35 GB. Fits comfortably on an 80 GB H100, leaving room for activations (see the sketch after this list).
- DP=64 replicates and reduces gradients. Replication does not save memory; it multiplies throughput.
- Global batch size = local_batch * num_micro_batches * DP. For DP=64 and a local micro-batch of 4 with m=64 micro-batches: global batch = 4 * 64 * 64 = 16384 sequences.
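A back-of-envelope version of this arithmetic, as a sketch: it assumes BF16 weights and gradients plus FP32 master weights and Adam moments (16 bytes/param total) and deliberately ignores activations.

```python
def per_gpu_memory_gb(params_billion: float, tp: int, pp: int):
    """Weight and optimizer-state memory per GPU, in GB.

    Assumes BF16 weights (2 B) + BF16 gradients (2 B) + FP32 master
    weights (4 B) + Adam moments (4 B + 4 B) = 16 bytes/param.
    Activation memory is ignored -- see the section below.
    """
    shard = params_billion / (tp * pp)  # billions of params per GPU
    return {"weights_gb": shard * 2, "with_optimizer_gb": shard * 16}

def global_batch(micro_batch: int, num_micro_batches: int, dp: int):
    """Global batch size in sequences: local_batch * m * DP."""
    return micro_batch * num_micro_batches * dp

print(per_gpu_memory_gb(70, tp=8, pp=4))  # ~4.4 GB weights, ~35 GB total
print(global_batch(4, 64, 64))            # 16384
```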
Tweaking any of these factors ripples through the others. Doubling DP (at a fixed per-replica batch) doubles the global batch, halving the optimizer steps for a fixed token budget, but adds gradient-sync traffic and risks diminishing returns past the critical batch size. Halving PP doubles the per-stage weight memory but roughly halves the pipeline bubble. Picking the right factorization is a multi-variable optimization, and the right answer depends on the cluster, the model architecture, and the target time-to-train.
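The bubble tradeoff in particular has a closed form: for GPipe/1F1B-style schedules with m micro-batches and p stages, the idle fraction is (p - 1) / (m + p - 1). A quick sketch:

```python
def bubble_fraction(pp: int, m: int) -> float:
    """Pipeline idle fraction for GPipe/1F1B-style schedules."""
    return (pp - 1) / (m + pp - 1)

# Halving PP from 8 to 4 at m=64 cuts the bubble from ~9.9% to ~4.5%
print(bubble_fraction(8, 64), bubble_fraction(4, 64))
```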
What 3D parallelism does not solve
3D parallelism does not address activation memory. For long contexts, activation memory dominates regardless of TP/PP/DP factors. The standard fix is to add sequence parallelism on top, making it 4D: TP * SP * PP * DP. For MoE models, expert parallelism adds another axis: TP * EP * PP * DP. The Megatron-LM 24.05 release supports up to 6D parallelism (TP * SP * EP * CP * PP * DP) for very large MoE models with long contexts.
3D parallelism also does not address straggler sensitivity. The PP axis is sensitive to per-stage latency (a 5% slow stage = 5% slow step). The DP axis is less sensitive (slow ranks slow only their own batch, modulo gradient sync). The TP axis is also straggler-sensitive (every all-reduce waits for the slowest rank). At 1000+ GPU scale, stragglers and blast radius become the second-order concern after parallelism choice.
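A toy model makes the PP sensitivity concrete. Assuming a GPipe-style schedule where step time is approximately (m + p - 1) times the slowest stage (a simplification that ignores communication overlap):

```python
def pipeline_step_time(stage_times, m: int) -> float:
    """Approximate step time for m micro-batches through p stages:
    in steady state every stage waits on the slowest one, so
    step_time ~= (m + p - 1) * max(stage_times)."""
    return (m + len(stage_times) - 1) * max(stage_times)

fast = pipeline_step_time([1.0, 1.0, 1.0, 1.0], m=64)
slow = pipeline_step_time([1.0, 1.0, 1.05, 1.0], m=64)  # one 5% straggler
print(slow / fast)  # 1.05 -- a 5% slow stage is a 5% slow step
```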
What this means in practice
- Default: TP=8 inside one HGX H100 (or TP=72 inside NVL72 if the model demands it), PP factors small enough that bubble fraction stays under 10%, DP fills the rest.
- For models 100B-700B: 3D parallelism is the dominant strategy. Add SP for long context, EP for MoE. Megatron-LM and DeepSpeed both implement these compositions.
- For models 13B-70B: TP + DP (with FSDP) is often enough. Skip PP unless the model does not fit in one node's memory even at full TP.
- For models under 13B: plain DP with FSDP/ZeRO is usually all you need. 3D parallelism is overkill.
- The hardest part is verification. After launching a 3D parallel run, check `nvidia-smi topo -m` for every TP group's NVLink connectivity, NCCL's startup log for which transport each communicator uses, and the per-stage timing in profiler output. Any one of these can be wrong silently. A TP-locality check is sketched after this list.
- For orchestration: 3D parallelism imposes hard placement constraints. Slurm's `--switches=N` and `--ntasks-per-node=8` (or k8s topology hints) must enforce the layout, not advise it.
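For the TP-locality part of that verification, a minimal check with torch.distributed; `tp_group` is assumed to be the process group your framework created for tensor parallelism:

```python
import socket
import torch.distributed as dist

def assert_tp_group_local(tp_group) -> None:
    """Fail fast if a TP group spans hosts, i.e. it escaped the
    NVLink domain. Gathers each rank's hostname over the group."""
    hostnames = [None] * dist.get_world_size(group=tp_group)
    dist.all_gather_object(hostnames, socket.gethostname(), group=tp_group)
    if len(set(hostnames)) != 1:
        raise RuntimeError(f"TP group spans hosts: {sorted(set(hostnames))}")
```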
3D parallelism is the configuration that made 100B+ models trainable. Every large model since GPT-3 uses some variant of it. Knowing how to factor TP * PP * DP into your cluster is one of the highest-impact skills in large-model training.
See also
- Tensor parallelism
- Pipeline parallelism
Updated 2026-05-10