Pipeline Parallelism
A 70B model has 80 layers, give or take. If you split those layers into 4 chunks of 20 layers each, and put each chunk on a different GPU (or group of GPUs), you have pipeline parallelism. The trick is keeping all 4 stages busy at once.
What gets sharded and what flows
Pipeline parallelism (PP) shards the layer stack. Stage 0 owns layers 0 through 19, stage 1 owns 20-39, and so on. Activations flow forward stage-to-stage during the forward pass; gradients flow backward stage-to-stage during the backward pass. The communication between stages is point-to-point: stage k sends its output to stage k+1, with no broadcast or reduction needed. This makes PP much friendlier to InfiniBand than tensor parallelism: a few hundred microseconds of P2P latency per stage boundary is tolerable, especially since each stage runs a sizable chunk of compute between hops.
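As a minimal sketch of that hop, assuming torch.distributed is already initialized with one rank per stage; the function name, the act_shape argument, and the rank-equals-stage mapping are illustrative, not any framework's API:

```python
# Sketch of one stage's forward hop under pipeline parallelism.
# Assumes torch.distributed is initialized with one rank per stage;
# all names here are illustrative, not a framework API.
import torch
import torch.distributed as dist

def stage_forward(stage_layers, stage_id, num_stages, act_shape):
    if stage_id == 0:
        x = torch.randn(act_shape)      # first stage ingests the batch
    else:
        x = torch.empty(act_shape)
        dist.recv(x, src=stage_id - 1)  # point-to-point, no collective
    for layer in stage_layers:          # this stage's contiguous slice
        x = layer(x)
    if stage_id < num_stages - 1:
        dist.send(x, dst=stage_id + 1)  # hand off to the next stage
    return x
```

The backward pass mirrors this, with gradients flowing point-to-point from stage k+1 back to stage k.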
The catch is that a single forward pass through 4 stages costs 4 sequential stage times. If you only had one batch in flight, stage 1 would sit idle while stage 0 did its work, and so on. PP earns its keep by splitting each batch into many smaller "micro-batches" and pipelining them through the stages.
The bubble
The fundamental cost of PP is the bubble: the time at the start of a step when later stages have not received any inputs yet, plus the symmetric time at the end when earlier stages have already finished. For a P-stage pipeline running m micro-batches, the bubble fraction of total step time is roughly (P - 1) / m. With P=4 and m=8, that is 3/8 = 37.5% idle time, which is a brutal hit. With P=4 and m=64, it falls to 3/64 = 4.7%, which is fine.
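The arithmetic is simple enough to keep as a one-line helper; plain Python, checked against the numbers above:

```python
# Bubble fraction (P - 1) / m for a P-stage pipeline and m micro-batches.
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    return (num_stages - 1) / num_microbatches

print(bubble_fraction(4, 8))    # 0.375    -> 37.5% idle
print(bubble_fraction(4, 64))   # 0.046875 -> ~4.7% idle
```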
The way to shrink the bubble is to use more micro-batches than stages, by a healthy margin. The cost is memory: each micro-batch in flight needs its forward-pass activations held until its backward pass. Megatron-LM and DeepSpeed both implement the "1F1B" (one-forward-one-backward) schedule, where each stage alternates between the next micro-batch's forward and the oldest outstanding micro-batch's backward. This caps activation memory at P micro-batches per stage instead of m, which is what makes high micro-batch counts tractable.
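A sketch of the per-stage op order under the textbook non-interleaved 1F1B schedule: warmup forwards, a steady one-forward-one-backward phase, then cooldown backwards. It generates the sequence only; communication and overlap are out of scope, and the function name is illustrative:

```python
# 1F1B op order for one stage: ("F", i) / ("B", i) over micro-batch ids.
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
    warmup = min(num_stages - 1 - stage, num_microbatches)
    ops = [("F", i) for i in range(warmup)]              # warmup forwards
    for i in range(num_microbatches - warmup):           # steady 1F1B phase
        ops.append(("F", warmup + i))
        ops.append(("B", i))
    ops += [("B", i) for i in range(num_microbatches - warmup,
                                    num_microbatches)]   # cooldown backwards
    return ops

# Stage 0 holds at most num_stages forwards in flight; the last stage
# alternates strictly from the first micro-batch.
print(one_f_one_b(stage=0, num_stages=4, num_microbatches=6))
```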
Newer variants (interleaved 1F1B, virtual pipeline) split each stage into multiple "virtual stages," so the same physical GPU runs multiple non-contiguous chunks of the layer stack. This further reduces the bubble at the cost of more cross-stage communication. The DeepSpeed-Megatron 3D parallelism papers benchmark this; the residual bubble can be brought under 5% on most production setups.
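A toy layer assignment makes "virtual stages" concrete: with 80 layers, 4 physical stages, and 2 virtual stages per GPU, GPU 0 runs layers 0-9 and 40-49 instead of a contiguous 0-19 block. The round-robin chunking below is a simplification of the interleaved scheme, assuming even divisibility:

```python
# Round-robin chunk assignment for virtual pipeline stages (simplified).
def virtual_stage_layers(gpu, num_gpus, num_layers, virtual_per_gpu):
    chunk = num_layers // (num_gpus * virtual_per_gpu)
    return [list(range(c * chunk, (c + 1) * chunk))
            for c in range(gpu, num_gpus * virtual_per_gpu, num_gpus)]

# GPU 0 owns two non-contiguous chunks: layers 0-9 and 40-49.
print([(c[0], c[-1]) for c in virtual_stage_layers(0, 4, 80, 2)])
```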
Why PP fights with batch size
The micro-batch count m is bounded by the global batch size and the data-parallel degree. If you train at global batch 1024 with DP=64 and PP=4, each DP rank gets 16 samples, which split into at most m=16 micro-batches per step. With m=16 and P=4, the bubble fraction is 3/16 = 18.75%. To shrink the bubble, you either grow the global batch (which has its own optimization implications) or shrink DP; going the other way, raising the per-micro-batch size improves per-GPU efficiency but leaves fewer micro-batches and a bigger bubble.
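The worked numbers, spelled out (micro-batch size 1 assumed):

```python
global_batch, dp, pp = 1024, 64, 4
samples_per_rank = global_batch // dp   # 16 samples per DP rank
m = samples_per_rank                    # at micro-batch size 1
print(m, (pp - 1) / m)                  # 16 0.1875 -> 18.75% bubble
```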
The bubble math gets tight at very large model scale. GPT-3 trained with PP=8, which makes the bubble 7/m of the step. Megatron-LM's GPT-3 training run used m around 64 to get the bubble below 11%. Llama 2 70B trained with PP=4 and m=128 to get below 3%. The bubble is the reason "more pipeline stages" is not always cheap; you need correspondingly more micro-batches, which costs activation memory and changes optimization dynamics.
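The same (P-1)/m arithmetic, checked against these runs:

```python
print(7 / 64)    # 0.109375  -> ~11% bubble at PP=8, m=64
print(3 / 128)   # 0.0234375 -> ~2.3% bubble at PP=4, m=128
```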
What PP does well
- It scales weight memory linearly. P=4 stages means each stage holds 1/4 of the model weights. Combined with tensor parallelism inside each stage, you get TP * PP scaling: TP=8 PP=4 puts a 70B model on 32 GPUs (a back-of-envelope sketch follows this list).
- It tolerates IB. The cross-stage P2P sends are small relative to the per-stage compute; placing PP across nodes is fine. See topology-aware placement for the placement rule.
- It is orthogonal to TP and DP. The same training run can use TP for intra-node sharding, PP across nodes for layer sharding, and DP for replication, multiplied together as 3D parallelism.
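The back-of-envelope sketch referenced in the first bullet, assuming bf16 weights at 2 bytes per parameter and counting weights only, with optimizer state, gradients, and activations excluded:

```python
# Per-GPU weight footprint under TP * PP sharding: bf16 weights only.
params = 70e9
tp, pp = 8, 4
gib = params * 2 / (tp * pp) / 2**30
print(f"{gib:.1f} GiB of weights per GPU")   # ~4.1 GiB
```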
What PP does poorly
- The bubble is real. At m=P, the bubble fraction is (P-1)/P, which is 75% for P=4 and unworkable; you need m several times P before it becomes tolerable. PP therefore forces a high micro-batch count, which forces a large global batch.
- Activation memory in flight scales with P * micro_batch_size. Long sequences and large hidden dimensions push activation memory up faster than the weight savings come down; see the sketch after this list.
- Stragglers in PP are catastrophic. If stage 2 runs 5% slower than the others, the whole pipeline runs 5% slower, because it moves at the speed of its slowest stage. See thermal stragglers and stragglers and blast radius.
- Convergence dynamics are subtle. Synchronous schedules such as 1F1B accumulate gradients and update weights only at step boundaries, so they match plain data parallelism; asynchronous schedules (PipeDream-style) compute some gradients against weights that are several micro-batches stale. Staleness is mostly fine for transformer training, but it is a real source of optimizer noise.
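The activation sketch promised in the list above. It counts only the bf16 residual-stream tensor per layer, so it understates real usage (attention scores and MLP intermediates add multiples of this); the 70B-ish shape numbers are illustrative:

```python
# Activations in flight under 1F1B: up to pp micro-batches per stage,
# each holding roughly one (micro_batch, seq, hidden) tensor per layer.
def inflight_activation_gib(pp, layers_per_stage, micro_batch, seq, hidden):
    per_microbatch = layers_per_stage * micro_batch * seq * hidden * 2  # bf16
    return pp * per_microbatch / 2**30

# 20 layers per stage, micro-batch 1, 4k context, hidden 8192
print(f"{inflight_activation_gib(4, 20, 1, 4096, 8192):.1f} GiB")  # 5.0 GiB
```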
What this means in practice
- Choose PP degree based on memory pressure (how much weight does not fit in TP-sized groups), not throughput. PP buys memory; TP buys throughput.
- Pick the micro-batch count m to keep (P-1)/m under 10%. For P=4, that means m of at least 32; for P=8, at least 72. A helper for this rule follows the list.
- Use 1F1B scheduling, not GPipe's naive forward-then-backward. 1F1B caps activation memory at P micro-batches in flight rather than m.
- For long sequences, watch activation memory. PP plus long context plus large micro-batches is the configuration that runs out of HBM most often. Sequence parallelism and activation checkpointing both help.
- Place PP stages across nodes, since the P2P sends tolerate IB latency. Verify with nvidia-smi topo -m that ranks within the same TP group are co-located on the same NVLink domain.
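The helper promised in the first bullet above. Rounding m up to a multiple of P is a convenience so micro-batches divide evenly across the schedule, not a hard requirement:

```python
import math

# Smallest m with (P - 1) / m at or under the target bubble fraction,
# rounded up to a multiple of P.
def min_microbatches(pp: int, target: float = 0.10) -> int:
    m = math.ceil((pp - 1) / target)
    return math.ceil(m / pp) * pp

print(min_microbatches(4))   # 32
print(min_microbatches(8))   # 72
```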
PP is the parallelism axis you reach for when TP has run out of NVLink domain. It is cheaper in bandwidth, expensive in memory, and only good if you can feed it enough micro-batches.
Updated 2026-05-10