Scale Atlas · Chapter 5 · 86 terms · Updated 2026-05-10
Parallelism
How a model splits across GPUs. Tensor parallelism shards a single matmul, pipeline parallelism shards the layer stack, FSDP and ZeRO shard optimizer state (and, at higher stages, gradients and parameters), sequence parallelism shards the tokens, expert parallelism shards the experts. 3D parallelism combines several of them. Each strategy chooses which collective bill you pay.
3D Parallelism
3D parallelism combines TP × PP × DP in one training run: TP inside an NVLink domain, PP across nodes in a pod, DP across pods. Used by GPT-3, Llama 2 70B, and Megatron-Turing NLG.
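A minimal sketch of the layout arithmetic, assuming the common convention that TP is the fastest-varying axis so a TP group stays inside one NVLink domain (the function name and axis order are illustrative, not from any specific framework):

```python
# Hypothetical helper: map a flat global rank to (dp, pp, tp) coordinates,
# with TP fastest-varying so consecutive ranks share an NVLink domain.
def rank_to_coords(rank, tp, pp, dp):
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp                 # position inside the TP group
    pp_rank = (rank // tp) % pp         # pipeline stage
    dp_rank = rank // (tp * pp)         # data-parallel replica
    return dp_rank, pp_rank, tp_rank

# 64 GPUs as TP=8 (one node), PP=4, DP=2:
print(rank_to_coords(0, 8, 4, 2))    # (0, 0, 0)
print(rank_to_coords(13, 8, 4, 2))   # (0, 1, 5)
```

Ranks 0-7 form one TP group on one node; ranks 0, 8, 16, 24 form one pipeline; the two 32-GPU pods are the DP replicas.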
Expert Parallelism
Expert parallelism routes each token to its top-K experts via all-to-all dispatch. Two all-to-alls per MoE layer (dispatch and combine) make EP bisection-bandwidth-bound.
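A hedged sketch of the dispatch bookkeeping, assuming experts are laid out contiguously across EP ranks (the function and its names are illustrative): for each token, take the top-K router scores, then count how many tokens each rank will receive — the counts that the dispatch all-to-all must move.

```python
# Illustrative MoE dispatch accounting: top-K routing, then per-rank
# token counts for the dispatch all-to-all.
def dispatch_counts(scores, k, experts_per_rank):
    counts = {}
    for token_scores in scores:  # one router score per expert, per token
        topk = sorted(range(len(token_scores)),
                      key=lambda e: token_scores[e], reverse=True)[:k]
        for e in topk:
            rank = e // experts_per_rank   # contiguous expert placement
            counts[rank] = counts.get(rank, 0) + 1
    return counts

# 3 tokens, 4 experts, top-2, 2 experts per EP rank:
scores = [[0.9, 0.1, 0.5, 0.2],
          [0.1, 0.8, 0.2, 0.7],
          [0.3, 0.2, 0.6, 0.9]]
print(dispatch_counts(scores, k=2, experts_per_rank=2))
```

Skewed counts like these are why MoE training adds load-balancing losses: a hot expert rank becomes the straggler for the whole all-to-all.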
FSDP vs ZeRO Sharding
ZeRO-1 shards optimizer state, ZeRO-2 adds gradients, ZeRO-3 adds parameters. PyTorch FSDP with full sharding ≈ ZeRO-3. Costs an all-gather of parameters per layer in forward (usually again in backward) plus a reduce-scatter of gradients.
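The stages are easiest to see as bytes per parameter. A minimal model, assuming the standard mixed-precision accounting (fp16 params and grads at 2 bytes each, fp32 Adam state — master copy plus two moments — at 12 bytes):

```python
# Rough per-GPU memory model in bytes per parameter, mixed-precision Adam.
# Assumes fp16 params/grads (2 B each) and fp32 optimizer state (12 B).
def bytes_per_param(stage, dp):
    p, g, opt = 2.0, 2.0, 12.0
    if stage >= 1: opt /= dp   # ZeRO-1: shard optimizer state
    if stage >= 2: g /= dp     # ZeRO-2: also shard gradients
    if stage >= 3: p /= dp     # ZeRO-3 / FSDP full shard: also shard params
    return p + g + opt

for stage in range(4):
    print(stage, bytes_per_param(stage, dp=8))
```

At DP=8 the 16 B/param baseline drops to 5.5 (ZeRO-1), 3.75 (ZeRO-2), and 2.0 (ZeRO-3); activations are extra and are what sequence parallelism attacks.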
Pipeline Parallelism
Pipeline parallelism splits the layer stack across P stages. Different micro-batches run on different stages. Bubble overhead (idle time over ideal compute time) = (P-1)/m, where m is the micro-batch count.
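The formula falls out of counting schedule steps. A minimal sketch in "stage-steps": with P stages and m micro-batches, the pipeline runs for m + P - 1 steps, but only m of them do useful work on any one stage.

```python
# Count steps in an idealized pipeline schedule: m micro-batches flow
# through P stages; fill and drain cost P-1 idle steps per stage.
def bubble_fraction(P, m):
    total_steps = m + P - 1   # fill + steady state + drain
    ideal_steps = m           # steps actually computing on a stage
    return (total_steps - ideal_steps) / ideal_steps

print(bubble_fraction(P=4, m=16))   # 3/16 = 0.1875
```

This is why deep pipelines want many micro-batches: at m = P the bubble is nearly 100% overhead, while m >> P drives it toward zero at the cost of smaller per-micro-batch matmuls.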
Sequence Parallelism
Sequence parallelism splits a single sequence's tokens across GPUs. Activation memory shrinks linearly with SP degree. Ring attention rotates K/V chunks around the SP ring.
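A hedged sketch of the ring-attention communication pattern (indices only, no attention math): K/V chunks hop one rank per step, so over SP steps every rank's queries see every chunk exactly once.

```python
# Illustrative ring schedule: schedule[step][rank] = which K/V chunk
# that rank holds at that step. Chunks rotate one hop per step.
def kv_schedule(sp):
    return [[(rank - step) % sp for rank in range(sp)] for step in range(sp)]

for row in kv_schedule(4):
    print(row)
```

Reading down any column, each rank visits all SP chunks with no repeats — the invariant that lets attention over the full sequence be assembled from per-chunk partial results.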
Tensor Parallelism
Tensor parallelism shards a single matmul across N GPUs by splitting the weight matrix along rows or columns. A column-split matmul needs no communication; the row-split matmul that follows costs one all-reduce. Lives on NVLink.
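A minimal single-host sketch of the Megatron-style column-then-row split, with the all-reduce emulated as a plain sum (shapes and the two-layer MLP structure are illustrative):

```python
import numpy as np

# Emulate TP=2 on one host: column-split the first weight, row-split the
# second, and stand in for the all-reduce with a sum of partial outputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))    # activations
A = rng.standard_normal((6, 8))    # first weight: split by columns
B = rng.standard_normal((8, 6))    # second weight: split by rows

tp = 2
A_shards = np.split(A, tp, axis=1)             # column parallel: no comm
H_shards = [X @ A_i for A_i in A_shards]       # each rank holds a slice of H
B_shards = np.split(B, tp, axis=0)             # row parallel: partial sums
partials = [H_i @ B_i for H_i, B_i in zip(H_shards, B_shards)]
Y = sum(partials)                              # the all-reduce

assert np.allclose(Y, X @ A @ B)
```

The column split hands each rank exactly the slice of H that its row-split shard of B consumes, which is why only one all-reduce per matmul pair is needed.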