Scale Atlas · Chapter 5 · 86 terms · Updated 2026-05-10
Parallelism
How a model splits across GPUs. Tensor parallelism shards a single matmul, pipeline parallelism shards the layer stack, FSDP and ZeRO shard optimizer state (and, at higher stages, gradients and parameters), sequence parallelism shards the tokens, expert parallelism shards the experts. 3D parallelism combines several of them. Each strategy chooses which collective bill you pay.
3D Parallelism
3D parallelism combines TP × PP × DP in one training run: TP inside an NVLink domain, PP across nodes in a pod, DP across pods. Used by GPT-3, Llama 2 70B, and Megatron-Turing NLG.
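A minimal sketch of the layout arithmetic, assuming the common convention that TP is the fastest-varying axis so a TP group stays inside one NVLink domain (the function name and axis order are illustrative, not from any specific framework):

```python
# Hypothetical helper: map a flat global rank to (dp, pp, tp) coordinates,
# with TP fastest-varying so consecutive ranks share an NVLink domain.
def rank_to_coords(rank, tp, pp, dp):
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp                 # position inside the TP group
    pp_rank = (rank // tp) % pp         # pipeline stage
    dp_rank = rank // (tp * pp)         # data-parallel replica
    return dp_rank, pp_rank, tp_rank

# 64 GPUs as TP=8 (one node), PP=4, DP=2:
print(rank_to_coords(0, 8, 4, 2))    # (0, 0, 0)
print(rank_to_coords(13, 8, 4, 2))   # (0, 1, 5)
```

Ranks 0-7 form one TP group on one node; ranks 0, 8, 16, 24 form one pipeline; the two 32-GPU pods are the DP replicas.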
Expert Parallelism
Expert parallelism routes each token to its top-K experts via all-to-all dispatch. Two all-to-alls per MoE layer (dispatch and combine) make EP bisection-bandwidth-bound.
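A hedged sketch of the dispatch bookkeeping, assuming experts are laid out contiguously across EP ranks (the function and its names are illustrative): for each token, take the top-K router scores, then count how many tokens each rank will receive — the counts that the dispatch all-to-all must move.

```python
# Illustrative MoE dispatch accounting: top-K routing, then per-rank
# token counts for the dispatch all-to-all.
def dispatch_counts(scores, k, experts_per_rank):
    counts = {}
    for token_scores in scores:  # one router score per expert, per token
        topk = sorted(range(len(token_scores)),
                      key=lambda e: token_scores[e], reverse=True)[:k]
        for e in topk:
            rank = e // experts_per_rank   # contiguous expert placement
            counts[rank] = counts.get(rank, 0) + 1
    return counts

# 3 tokens, 4 experts, top-2, 2 experts per EP rank:
scores = [[0.9, 0.1, 0.5, 0.2],
          [0.1, 0.8, 0.2, 0.7],
          [0.3, 0.2, 0.6, 0.9]]
print(dispatch_counts(scores, k=2, experts_per_rank=2))
```

Skewed counts like these are why MoE training adds load-balancing losses: a hot expert rank becomes the straggler for the whole all-to-all.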
FSDP vs ZeRO Sharding
ZeRO-1 shards optimizer state, ZeRO-2 adds gradients, ZeRO-3 adds parameters. PyTorch FSDP with full sharding ≈ ZeRO-3. Costs an all-gather of parameters per layer in forward (usually again in backward) plus a reduce-scatter of gradients.
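The stages are easiest to see as bytes per parameter. A minimal model, assuming the standard mixed-precision accounting (fp16 params and grads at 2 bytes each, fp32 Adam state — master copy plus two moments — at 12 bytes):

```python
# Rough per-GPU memory model in bytes per parameter, mixed-precision Adam.
# Assumes fp16 params/grads (2 B each) and fp32 optimizer state (12 B).
def bytes_per_param(stage, dp):
    p, g, opt = 2.0, 2.0, 12.0
    if stage >= 1: opt /= dp   # ZeRO-1: shard optimizer state
    if stage >= 2: g /= dp     # ZeRO-2: also shard gradients
    if stage >= 3: p /= dp     # ZeRO-3 / FSDP full shard: also shard params
    return p + g + opt

for stage in range(4):
    print(stage, bytes_per_param(stage, dp=8))
```

At DP=8 the 16 B/param baseline drops to 5.5 (ZeRO-1), 3.75 (ZeRO-2), and 2.0 (ZeRO-3); activations are extra and are what sequence parallelism attacks.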
Pipeline Parallelism
Pipeline parallelism splits the layer stack across P stages. Different micro-batches run on different stages. Bubble overhead (idle time over ideal compute time) = (P-1)/m, where m is the micro-batch count.
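The formula falls out of counting schedule steps. A minimal sketch in "stage-steps": with P stages and m micro-batches, the pipeline runs for m + P - 1 steps, but only m of them do useful work on any one stage.

```python
# Count steps in an idealized pipeline schedule: m micro-batches flow
# through P stages; fill and drain cost P-1 idle steps per stage.
def bubble_fraction(P, m):
    total_steps = m + P - 1   # fill + steady state + drain
    ideal_steps = m           # steps actually computing on a stage
    return (total_steps - ideal_steps) / ideal_steps

print(bubble_fraction(P=4, m=16))   # 3/16 = 0.1875
```

This is why deep pipelines want many micro-batches: at m = P the bubble is nearly 100% overhead, while m >> P drives it toward zero at the cost of smaller per-micro-batch matmuls.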
Sequence Parallelism
Sequence parallelism splits a single sequence's tokens across GPUs. Activation memory shrinks linearly with SP degree. Ring attention rotates K/V chunks around the SP ring.
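A hedged sketch of the ring-attention communication pattern (indices only, no attention math): K/V chunks hop one rank per step, so over SP steps every rank's queries see every chunk exactly once.

```python
# Illustrative ring schedule: schedule[step][rank] = which K/V chunk
# that rank holds at that step. Chunks rotate one hop per step.
def kv_schedule(sp):
    return [[(rank - step) % sp for rank in range(sp)] for step in range(sp)]

for row in kv_schedule(4):
    print(row)
```

Reading down any column, each rank visits all SP chunks with no repeats — the invariant that lets attention over the full sequence be assembled from per-chunk partial results.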
Tensor Parallelism
Tensor parallelism shards a single matmul across N GPUs by splitting the weight matrix along rows or columns. A column-split matmul needs no communication; the row-split matmul that follows costs one all-reduce. Lives on NVLink.
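A minimal single-host sketch of the Megatron-style column-then-row split, with the all-reduce emulated as a plain sum (shapes and the two-layer MLP structure are illustrative):

```python
import numpy as np

# Emulate TP=2 on one host: column-split the first weight, row-split the
# second, and stand in for the all-reduce with a sum of partial outputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))    # activations
A = rng.standard_normal((6, 8))    # first weight: split by columns
B = rng.standard_normal((8, 6))    # second weight: split by rows

tp = 2
A_shards = np.split(A, tp, axis=1)             # column parallel: no comm
H_shards = [X @ A_i for A_i in A_shards]       # each rank holds a slice of H
B_shards = np.split(B, tp, axis=0)             # row parallel: partial sums
partials = [H_i @ B_i for H_i, B_i in zip(H_shards, B_shards)]
Y = sum(partials)                              # the all-reduce

assert np.allclose(Y, X @ A @ B)
```

The column split hands each rank exactly the slice of H that its row-split shard of B consumes, which is why only one all-reduce per matmul pair is needed.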