Scale Atlas · Chapter 5 of 86 terms · Updated 2026-05-10

Parallelism

How a model splits across GPUs. Tensor parallelism shards a single matmul, pipeline parallelism shards the layer stack, FSDP and ZeRO shard parameters, gradients, and optimizer state, sequence parallelism shards the tokens, and expert parallelism shards the experts. 3D parallelism combines tensor, pipeline, and data parallelism. Each strategy chooses which collective bill you pay: TP pays all-reduces inside every layer, PP pays point-to-point activation sends between stages, FSDP pays all-gathers and reduce-scatters every step.
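The tensor-parallel case can be simulated on one machine. A minimal sketch, assuming a 2-way split of a 2-layer MLP in the Megatron style (W1 split by columns, W2 by rows); the shapes and world size are illustrative, and the final `sum` stands in for the all-reduce collective:

```python
import numpy as np

# Illustrative sizes; "world" plays the role of the TP group size.
rng = np.random.default_rng(0)
d_model, d_ff, world = 8, 32, 2

x = rng.standard_normal((4, d_model))      # (batch, d_model)
W1 = rng.standard_normal((d_model, d_ff))  # up-projection
W2 = rng.standard_normal((d_ff, d_model))  # down-projection

# Reference: the unsharded forward pass.
ref = np.maximum(x @ W1, 0) @ W2

# Megatron-style sharding: W1 split by columns, W2 by rows.
W1_shards = np.split(W1, world, axis=1)
W2_shards = np.split(W2, world, axis=0)

# Each "device" computes a partial output from its shard pair.
# The ReLU is elementwise, so it commutes with the column split.
partials = [np.maximum(x @ a, 0) @ b for a, b in zip(W1_shards, W2_shards)]

# The all-reduce (here, a plain sum across shards) recovers the full result.
out = sum(partials)
assert np.allclose(out, ref)
```

This is why TP's bill is an all-reduce inside every layer: each device holds only a partial sum of the MLP output and must combine it before the next LayerNorm.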

[Diagram: one transformer block (LayerNorm → Attn (Q,K,V,O) → LayerNorm → MLP (W1, W2)) annotated with five sharding axes: TP shards the matmul, PP shards the layer stack, FSDP shards optimizer + params, SP shards the tokens, EP shards the experts (MoE).]

Five strategies, five axes. Each chooses a different bandwidth and memory tradeoff.
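The FSDP/ZeRO axis trades memory for collectives in a different way: each rank owns only a slice of the optimizer state. A toy sketch, assuming 2 ranks and SGD with momentum (both illustrative choices); the per-rank loop stands in for ranks updating in parallel, and the commented all-gather is where the real collective would go:

```python
import numpy as np

world = 2
rng = np.random.default_rng(1)
params = rng.standard_normal(16)  # flat parameter vector
grads = rng.standard_normal(16)   # gradient after the backward pass

# Each rank owns only its 1/world slice of the momentum buffer,
# so optimizer memory per rank drops by a factor of `world`.
shards = np.split(np.arange(params.size), world)
momentum = [np.zeros(idx.size) for idx in shards]

lr, mu = 0.1, 0.9
new_params = params.copy()
for rank, idx in enumerate(shards):
    # Each rank updates only the parameters whose state it owns...
    momentum[rank] = mu * momentum[rank] + grads[idx]
    new_params[idx] -= lr * momentum[rank]
# ...then an all-gather broadcasts the updated slices to every rank.

# The sharded step matches an unsharded SGD-with-momentum step
# (momentum starts at zero, so the first step is plain SGD).
assert np.allclose(new_params, params - lr * grads)
```

Sharding the state is cheap at optimizer-step time; the bandwidth bill appears when parameters must be all-gathered for the forward pass, which is what full FSDP (ZeRO-3) adds on top of this.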