Scale Atlas · Chapter 3 · 86 terms · Updated 2026-05-10
Collectives
How GPUs share gradients, parameters, and activations during training. NCCL ring saturates bandwidth, NCCL tree wins on small-message latency, and gradient bucketing, FSDP sharding, and compute-communication overlap bend the workload to fit the wires.
All-Gather vs Reduce-Scatter
All-reduce decomposes into reduce-scatter (each GPU keeps one reduced slice) plus all-gather (each GPU collects every slice). FSDP runs the two halves at different points in the step: all-gather to rematerialize full parameters before forward and backward, reduce-scatter to shard gradients after backward.
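A minimal sketch of the decomposition using torch.distributed primitives. It assumes an initialized NCCL process group and a buffer whose element count divides evenly by the world size; the function name is illustrative, not FSDP's internal API.

```python
import torch
import torch.distributed as dist

def decomposed_all_reduce(grad: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()
    flat = grad.reshape(-1)  # assumes numel divisible by world size
    # Phase 1: reduce-scatter. Each rank ends up owning the summed
    # values for exactly one 1/world slice of the buffer.
    shard = torch.empty(flat.numel() // world, device=flat.device, dtype=flat.dtype)
    dist.reduce_scatter_tensor(shard, flat)
    # FSDP stops here after backward: each rank keeps only its shard.
    # Phase 2: all-gather. Every rank collects every slice,
    # reproducing the full all-reduced buffer.
    out = torch.empty_like(flat)
    dist.all_gather_into_tensor(out, shard)
    return out.view_as(grad)
```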
Compute-Communication Overlap
Backward pass on layer N runs concurrently with all-reduce on layer N+1's gradients, which backward has already produced. The collective hides behind compute that would otherwise block. The final bucket's collective, covering the first layer's gradients computed last, is always exposed.
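A hand-rolled sketch of the pattern, assuming an initialized NCCL process group. DDP does this automatically via autograd hooks; `grad_buckets` and `compute_chunks` are hypothetical stand-ins for bucketed gradients and the remaining backward work.

```python
import torch.distributed as dist

def overlapped_step(grad_buckets, compute_chunks):
    handle = None
    for bucket, compute in zip(grad_buckets, compute_chunks):
        # Enqueue the collective for the gradients just produced
        # on NCCL's stream...
        new_handle = dist.all_reduce(bucket, async_op=True)
        # ...while the next slice of backward math runs on the
        # compute stream, hiding the in-flight collective.
        compute()
        if handle is not None:
            handle.wait()  # stay at most one collective deep
        handle = new_handle
    if handle is not None:
        handle.wait()  # the last collective has nothing left to hide it
```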
Gradient Bucketing
PyTorch DDP groups parameter gradients into 25 MB buckets so the all-reduce ring sees fewer, larger messages, pushing it toward the ring's bandwidth-optimal regime instead of paying per-message latency for every parameter.
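A minimal configuration sketch: `bucket_cap_mb` and `gradient_as_bucket_view` are real DDP arguments, while the model and setup here are placeholders assuming an initialized NCCL process group.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# ~4 MB of gradients per layer; buckets fuse several layers' grads.
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024) for _ in range(8)]
).cuda()
ddp = DDP(
    model,
    bucket_cap_mb=25,             # default: fuse grads into ~25 MB messages
    gradient_as_bucket_view=True, # grads alias bucket storage, saving a copy
)
# Larger buckets amortize per-message latency; smaller buckets let the
# first all-reduce launch sooner during backward.
```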
NCCL All-Reduce: Ring vs Tree
Ring all-reduce moves 2(P-1)/P · N bytes per GPU for an N-byte buffer, at near line rate. Tree all-reduce takes ~2·log₂(P) hops and wins on small messages. NCCL picks per call based on message size, channel count, and topology.
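A worked check of the ring traffic term (pure Python; the buffer size and GPU count are illustrative).

```python
def ring_bytes_per_gpu(n_bytes: int, p: int) -> float:
    # Reduce-scatter sends (P-1)/P * N, all-gather another (P-1)/P * N.
    return 2 * (p - 1) / p * n_bytes

# A 1 GiB gradient buffer on 8 GPUs: each GPU sends ~1.75 GiB total,
# approaching 2N as P grows, which is why ring is bandwidth-optimal.
print(ring_bytes_per_gpu(1 << 30, 8) / 2**30)  # -> 1.75
```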
NCCL Tuner Environment Variables
Roughly 30 NCCL_* environment variables select algorithms (Ring/Tree/CollNet), protocols (LL/LL128/Simple), channels, and buffer sizes. NCCL_DEBUG=INFO surfaces the actual choice.
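An illustrative way to set a few of these knobs from Python; the variables are real NCCL tuners, but the values are examples rather than recommendations, and they must be set before the first collective initializes NCCL.

```python
import os

os.environ["NCCL_DEBUG"] = "INFO"      # log which algo/proto NCCL picked
os.environ["NCCL_ALGO"] = "Tree"       # force Tree (or "Ring", "CollNet")
os.environ["NCCL_PROTO"] = "LL128"     # low-latency 128-byte protocol
os.environ["NCCL_BUFFSIZE"] = str(4 << 20)  # per-channel buffer, bytes
os.environ["NCCL_MAX_NCHANNELS"] = "16"     # cap parallel channels

import torch.distributed as dist
# dist.init_process_group("nccl")  # env must be set before this point
```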
Tree-Reduce Latency
For small messages or large GPU counts, tree all-reduce beats ring on wall-clock time despite worse bandwidth. In the α + β·N model (α per-hop latency, β inverse link bandwidth, N message size), tree's O(log P) latency term wins until messages cross ~256 KiB and the bandwidth term takes over.
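A toy α + β·N comparison. The constants below are illustrative, not measured, and the exact crossover depends on α, β, P, and NCCL's pipelining.

```python
import math

def ring_time(n_bytes, p, alpha, beta):
    # 2(P-1) sequential steps, each moving an N/P-byte chunk.
    return 2 * (p - 1) * (alpha + beta * n_bytes / p)

def tree_time(n_bytes, p, alpha, beta):
    # Reduce up plus broadcast down: ~2*log2(P) hops of the full buffer.
    return 2 * math.log2(p) * (alpha + beta * n_bytes)

alpha = 5e-6   # 5 us per hop (illustrative)
beta = 1e-11   # 1 / (100 GB/s) seconds per byte (illustrative)
for kib in (64, 1024, 16384):
    n = kib * 1024
    print(f"{kib:>6} KiB  ring {ring_time(n, 8, alpha, beta)*1e6:7.1f} us"
          f"  tree {tree_time(n, 8, alpha, beta)*1e6:7.1f} us")
# Tree's log-alpha term wins at small N; ring's smaller beta coefficient
# wins once the bandwidth term dominates.
```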