Scale Atlas · Chapter 3 · 86 terms · Updated 2026-05-10
Collectives
How GPUs share gradients, parameters, and activations during training. NCCL ring saturates bandwidth, NCCL tree wins on small-message latency, and gradient bucketing, FSDP sharding, and compute-communication overlap bend the workload to fit the wires.
All-Gather vs Reduce-Scatter
All-reduce decomposes into reduce-scatter (each GPU keeps one reduced slice) plus all-gather (each GPU collects every slice). FSDP runs the two halves at different points in the step: all-gather to rematerialize full parameters before forward and backward, reduce-scatter to shard gradients after backward.
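A minimal sketch of the decomposition using torch.distributed primitives. It assumes an initialized NCCL process group and a buffer whose element count divides evenly by the world size; the function name is illustrative, not FSDP's internal API.

```python
import torch
import torch.distributed as dist

def decomposed_all_reduce(grad: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()
    flat = grad.reshape(-1)  # assumes numel divisible by world size
    # Phase 1: reduce-scatter. Each rank ends up owning the summed
    # values for exactly one 1/world slice of the buffer.
    shard = torch.empty(flat.numel() // world, device=flat.device, dtype=flat.dtype)
    dist.reduce_scatter_tensor(shard, flat)
    # FSDP stops here after backward: each rank keeps only its shard.
    # Phase 2: all-gather. Every rank collects every slice,
    # reproducing the full all-reduced buffer.
    out = torch.empty_like(flat)
    dist.all_gather_into_tensor(out, shard)
    return out.view_as(grad)
```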
Compute-Communication Overlap
Backward pass on layer N runs concurrently with all-reduce on layer N+1's gradients, which backward has already produced. The collective hides behind compute that would otherwise block. The final bucket's collective, covering the first layer's gradients computed last, is always exposed.
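A hand-rolled sketch of the pattern, assuming an initialized NCCL process group. DDP does this automatically via autograd hooks; `grad_buckets` and `compute_chunks` are hypothetical stand-ins for bucketed gradients and the remaining backward work.

```python
import torch.distributed as dist

def overlapped_step(grad_buckets, compute_chunks):
    handle = None
    for bucket, compute in zip(grad_buckets, compute_chunks):
        # Enqueue the collective for the gradients just produced
        # on NCCL's stream...
        new_handle = dist.all_reduce(bucket, async_op=True)
        # ...while the next slice of backward math runs on the
        # compute stream, hiding the in-flight collective.
        compute()
        if handle is not None:
            handle.wait()  # stay at most one collective deep
        handle = new_handle
    if handle is not None:
        handle.wait()  # the last collective has nothing left to hide it
```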
Gradient Bucketing
PyTorch DDP groups parameter gradients into 25 MB buckets so the all-reduce ring sees fewer, larger messages, pushing it toward the ring's bandwidth-optimal regime instead of paying per-message latency for every parameter.
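A minimal configuration sketch: `bucket_cap_mb` and `gradient_as_bucket_view` are real DDP arguments, while the model and setup here are placeholders assuming an initialized NCCL process group.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# ~4 MB of gradients per layer; buckets fuse several layers' grads.
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024) for _ in range(8)]
).cuda()
ddp = DDP(
    model,
    bucket_cap_mb=25,             # default: fuse grads into ~25 MB messages
    gradient_as_bucket_view=True, # grads alias bucket storage, saving a copy
)
# Larger buckets amortize per-message latency; smaller buckets let the
# first all-reduce launch sooner during backward.
```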
NCCL All-Reduce: Ring vs Tree
Ring all-reduce moves 2(P-1)/P · N bytes per GPU for an N-byte buffer, at near line rate. Tree all-reduce takes ~2·log₂(P) hops and wins on small messages. NCCL picks per call based on message size, channel count, and topology.
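A worked check of the ring traffic term (pure Python; the buffer size and GPU count are illustrative).

```python
def ring_bytes_per_gpu(n_bytes: int, p: int) -> float:
    # Reduce-scatter sends (P-1)/P * N, all-gather another (P-1)/P * N.
    return 2 * (p - 1) / p * n_bytes

# A 1 GiB gradient buffer on 8 GPUs: each GPU sends ~1.75 GiB total,
# approaching 2N as P grows, which is why ring is bandwidth-optimal.
print(ring_bytes_per_gpu(1 << 30, 8) / 2**30)  # -> 1.75
```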
NCCL Tuner Environment Variables
Roughly 30 NCCL_* environment variables select algorithms (Ring/Tree/CollNet), protocols (LL/LL128/Simple), channels, and buffer sizes. NCCL_DEBUG=INFO surfaces the actual choice.
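An illustrative way to set a few of these knobs from Python; the variables are real NCCL tuners, but the values are examples rather than recommendations, and they must be set before the first collective initializes NCCL.

```python
import os

os.environ["NCCL_DEBUG"] = "INFO"      # log which algo/proto NCCL picked
os.environ["NCCL_ALGO"] = "Tree"       # force Tree (or "Ring", "CollNet")
os.environ["NCCL_PROTO"] = "LL128"     # low-latency 128-byte protocol
os.environ["NCCL_BUFFSIZE"] = str(4 << 20)  # per-channel buffer, bytes
os.environ["NCCL_MAX_NCHANNELS"] = "16"     # cap parallel channels

import torch.distributed as dist
# dist.init_process_group("nccl")  # env must be set before this point
```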
Tree-Reduce Latency
For small messages or large GPU counts, tree all-reduce beats ring on wall-clock time despite worse bandwidth. In the α + β·N model (α per-hop latency, β inverse link bandwidth, N message size), tree's O(log P) latency term wins until messages cross ~256 KiB and the bandwidth term takes over.
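A toy α + β·N comparison. The constants below are illustrative, not measured, and the exact crossover depends on α, β, P, and NCCL's pipelining.

```python
import math

def ring_time(n_bytes, p, alpha, beta):
    # 2(P-1) sequential steps, each moving an N/P-byte chunk.
    return 2 * (p - 1) * (alpha + beta * n_bytes / p)

def tree_time(n_bytes, p, alpha, beta):
    # Reduce up plus broadcast down: ~2*log2(P) hops of the full buffer.
    return 2 * math.log2(p) * (alpha + beta * n_bytes)

alpha = 5e-6   # 5 us per hop (illustrative)
beta = 1e-11   # 1 / (100 GB/s) seconds per byte (illustrative)
for kib in (64, 1024, 16384):
    n = kib * 1024
    print(f"{kib:>6} KiB  ring {ring_time(n, 8, alpha, beta)*1e6:7.1f} us"
          f"  tree {tree_time(n, 8, alpha, beta)*1e6:7.1f} us")
# Tree's log-alpha term wins at small N; ring's smaller beta coefficient
# wins once the bandwidth term dominates.
```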