Scale Atlas · Chapter 3 of 86 terms · Updated 2026-05-10

Collectives

How GPUs share gradients, parameters, and activations during training. NCCL's ring algorithm saturates link bandwidth, its tree algorithm wins on small-message latency, and gradient bucketing, FSDP sharding, and compute-communication overlap bend the workload to fit the wires.
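To make the bandwidth claim concrete, here is a toy ring all-reduce simulated in NumPy (the function name and structure are illustrative, not NCCL's API). Each of the p ranks sends 2(p−1) chunks of size n/p, so per-link traffic is roughly 2n(p−1)/p bytes, approaching 2n no matter how many GPUs join the ring; the tree variant trades this for O(log p) hops, which is why it wins when messages are small and latency dominates.

```python
import numpy as np

def ring_all_reduce(buffers):
    """Toy simulation of ring all-reduce across `buffers`, one array per rank."""
    p = len(buffers)
    chunks = [np.array_split(b.astype(float), p) for b in buffers]

    # Phase 1: reduce-scatter. In step t, rank r sends chunk (r - t) mod p
    # to rank (r + 1) mod p, which accumulates it. All p sends happen in
    # parallel on real hardware; here we snapshot them first.
    for t in range(p - 1):
        sends = [(r, (r - t) % p, chunks[r][(r - t) % p].copy()) for r in range(p)]
        for r, c, data in sends:
            chunks[(r + 1) % p][c] += data

    # Phase 2: all-gather. After phase 1, rank r holds the fully reduced
    # chunk (r + 1) mod p; each step forwards a completed chunk around the
    # ring, overwriting the receiver's stale copy.
    for t in range(p - 1):
        sends = [(r, (r + 1 - t) % p, chunks[r][(r + 1 - t) % p].copy()) for r in range(p)]
        for r, c, data in sends:
            chunks[(r + 1) % p][c] = data

    return [np.concatenate(cs) for cs in chunks]

# Each "GPU" starts with a different gradient; all end with the sum.
grads = [np.full(8, r, dtype=float) for r in range(4)]
out = ring_all_reduce(grads)
assert all(np.allclose(o, 0 + 1 + 2 + 3) for o in out)
```

Counting transfers in the sketch: each rank sends p−1 chunks in each phase, 2(p−1) chunks of n/p elements total, which is the bandwidth-optimal figure the ring is known for.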

[Figure: step timeline. Compute lanes (forward, then bwd L2, bwd L1, bwd L0) run above communication lanes (AR L2, AR L1, AR L0) on NVLink and IB, with the inter-node tree, the step boundary, and the exposed slice marked. Backward overlaps with the collective; NVLink runs the intra-node ring, IB runs the inter-node hop.]
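The overlap in the diagram can be sketched with autograd hooks: launch an async all-reduce the moment each layer's gradient is ready, then drain the outstanding handles at the step boundary. This is a minimal sketch, assuming torch.distributed is already initialized (e.g. under torchrun with the NCCL backend); `attach_overlap_hooks` and `drain` are hypothetical names, and real DDP additionally copies grads into flat buckets to amortize launch cost.

```python
import torch
import torch.distributed as dist

def attach_overlap_hooks(model: torch.nn.Module):
    handles = []  # (async work handle, grad tensor), filled during backward

    def hook(grad):
        # Launch the collective as soon as this grad is ready; NCCL runs it
        # on its own stream while autograd keeps producing earlier layers'
        # grads (hooks fire in reverse layer order: L2, L1, L0). Reducing
        # the leaf grad in place is a sketch-level simplification that DDP
        # avoids by staging grads into buckets.
        work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
        handles.append((work, grad))
        return grad

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)

    def drain():
        # Step boundary: wait out whatever communication backward could not
        # hide, then average the summed gradients.
        world = dist.get_world_size()
        for work, grad in handles:
            work.wait()
            grad.div_(world)
        handles.clear()

    return drain

# Per step: loss.backward(); drain(); optimizer.step()
```

Read against the diagram: AR L2 and AR L1 hide behind bwd L1 and bwd L0, while AR L0 has no compute left to hide behind, and the time `drain()` spends waiting on it is the exposed slice.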