Compute-Communication Overlap
A synchronous data-parallel step has two phases that look like they have to happen in order: compute the gradients, then all-reduce them. If they actually ran in series, every step would pay the full collective time on top of the full backward time, and a 1.5 GB gradient buffer on 64 H100s would burn ten or twenty milliseconds of pure NCCL on every step. They do not run in series. The collective for an early layer's gradient runs while the GPU is still computing the gradient for a later layer. The fabric and the SMs work in parallel for most of the step, and the all-reduce mostly disappears into wall clock that was going to be spent on backward anyway.
Backward unwinds in reverse order
In a transformer, forward computes layers L0 through LN in order. Backward runs the chain rule in reverse: gradients for LN land first, then LN-1, then LN-2, all the way down to L0. Autograd is strict about this ordering because each layer's input gradient is the previous layer's output gradient, so layer LK-1 cannot start until layer LK has written its result.
The opportunity is what happens to layer LN's gradient the instant it is computed. Nothing else needs it on this GPU until the optimizer step at the end of the step. So you can start the all-reduce of LN's gradient immediately, and let it run on the NIC and on NVLink while the SMs keep computing the backward pass for LN-1, LN-2, and the rest of the model. When LN-1's gradient lands, kick off its all-reduce too. By the time backward reaches L0, the all-reduces for every later layer are either done or in flight, and the only collective left to wait for is L0's, which has no remaining backward compute behind it.
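Before looking at how frameworks automate it, the mechanics are easy to sketch by hand. The following is a minimal, unbucketed sketch, not DDP's implementation: it assumes an already-initialized NCCL process group and PyTorch 2.1+ (for register_post_accumulate_grad_hook), and fires an asynchronous all-reduce the moment each parameter's gradient lands.

    import torch
    import torch.distributed as dist

    def attach_overlap_hooks(model: torch.nn.Module, handles: list):
        """Launch one async all-reduce per parameter as soon as its gradient lands.
        Unbucketed sketch of the overlap idea; DDP batches this into ~25 MB buckets."""
        def hook(param):
            # async_op=True returns a handle immediately; NCCL runs the collective on
            # its own stream while autograd keeps producing earlier layers' gradients.
            # (ReduceOp.AVG assumes the NCCL backend.)
            handles.append(dist.all_reduce(param.grad, op=dist.ReduceOp.AVG, async_op=True))

        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(hook)

    # Per training step:
    #   handles.clear()
    #   loss.backward()            # hooks fire layer by layer, last layer first
    #   for h in handles:
    #       h.wait()               # typically only the earliest layers' collectives remain
    #   optimizer.step()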
Why 60-90% gets hidden
The numbers work because individual layer backward and individual layer all-reduce land in the same order of magnitude on modern hardware. On 8x H100 with NVSwitch, an all-reduce of a 25 MB gradient bucket takes roughly 1 to 2 ms once the ring is in its bandwidth-bound regime. The backward pass of a single transformer layer (attention plus MLP) on a 70B model takes roughly 3 to 10 ms on the same hardware, depending on sequence length and tensor parallel degree. Most all-reduces complete before the next layer's backward finishes, which means the only wall-clock cost they add is the tail.
The exposed fraction is whatever fits outside that overlap window. Across a 70-layer model with sensible bucketing, that exposed slice is small: 60 to 90% of the all-reduce time hides behind backward. The piece that cannot hide is the last layer's all-reduce. The first layer in forward order (L0) is the last to compute its gradient in backward, and once L0's gradient lands there is no L-1 still running backward to overlap with. That collective always shows up on the critical path. Everything else, in the steady state, is free.
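A toy queue model puts rough numbers on that, using the illustrative figures above. It is not a profiler: it assumes one bucket per layer and ignores launch overhead and bucket misalignment, which is why it hides more than the 60 to 90% seen in practice.

    # Toy queue model: backward emits one gradient bucket per layer (last layer first);
    # the NCCL stream drains buckets one at a time, in arrival order.
    def exposed_comm_ms(n_layers: int, bwd_ms: float, ar_ms: float) -> float:
        compute_done = 0.0   # time the current layer's backward finishes
        comm_done = 0.0      # time the NCCL stream finishes its queued buckets
        for _ in range(n_layers):
            compute_done += bwd_ms
            comm_done = max(comm_done, compute_done) + ar_ms
        return comm_done - compute_done   # all-reduce time left after backward ends

    # exposed_comm_ms(70, bwd_ms=5.0, ar_ms=1.5)  -> 1.5 ms: only the last bucket shows
    # exposed_comm_ms(70, bwd_ms=0.2, ar_ms=1.5)  -> ~91 ms: comm outruns compute (fabric-bound)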
How frameworks orchestrate it
PyTorch DDP wires this up with a backward hook on each parameter. When autograd writes a gradient into the parameter's .grad field, DDP's hook fires, the gradient is appended to a bucket, and once the bucket fills DDP launches its all-reduce on a separate CUDA stream. Backward keeps running on the default compute stream; the collective runs on the NCCL stream; both make progress in parallel. optimizer.step() synchronizes both streams before applying updates, so the user code never sees the parallelism, only the lower step time.
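A minimal DDP setup sketch, with build_model, loader, and the .loss attribute as placeholders; bucket_cap_mb (default 25 MB) is the knob that sets how much gradient accumulates before each all-reduce launches.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")              # one process per GPU, e.g. launched via torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)       # build_model, loader, .loss are placeholders
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        bucket_cap_mb=25,                        # default bucket size: larger buckets mean fewer,
    )                                            # bigger all-reduces launched later in backward

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for batch in loader:
        loss = ddp_model(batch).loss
        loss.backward()                          # bucket hooks fire; all-reduces overlap the rest of backward
        optimizer.step()                         # waits on outstanding collectives, then updates
        optimizer.zero_grad()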
The user-visible knob is register_comm_hook. The default hook is the standard all-reduce; you can swap in PowerSGD, FP16 compression, or any custom collective by registering a different hook. FSDP does the same thing one level up: it attaches hooks that fire a reduce-scatter on each wrapped block's gradient as it lands, and prefetches the all-gather of the next block's parameters during forward (the wrap policy sets the granularity of those blocks). Same overlap principle, different collectives, same last-layer exposure.
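For instance, swapping in FP16 compression is one registration call against the hooks that ship in torch.distributed.algorithms.ddp_comm_hooks (a sketch against the ddp_model from above; check the import paths for your PyTorch version).

    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    # Compress each bucket to FP16 for the all-reduce, decompress on the way back.
    # state=None uses the default process group.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

    # PowerSGD registers the same way, with a state object carrying its approximation rank:
    # from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD
    # state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=2)
    # ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)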
What can break the overlap
The overlap is real, but it is not bulletproof. NCCL channel exhaustion is the common one: collectives on the comm stream still consume SMs to drive the rings, and on small GPUs or under aggressive settings they can starve the compute kernels they were supposed to hide behind. Tuning NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS shifts that trade-off.
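The channel counts are environment variables read when the NCCL communicator is created; a sketch with illustrative values:

    import os

    # Must be set before the first NCCL communicator is created, i.e. before
    # init_process_group / the first collective in this process.
    os.environ["NCCL_MIN_NCHANNELS"] = "4"   # floor: keep enough rings for bandwidth
    os.environ["NCCL_MAX_NCHANNELS"] = "8"   # ceiling: cap how many SMs NCCL may occupy

    import torch.distributed as dist
    dist.init_process_group("nccl")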
Hardware contention is the second. PCIe root complexes can stall when the collective's GPUDirect RDMA traffic and other transfers (host-to-device copies, peer-to-peer) want the same lanes; on H100 + NVSwitch this rarely matters, but on PCIe-only nodes it does.
Workload shape is the third. On a small model where one layer's backward takes 200 µs and the all-reduce takes 1.5 ms, no amount of overlap will hide the collective; the SMs run out of work before the NIC does. That regime is fabric-bound, and the only fix is fewer, larger steps (gradient accumulation) or smaller gradient buffers (FSDP / ZeRO).
Profile before you tune. NCCL_DEBUG=INFO plus an Nsight Systems trace will show the compute stream and the NCCL stream side by side, and the gap between "all-reduce ends" and "step done" is exactly your last-layer exposure.
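For a single number without a full trace, a rough in-loop estimate (a sketch reusing loss and optimizer from the snippet above; the gap also includes the optimizer's own kernels):

    import torch

    bwd_done = torch.cuda.Event(enable_timing=True)
    step_done = torch.cuda.Event(enable_timing=True)

    loss.backward()                 # compute stream; DDP's all-reduces run on the NCCL stream
    bwd_done.record()               # end of backward compute on the default stream
    optimizer.step()                # blocks on the outstanding collectives before updating
    step_done.record()
    torch.cuda.synchronize()

    # Roughly the exposed all-reduce tail plus the optimizer's own kernels.
    print(f"exposed tail: {bwd_done.elapsed_time(step_done):.2f} ms")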
What this means in practice
- If step time is dominated by exposed all-reduce, increase the bucket size (bucket_cap_mb) or raise NCCL_MAX_NCHANNELS to push the ring closer to its bandwidth ceiling. Both move you out of the latency-bound regime where overlap helps least.
- The last layer's all-reduce is the irreducible piece of each backward pass. Gradient accumulation across micro-batches hides it across micro-batch boundaries: while micro-batch K's last all-reduce is in flight, micro-batch K+1's forward is already running on the same GPU, and only the final micro-batch's tail still gates the optimizer step (sketched after this list).
- Do not turn on find_unused_parameters=True unless your model genuinely has conditional parameters. It makes DDP walk the autograd graph after every forward to mark unused parameters ready, adding per-step overhead on exactly the path the overlap is trying to shorten.
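A sketch of that accumulation pattern, reusing the names from the DDP snippet above: every micro-batch's backward launches its collectives, each trailing all-reduce overlaps with the next micro-batch's forward, and only the final one sits in front of optimizer.step().

    accum_steps = 4                                  # illustrative micro-batch count

    for i, batch in enumerate(loader):
        loss = ddp_model(batch).loss / accum_steps   # scale so accumulated grads match one big batch
        loss.backward()                              # this micro-batch's trailing all-reduce keeps
                                                     # running while the next micro-batch's forward starts
        if (i + 1) % accum_steps == 0:
            optimizer.step()                         # only the final micro-batch's tail gates the update
            optimizer.zero_grad()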
Updated 2026-05-10