
Tree-Reduce Latency

For small messages or large GPU counts, tree all-reduce beats ring on wall-clock time despite worse bandwidth. In the α + β·N model, tree's α·log₂ P latency term wins until messages cross ~256 KiB.
Latency floor: ~10 µs (small messages)
Crosses ring near: ~256 KiB
Latency scaling: α · log₂ P

The previous term, ring vs tree, introduced the two algorithms NCCL ships and named the crossover. This term zooms into the math underneath that crossover. The α + β·N model is what NCCL's runtime tuner actually evaluates per call; understanding it is the difference between debugging communication time with a model and guessing.

The α + β·N model

Every collective's wall-clock cost decomposes into two terms. The first, α, is the per-hop latency floor: the cost of issuing a message on the wire and waiting for the first byte to land at the receiver. It is fabric-dependent and roughly independent of message size. The second, β·N, is the per-byte transfer cost: 1/bandwidth times the number of bytes moved. The total time for one hop is α + β·N. Stitch a collective together and you sum α and β·N over the schedule of hops the algorithm chooses.
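The per-hop cost is trivial to compute directly. A minimal sketch, using the illustrative constants the next paragraph quotes (~10 µs latency floor, β derived from ~750 GB/s effective bandwidth); `hop_time` is a hypothetical helper, not an NCCL API:

```python
def hop_time(n_bytes, alpha_s, beta_s_per_byte):
    """Wall-clock time for one point-to-point hop: alpha + beta * N."""
    return alpha_s + beta_s_per_byte * n_bytes

# Illustrative constants from the text: ~10 us latency floor and a beta
# corresponding to ~750 GB/s of effective per-GPU bandwidth.
ALPHA = 10e-6
BETA = 1 / 750e9

print(hop_time(0, ALPHA, BETA))         # the latency floor alone: 1e-05
print(hop_time(64 << 20, ALPHA, BETA))  # 64 MiB: the beta term dominates
```

Note that with these constants even a 1 MiB hop is still mostly α: the bandwidth term only takes over once messages reach the tens of MiB per hop.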

Concrete numbers anchor the rest. NVLink small-message latency on HGX H100 lands around 5 to 10 µs end-to-end inside one node. InfiniBand NDR is roughly 2 to 5 µs per switch hop. Add NCCL's dispatch overhead and the practical end-to-end latency floor for the smallest in-node calls is ~10 µs. β is harder to quote (it depends on protocol and channel count), but on H100 NVLink the per-GPU β corresponds to roughly 700 to 800 GB/s of effective bandwidth.

For ring all-reduce the schedule is 2(P-1) sequential steps and each GPU pushes 2(P-1)/P × N bytes total, so the cost is roughly:

T_ring  ≈  2(P-1) · α  +  2(P-1)/P · β · N

For tree all-reduce the schedule is reduce-up plus broadcast-down, depth log₂(P) each way, so the cost is roughly:

T_tree  ≈  2 · log₂(P) · α  +  2 · log₂(P) · β · N
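Both formulas are straightforward to evaluate. A sketch with hypothetical helper names (`t_ring`, `t_tree`) and the illustrative constants from above; at a small message tree wins, at a large one ring wins, exactly as the coefficient comparison below predicts:

```python
import math

def t_ring(P, N, alpha, beta):
    # 2(P-1) sequential hops; each GPU moves 2(P-1)/P * N bytes in total
    return 2 * (P - 1) * alpha + 2 * (P - 1) / P * beta * N

def t_tree(P, N, alpha, beta):
    # reduce up plus broadcast down, depth log2(P) each way
    return 2 * math.log2(P) * alpha + 2 * math.log2(P) * beta * N

alpha, beta = 10e-6, 1 / 750e9          # illustrative constants from the text
small, large = 8 * 1024, 256 * 1024 * 1024

print(t_tree(64, small, alpha, beta) < t_ring(64, small, alpha, beta))  # True
print(t_ring(64, large, alpha, beta) < t_tree(64, large, alpha, beta))  # True
```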

The α coefficients differ by a factor of (P-1)/log₂(P), which is large at scale. The β coefficients differ by a factor of (P-1)/(P · log₂(P)), which approaches 1/log₂(P) for large P. So tree wins on α and ring wins on β. Which term dominates depends entirely on how big N is.

Why tree wins for small messages

When N is small, the α·hops term dwarfs the β·N term. Hop count becomes the only thing that matters, and tree's hop count is dramatically lower. At P=8 the ring takes 14 sequential hops vs the tree's 6. At P=64 it is 126 vs 12. At P=256 it is 510 vs 16. Every hop on the ring is one synchronous round-trip on the fabric; multiply by α and the linear scaling shows up as a real number on the wall clock.

The numerical difference is not subtle. A P=64 ring with α=10 µs spends 1.26 ms on latency alone before any bytes move. The equivalent tree spends 120 µs. For a control message or a small last-layer gradient that is the entire wall-clock budget, and the algorithm choice is the difference between a step that makes its budget and one that misses it.
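The hop counts and latency floors above fall straight out of the two schedules; a quick check:

```python
import math

def ring_hops(P):
    return 2 * (P - 1)

def tree_hops(P):
    return 2 * int(math.log2(P))

ALPHA = 10e-6  # per-hop latency floor from the text
for P in (8, 64, 256):
    print(P, ring_hops(P), tree_hops(P),
          f"{ring_hops(P) * ALPHA * 1e6:.0f} us vs {tree_hops(P) * ALPHA * 1e6:.0f} us")
# 8:   14 vs 6 hops   -> 140 us vs 60 us
# 64:  126 vs 12 hops -> 1260 us vs 120 us
# 256: 510 vs 16 hops -> 5100 us vs 160 us
```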

[Figure: wall clock (µs, log scale) vs message size (log scale, 1 KB to 1 GB) for tree and ring, with the ~256 KiB crossover marked. Below the crossover, tree's α·log P beats ring's β·N; above it, ring's bandwidth advantage pays off.]

When the crossover flips

As N grows, β·N starts to dominate the schedule and ring's bandwidth advantage takes over. On an 8x H100 NVSwitch node the crossover sits near ~256 KiB: below that, tree's lower hop count wins despite paying a worse β; above that, ring's bandwidth-optimal schedule wins despite the longer hop chain. The exact location moves with the topology and the channel count.

Lower-bandwidth fabrics push the crossover down: an IB-only multi-node all-reduce has a worse β than NVLink, so ring's advantage kicks in at smaller messages. NVL72, with 1.8 TB/s per GPU and a 72-way domain, has a better β and a slightly worse α (deeper switching), which pushes the crossover up. Channel count also moves it: NCCL parallelizes a ring across multiple SM-driven channels, and more channels lower the effective β. This is why NCCL's runtime tuner does not ship a single hardcoded threshold; it plugs measured α and β into both formulas and picks the winner per call.
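Setting T_ring = T_tree and solving for N gives a closed-form crossover, which makes the direction of each effect easy to check. A sketch with a hypothetical `crossover_bytes` helper; note that plugging in the coarse back-of-envelope constants from earlier puts the crossover far above the ~256 KiB a real NVSwitch node shows, because the tuner's measured per-protocol α and per-channel β differ a lot from these illustrative values — which is precisely why NCCL measures instead of hardcoding:

```python
import math

def crossover_bytes(P, alpha, beta):
    """N where T_ring == T_tree, i.e. solve
    2(P-1)*a + 2(P-1)/P * b*N = 2*log2(P) * (a + b*N)."""
    lg = math.log2(P)
    return alpha * (P - 1 - lg) / (beta * (lg - (P - 1) / P))

alpha = 10e-6
base_beta = 1 / 750e9               # single-channel effective bandwidth (illustrative)
for channels in (1, 2, 4):
    beta = base_beta / channels     # more channels -> lower effective beta
    print(channels, f"{crossover_bytes(8, alpha, beta) / 1024:.0f} KiB")
# lower effective beta pushes the crossover up; higher alpha does too
```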

What operators control vs what NCCL picks

Setting NCCL_ALGO=Tree or NCCL_ALGO=Ring forces the choice and overrides the tuner. Do not ship that to production. The tuner has measured your fabric; you almost certainly have not. The right diagnostic is NCCL_DEBUG=INFO, which prints the algorithm and protocol the tuner picked for each call. Reading those logs across a real training step is how you find out what your job actually does.

What you do control is the shape of the workload. Gradient bucketing coalesces many small gradient tensors into a few large buffers, which pushes every all-reduce above the crossover into the ring-favored regime. PyTorch DDP's default 25 MB bucket exists for exactly this reason. FSDP pulls the other way: reduce-scatter moves roughly half the bytes of the equivalent all-reduce, and on smaller models or shallower layers that reduction can pull the call back below the crossover and flip NCCL into tree without warning.
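The effect of bucketing is easy to see with a toy model. A greedy coalescing sketch with hypothetical gradient sizes (this is not PyTorch's actual bucketing implementation), using a 25 MB cap like DDP's default and the ~256 KiB crossover from above:

```python
def bucket_sizes(grad_bytes, cap=25 * 1024 * 1024):
    """Greedy DDP-style coalescing: fill a bucket until adding the next
    tensor would exceed the cap, then start a new one."""
    buckets, cur = [], 0
    for n in grad_bytes:
        if cur and cur + n > cap:
            buckets.append(cur)
            cur = 0
        cur += n
    if cur:
        buckets.append(cur)
    return buckets

# Hypothetical gradient shapes: 200 tiny per-layer tensors plus 4 big ones.
grads = [4096] * 200 + [8 * 1024 * 1024] * 4
CROSSOVER = 256 * 1024

print(all(g < CROSSOVER for g in grads[:200]))           # True: each small call alone is tree-bound
print(all(b >= CROSSOVER for b in bucket_sizes(grads)))  # True: every bucket lands ring-bound
```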

Why this matters for training

The crossover decides whether your last-layer all-reduce, often a small latency-bound call, costs roughly 100 µs (tree) or 500 µs (ring forced past its sweet spot) per step. At 100k training steps per epoch, that 400 µs gap is a 40-second swing per epoch. On a multi-week pretrain that is real time, real money, and real opportunity cost; it shows up in throughput numbers but not in any model code review.
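The arithmetic behind that swing, using the figures from the paragraph above:

```python
tree_s, ring_s = 100e-6, 500e-6      # per-step cost of the small call (figures from the text)
steps_per_epoch = 100_000

gap_s = (ring_s - tree_s) * steps_per_epoch
print(f"{gap_s:.0f} s lost per epoch")  # 40 s lost per epoch
```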

The takeaway is that the α + β·N model is not academic. It is the cost function NCCL's tuner evaluates millions of times per training run, and the crossover it produces is what determines whether a given call lives in the latency-bound regime or the bandwidth-bound regime. Knowing which side of the line your messages fall on is how you debug communication time without throwing environment variables at the wall.


Updated 2026-05-10