NCCL All-Reduce: Ring vs Tree
All-reduce is the single most expensive collective in synchronous data-parallel training. Every step, every GPU, every gradient tensor. The algorithm NCCL picks for you often matters more than the model code you wrote.
The two algorithms NCCL ships
Ring all-reduce arranges P GPUs in a logical ring and chunks the payload into P pieces. Each GPU sends one chunk to its ring neighbor and receives another at every step, first in a reduce-scatter phase that accumulates the sums and then in an all-gather phase that distributes them. Tree all-reduce builds a logical binary tree across the same GPUs, reduces up to a root, and broadcasts back down. NCCL also has CollNet (in-network reduction on supported NVSwitch and InfiniBand hardware) and a few specialized variants, but for an off-the-shelf cluster running PyTorch DDP or Megatron-LM, the runtime choice is almost always ring vs tree.
The decision is per call. Same job, same GPUs, same communicator: a 4 MiB gradient bucket might go ring, the next 8 KiB control message goes tree, the next 64 MiB FSDP all-gather goes ring again. NCCL picks based on message size, channel count, and the inferred topology (NVLink fabric inside a node, InfiniBand or RoCE between nodes).
Ring: bandwidth-optimal at scale
The ring algorithm is bandwidth-optimal in a precise sense: it hits the lower bound for what every GPU has to put on the wire. Each GPU sends and receives exactly 2(P-1)/P × N bytes over the course of one all-reduce of size N, where P is the world size. As P grows that total approaches 2N from below, which matches the trivial bound that every byte must be summed (one pass) and then redistributed (a second pass). You cannot do an all-reduce in fewer bytes per GPU than ring does, modulo CollNet's in-network tricks.
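To make the chunked reduce-scatter/all-gather structure and that byte count concrete, here is a small pure-Python simulation of one ring all-reduce. The world size, element size, and payload length are arbitrary assumptions, and real NCCL pipelines chunks across multiple channels rather than stepping all ranks in lockstep like this; the point is only to show the two phases and verify the 2(P-1)/P × N per-GPU traffic.

```python
# Toy simulation of ring all-reduce on Python lists (illustrative only:
# real NCCL pipelines chunks across multiple channels in parallel).
P = 8                       # world size (assumed)
ELEM = 4                    # bytes per element, e.g. fp32 (assumed)
N_ELEMS = 1024 * P          # payload length per rank, divisible by P for simplicity

bufs = [[float(rank)] * N_ELEMS for rank in range(P)]   # rank r starts with all r's
chunk = N_ELEMS // P
sent_bytes = [0] * P

def sl(c):                  # index range of chunk c
    return slice(c * chunk, (c + 1) * chunk)

# Phase 1: reduce-scatter. After P-1 steps rank r holds the fully reduced
# chunk (r+1) mod P.
for step in range(P - 1):
    for rank in range(P):
        dst = (rank + 1) % P
        c = (rank - step) % P               # chunk this rank forwards at this step
        s = sl(c)
        for i in range(s.start, s.stop):
            bufs[dst][i] += bufs[rank][i]   # receiver reduces into its buffer
        sent_bytes[rank] += chunk * ELEM

# Phase 2: all-gather. P-1 more steps circulate the reduced chunks as copies.
for step in range(P - 1):
    for rank in range(P):
        dst = (rank + 1) % P
        c = (rank + 1 - step) % P           # chunk this rank currently owns
        s = sl(c)
        bufs[dst][s] = bufs[rank][s]
        sent_bytes[rank] += chunk * ELEM

expected = float(sum(range(P)))             # every element must equal 0+1+...+(P-1)
assert all(x == expected for b in bufs for x in b)

N = N_ELEMS * ELEM
print(f"bytes sent per GPU: {sent_bytes[0]}  ==  2(P-1)/P * N = {2 * (P - 1) * N // P}")
```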
That bandwidth efficiency is why ring dominates large messages. On an HGX H100 node, NVLink + NVSwitch gives each GPU 900 GB/s of bidirectional fabric bandwidth. A ring all-reduce of a 256 MiB tensor across 8 H100s pushes 2(P-1)/P × 256 MiB ≈ 448 MiB through each GPU's fabric port, and at 900 GB/s of available bandwidth that step finishes in under a millisecond. "Near-line-rate" is not marketing; if you trace it with NCCL's profiler you will see effective bandwidth in the 700 to 800 GB/s range on a tuned node, which is the line rate minus protocol and dependency overhead.
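The same arithmetic, spelled out; 900 GB/s is the advertised figure, and the 750 GB/s effective rate is an assumed midpoint of the 700 to 800 GB/s range quoted above.

```python
P = 8
N = 256 * 2**20                  # 256 MiB payload
wire_bytes = 2 * (P - 1) / P * N # bytes through each GPU's fabric port

peak_bw = 900e9                  # B/s, advertised NVLink + NVSwitch per GPU
eff_bw = 750e9                   # B/s, assumed achievable after protocol overhead

print(f"per-GPU traffic: {wire_bytes / 2**20:.0f} MiB")           # ~448 MiB
print(f"at peak bw:      {wire_bytes / peak_bw * 1e3:.2f} ms")    # ~0.52 ms
print(f"at effective bw: {wire_bytes / eff_bw * 1e3:.2f} ms")     # ~0.63 ms
```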
The cost ring pays is hop count. A ring all-reduce takes 2(P-1) sequential steps, and each step costs at least one neighbor-to-neighbor transfer, so the latency floor scales linearly with P. For 8 GPUs that is 14 hops; for 64 GPUs it is 126; for a thousand-GPU multi-node ring it is nearly 2,000, and the accumulated per-hop latency runs into milliseconds before the bandwidth term even matters. The latency cost compounds with P, so ring wins exactly when the message is large enough that the bandwidth term dominates the latency term.
Tree: latency-optimal for small messages
Tree all-reduce trades bandwidth efficiency for hop count. A binary tree of P GPUs has depth log₂(P), and one all-reduce traverses up and back down: ~2·log₂(P) hops. For 8 GPUs that is 6 hops vs ring's 14; for 1024 GPUs it is 20 hops vs ring's 2046. The depth scales sub-linearly while ring's depth scales linearly, so as P grows the latency gap widens dramatically.
The price is bandwidth. Ring's bandwidth term is 2(P-1)/P × N bytes, asymptotically 2N. Tree's is 2·log₂(P) × N, because each of the log₂(P) levels, up and then back down, carries the full payload. The ratio is roughly log₂(P): at P=8 tree pays ~3× ring's bandwidth cost; at P=1024 it pays ~10×. On large messages this gap is brutal, and tree falls behind ring quickly. On small messages it does not matter, because small messages are dominated by the per-hop latency cost.
The mental model NCCL uses internally is the standard α + β·N latency-bandwidth model: α is the per-hop latency, β is the inverse bandwidth, N is the message size. Ring's total cost is roughly 2(P-1)·α + 2(P-1)/P · β·N, tree's is roughly 2·log₂(P)·α + 2·log₂(P) · β·N. The α term favors tree, the β term favors ring. Where they cross depends on the actual α and β of the fabric you are running on. The deeper derivation lives in tree-reduce latency.
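Transcribing those two formulas directly makes the crossover visible. The α and β below are illustrative placeholders, not values NCCL publishes; α is picked so the crossover lands near the ~256 KiB figure discussed below, and the real tuner also folds in protocol choice and channel count.

```python
import math

def ring_cost(alpha, beta, P, N):
    # 2(P-1) sequential hops; 2(P-1)/P * N bytes through each GPU
    return 2 * (P - 1) * alpha + 2 * (P - 1) / P * beta * N

def tree_cost(alpha, beta, P, N):
    # ~2*log2(P) hops up and back down, each carrying the full payload in this model
    depth = 2 * math.log2(P)
    return depth * alpha + depth * beta * N

# Illustrative fabric parameters (assumed, not NCCL's internal numbers):
alpha = 0.3e-6       # 0.3 us per hop, NVLink-class
beta = 1 / 450e9     # seconds per byte at 450 GB/s per direction

P = 8
for size in (8 * 2**10, 64 * 2**10, 256 * 2**10, 1 * 2**20, 64 * 2**20):
    r, t = ring_cost(alpha, beta, P, size), tree_cost(alpha, beta, P, size)
    winner = "ring" if r < t else "tree"
    print(f"{size / 2**10:8.0f} KiB   ring {r * 1e6:7.1f} us   tree {t * 1e6:7.1f} us   -> {winner}")
```

With these numbers the small sizes go tree, everything from roughly a quarter MiB up goes ring, which is the shape of the decision the next section describes.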
Why NCCL picks per call
NCCL's runtime tuner (the cost model in tuning.cc plus topology-aware overrides in graph/tuning.cc) plugs measured α and β values for the detected fabric into both formulas, evaluates them at the message size of the current call, and picks the cheaper one. On an 8x H100 NVSwitch node the crossover sits near ~256 KiB: below that, tree wins; above, ring wins. The exact number shifts with channel count (more channels means more parallel rings, which lowers ring's effective β and pushes the crossover up), with NVL72 (a 72-way NVLink domain has different α and β than 8-way HGX, so the crossover moves), and with multi-node fabrics (InfiniBand NDR has higher α than NVLink, so the crossover sits at a larger message size when the call has to leave the node).
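Setting the two cost formulas from the sketch above equal and solving for N shows how the crossover moves with the fabric. Both scenarios below use assumed α and β values (NVLink-class vs InfiniBand-class), chosen only to illustrate the direction of the shift, not to match any particular cluster.

```python
import math

def crossover_bytes(alpha, beta, P):
    # Message size where the ring and tree cost formulas above intersect.
    lat_gap = 2 * (P - 1) - 2 * math.log2(P)       # extra hops ring pays
    bw_gap = 2 * math.log2(P) - 2 * (P - 1) / P    # extra bandwidth tree pays
    return alpha / beta * lat_gap / bw_gap

# Scenario A (assumed): single HGX node, NVLink-class latency, P = 8
print(f"{crossover_bytes(0.3e-6, 1 / 450e9, 8) / 2**10:.0f} KiB")   # ~248 KiB

# Scenario B (assumed): multi-node over InfiniBand, higher alpha, lower bandwidth, P = 64
print(f"{crossover_bytes(5e-6, 1 / 50e9, 64) / 2**20:.1f} MiB")     # ~2.7 MiB
```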
Setting NCCL_DEBUG=INFO surfaces which algorithm and protocol NCCL picked for each call. The relevant log lines name the algorithm and protocol explicitly (Ring or Tree, with the protocol enum LL, LL128, or Simple), and reading a few thousand of them across a real training step is the only honest way to know what your job is actually doing. The tuner is usually right; when it is wrong, you can override with NCCL_ALGO=Ring or NCCL_ALGO=Tree, but you should measure first.
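A minimal sweep to see where your own fabric lands looks roughly like this; launch it under torchrun with NCCL_DEBUG=INFO in the environment and correlate the INFO lines with the timings. The size list and iteration counts are arbitrary choices, not anything NCCL or PyTorch prescribes.

```python
import os
import torch
import torch.distributed as dist

# Launch with:  NCCL_DEBUG=INFO torchrun --nproc_per_node=8 sweep.py
# The INFO lines name the algorithm (Ring/Tree) and protocol (LL/LL128/Simple);
# the timings below show which size regime each call lands in.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
rank = dist.get_rank()

for size_bytes in (8 * 2**10, 256 * 2**10, 4 * 2**20, 64 * 2**20):
    x = torch.ones(size_bytes // 4, dtype=torch.float32, device="cuda")
    for _ in range(5):                      # warm-up; lets NCCL build its channels
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        dist.all_reduce(x)
    end.record()
    torch.cuda.synchronize()

    if rank == 0:
        us = start.elapsed_time(end) / 20 * 1000   # elapsed_time returns milliseconds
        print(f"{size_bytes / 2**10:10.0f} KiB   {us:10.1f} us per all-reduce")

dist.destroy_process_group()
```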
What this means in practice
- Operators cannot safely override NCCL's choice without running real all-reduce benchmarks on the actual fabric. The right default is to leave the tuner alone and verify with NCCL_DEBUG=INFO that it is making sensible picks.
- Frameworks that bucket gradients (PyTorch DDP's default 25 MB bucket, Megatron's gradient accumulation buffers) push every all-reduce comfortably above the ring crossover. This is one of the reasons gradient bucketing is the single biggest knob on data-parallel throughput (see the sketch after this list).
- FSDP changes the regime. Reduce-scatter halves the message size relative to all-reduce, which can drag a call back below the ring crossover and flip NCCL into tree without any change on your side. If you migrate from DDP to FSDP and your communication time gets weird, the algorithm choice is one of the first things to check. See all-gather vs reduce-scatter for the decomposition that drives this.
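On the framework side, the bucketing knob in question is DDP's bucket_cap_mb. A minimal sketch of what it means for the regime NCCL sees; the model and sizes are placeholders, and shrinking the bucket is shown only to illustrate the effect, not as a recommendation.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group is already initialized and the CUDA device is set,
# e.g. by the sweep script above. bucket_cap_mb is DDP's gradient-bucket knob;
# its 25 MB default sits far above the ~256 KiB crossover discussed earlier,
# so every bucketed gradient all-reduce lands squarely in ring territory.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda()
ddp_default = DDP(model, bucket_cap_mb=25)

# Shrinking the bucket (illustration only, not a recommendation) pushes the
# individual all-reduces toward the small-message regime where tree can win.
ddp_tiny = DDP(nn.Sequential(nn.Linear(512, 512)).cuda(), bucket_cap_mb=1)
```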
The takeaway: ring is what makes large-batch synchronous training tractable; tree is what keeps the small messages from dominating the wall clock. NCCL picks per call so you do not have to. Your job is to understand the regime your framework is putting NCCL into, not to micromanage the algorithm.
Updated 2026-05-10