Rail-Optimized Fat-Tree
A fat-tree is a Clos network with enough uplink bandwidth at every layer that no single hop becomes the bottleneck. Adding "rail-optimized" on top of that reorganizes how GPUs map to leaf switches, so that the collectives a real training job runs do not fight each other for bandwidth.
The cabling rule
In a generic fat-tree, every node has multiple IB ports and they all connect to the same set of leaf switches in some round-robin pattern. In a rail-optimized fat-tree, each GPU's IB port connects to a specific leaf, and the same GPU index across every node always lands on the same leaf. So GPU 0 from node 0, GPU 0 from node 1, GPU 0 from node 2, ..., all land on leaf 0. GPU 1 from every node lands on leaf 1. And so on, up to GPU 7 to leaf 7.
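As a concrete sketch, here is the rule expressed as a mapping (the function and variable names are illustrative, not from any real tool):

```python
# Minimal sketch of the rail-optimized cabling rule (illustrative names).
GPUS_PER_NODE = 8
NUM_NODES = 32

def leaf_for(node: int, gpu: int) -> int:
    """Which leaf a GPU's IB port plugs into: the GPU index alone decides."""
    return gpu  # GPU 0 -> leaf 0, GPU 1 -> leaf 1, ..., GPU 7 -> leaf 7

# Rail g is the set of ports for GPU index g across all nodes. Every one
# of them lands on leaf g, so distinct rails never share a leaf.
for g in range(GPUS_PER_NODE):
    assert all(leaf_for(node, g) == g for node in range(NUM_NODES))
```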
The result is that there are 8 parallel "rails" through the fabric, one per GPU index. Each rail has its own leaf switch, its own uplinks to the spine, and its own slice of the bandwidth. Two collectives running on different rails (e.g., rail 0 and rail 1 each running its own data-parallel all-reduce) do not contend for any switch port. They are physically separated by the cabling.
Why this matters for collectives
A typical training job has GPUs partitioned into groups, and each group runs its own collective. Data parallelism partitions across the rail dimension: rank 0 of every DP group sits on GPU 0 of its node, rank 1 sits on GPU 1, and so on. With rail-optimized topology, every DP group's all-reduce uses exactly one rail, and the 8 DP groups (one per rail) run in parallel without contention.
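A sketch of that partitioning, assuming 8-GPU nodes (the group construction here is hypothetical, not any framework's API):

```python
# Hypothetical sketch: DP group g holds GPU g of every node.
NUM_NODES = 32
GPUS_PER_NODE = 8

dp_groups = {g: [(node, g) for node in range(NUM_NODES)]
             for g in range(GPUS_PER_NODE)}

def leaf(node: int, gpu: int) -> int:
    return gpu  # the rail-optimized cabling rule from above

# Each group's all-reduce touches exactly one leaf (its rail), and the
# 8 groups touch 8 disjoint leaves, so they run without contention.
touched = {g: {leaf(n, gpu) for n, gpu in members}
           for g, members in dp_groups.items()}
assert all(touched[g] == {g} for g in dp_groups)
```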
Compare this to a non-rail-optimized fabric: each node's 8 GPUs spray across all leaves, so the all-reduce of the DP group built from GPU 0 of every node might land on leaf 3 for some nodes and leaf 7 for others, depending on cabling. Different DP groups running concurrently can collide on the same leaf, contending for the same uplink bandwidth, and each all-reduce slows down by a factor that depends on how many groups happen to share a leaf.
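Running the same check against a sprayed layout makes the contention visible. The round-robin pattern below is one plausible generic cabling, purely for illustration:

```python
# One plausible round-robin spray (illustrative, not a specific vendor layout).
NUM_NODES = 32
GPUS_PER_NODE = NUM_LEAVES = 8

def sprayed_leaf(node: int, gpu: int) -> int:
    # Each node offsets its ports by its own index, so the leaf a given
    # GPU index lands on now depends on the node.
    return (gpu + node) % NUM_LEAVES

for g in range(GPUS_PER_NODE):
    touched = {sprayed_leaf(node, g) for node in range(NUM_NODES)}
    print(f"DP group {g} touches leaves {sorted(touched)}")
# Every group touches all 8 leaves, so 8 concurrent all-reduces contend
# on every leaf's uplinks instead of each group owning its own rail.
```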
The rail-optimized layout makes the topology match the algorithm. NCCL's tuner can then assume that data-parallel all-reduce sees the full per-rail bandwidth without contention, and that assumption holds in practice as long as the cabling is correct.
1:1 subscription is the rule
A rail-optimized fat-tree is typically built non-blocking, meaning the leaf-to-spine bandwidth equals the leaf-to-node bandwidth. If a leaf has 32 downlinks (one per node) at 400 Gb/s NDR, it has 32 uplinks at 400 Gb/s to the spine: a 1:1 subscription ratio, i.e., no oversubscription. Some clusters compromise to 2:1 (32 down, 16 up) to save on spine switch ports; this halves the bisection bandwidth and starts to bite on all-to-all collectives, but it is invisible to per-rail data-parallel all-reduce, whose traffic never leaves its leaf.
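The arithmetic, using this section's numbers (400 Gb/s NDR = 50 GB/s per port):

```python
# Back-of-envelope subscription math with the numbers from this section.
PORT_GB_S = 400 / 8          # NDR: 400 Gb/s per port = 50 GB/s
LEAVES = 8
DOWNLINKS_PER_LEAF = 32      # one per node; 1:1 means uplinks == downlinks

def bisection_tb_s(uplinks_per_leaf: int) -> float:
    """Total leaf-to-spine bandwidth across all rails, in TB/s."""
    return LEAVES * uplinks_per_leaf * PORT_GB_S / 1000

print(bisection_tb_s(32))    # 1:1, non-blocking:  12.8 TB/s
print(bisection_tb_s(16))    # 2:1 oversubscribed:  6.4 TB/s, half the bisection
```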
For tensor parallelism that is forced across nodes (i.e., TP groups bigger than what fits in one NVL72 or one HGX), rail optimization does not help directly: a TP group does not sit on a single rail; it spans multiple GPU indices across multiple nodes, and therefore multiple rails. For these patterns, all-to-all collectives do hit the spine and bisection bandwidth becomes the relevant metric. This is why the right placement strategy keeps TP inside an NVLink domain and DP across rails (see topology-aware placement).
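A placement sketch that encodes this rule: TP ranks are contiguous within a node (NVLink only), and DP peers sit at the same GPU index across nodes (one rail). The layout function is hypothetical, not any framework's API:

```python
# Hypothetical rank layout: TP inside the node, DP across nodes on one rail.
NUM_NODES = 32
TP = 8                        # tensor-parallel degree; fits one HGX node

def placement(rank: int) -> tuple[int, int]:
    """Global rank -> (node, gpu). TP ranks are contiguous within a node."""
    return divmod(rank, TP)

node, gpu = placement(42)     # -> node 5, GPU 2
tp_peers = [node * TP + g for g in range(TP)]        # same node: NVLink only
dp_peers = [n * TP + gpu for n in range(NUM_NODES)]  # same GPU index: rail 2
assert all(placement(p)[0] == node for p in tp_peers)
assert all(placement(p)[1] == gpu for p in dp_peers)
```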
What this means in practice
- The cabling is the load-bearing piece. Rail-optimization is a wiring convention, not a switch feature. A leaf switch does not know it is rail-optimized; it just sees its assigned ports. Get the cabling wrong (one node's GPU 3 plugged into leaf 5 by mistake) and you have a non-rail-optimized fabric with mostly correct intent; a minimal audit sketch follows this list.
- The benefit is concrete: concurrent data-parallel all-reduces across rails run at the full per-rail line rate, with no contention. This is the topology assumption every modern cluster design starts from.
- The benefit applies most cleanly to data-parallel patterns. Tensor-parallel and expert-parallel patterns that span multiple rails across multiple nodes still cross the spine, and bisection bandwidth limits apply.
- For a 32-node H100 cluster with NDR: 8 leaves with 32 downlinks and 32 uplinks each, and a spine layer with 8 x 32 = 256 downlinks. The bisection bandwidth of this rail-optimized fat-tree is 8 leaves x 32 uplinks x 50 GB/s = 12.8 TB/s; that is the figure of merit for any all-to-all collective.
- For procurement: a rail-optimized layout costs the same as a non-rail-optimized one (same number of switches, same number of cables); it just requires consistent cabling. The benefit is operational predictability of all-reduce performance, which is why most cluster designs make rail optimization a hard requirement rather than an option.
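Because the wiring convention is the whole mechanism, it is worth auditing. A minimal sketch, assuming you can query which leaf each port is attached to; the `attached_leaf` callback is a stand-in for whatever your site uses (LLDP data, `ibnetdiscover` output, or an inventory database):

```python
# Hypothetical cabling audit: flag every port that breaks the rail rule.
def audit(num_nodes: int, gpus_per_node: int, attached_leaf) -> list[str]:
    errors = []
    for node in range(num_nodes):
        for gpu in range(gpus_per_node):
            leaf = attached_leaf(node, gpu)
            if leaf != gpu:  # rail rule: GPU index g must land on leaf g
                errors.append(
                    f"node {node} GPU {gpu}: on leaf {leaf}, expected {gpu}")
    return errors

# Example: one miswired port, as in the bullet above.
wiring = {(3, 3): 5}  # some node's GPU 3 plugged into leaf 5 by mistake
print(audit(32, 8, lambda n, g: wiring.get((n, g), g)))
# -> ['node 3 GPU 3: on leaf 5, expected 3']
```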
The clever part is that the wires care about the algorithm. Rail optimization is a wiring decision that bakes the data-parallel-all-reduce assumption into the physical fabric.
Updated 2026-05-10