
Bisection Bandwidth

Bisection bandwidth is the total bandwidth between the two halves of the fabric under the worst-case equal split, and it is the figure of merit for all-to-all traffic. A non-blocking fat-tree has bisection ~= N/2 x line rate.

Definition: minimum cut between any two equal halves
Non-blocking ideal: N/2 x line rate
1024-GPU H100 cluster: ~25.6 TB/s with NDR, rail-optimized

If you can summarize a fabric in one number, that number is the bisection bandwidth. It tells you how much traffic the fabric can carry between any two halves of itself, which is the figure of merit for the collectives that are hardest to satisfy: the all-to-all patterns that mixture-of-experts, tensor-parallelism-across-nodes, and reshard operations use.

What it actually measures

Pick any way to split the cluster into two equal halves (any "bisection"). Count the bandwidth of every link that crosses the cut. Sum it. Now repeat over all possible cuts and take the minimum. That minimum is the bisection bandwidth.
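To make the definition concrete, here is a brute-force Python sketch. The topology is a made-up 4-node toy, purely illustrative; real fabrics are far too large to enumerate this way, but the quantity computed is exactly the one defined above.

    from itertools import combinations

    # Toy topology (hypothetical): a 4-node ring of 50 GB/s links.
    links = {("n0", "n1"): 50, ("n1", "n2"): 50,
             ("n2", "n3"): 50, ("n3", "n0"): 50}
    nodes = sorted({n for link in links for n in link})

    def bisection_bw(nodes, links):
        """Min over all equal splits of the summed crossing-link capacity."""
        best = float("inf")
        for left in combinations(nodes, len(nodes) // 2):
            left = set(left)
            # Sum the capacity of every link with exactly one endpoint in `left`.
            crossing = sum(cap for (a, b), cap in links.items()
                           if (a in left) != (b in left))
            best = min(best, crossing)
        return best

    print(bisection_bw(nodes, links))  # 100: worst cut crosses two 50 GB/s links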

[Figure: leaf-spine fabric with the worst-case cut drawn between two halves. Bisection BW = sum of crossing-link rates; the worst-case cut between any two halves determines the all-to-all collective ceiling. Example: 1024-GPU H100 NDR rail-optimized fat-tree, 8 rails x 64 crossing links x 50 GB/s ~ 25.6 TB/s.]

The pessimistic-cut definition matters because the worst-case cut is what bounds the slowest all-to-all collective. If the algorithm needs every GPU on the left half to send data to every GPU on the right half, the total cross-cut traffic is half the fabric's endpoints sending to the other half, and the per-second cap is set by exactly this minimum-cut sum. No algorithmic trick gets you past it.
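A quick worked example of that ceiling, with a hypothetical per-pair message size:

    # Hypothetical sizes: 4 MB from each left-half GPU to each right-half GPU.
    n, per_pair_bytes, bisection_Bps = 1024, 4e6, 25.6e12
    cross_bytes = (n / 2) * (n / 2) * per_pair_bytes  # ~1.05e12 B left -> right
    print(cross_bytes / bisection_Bps)                # ~0.041 s: a hard floor
                                                      # no schedule can beat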

The non-blocking ideal

A perfectly non-blocking fat-tree (1:1 oversubscription at every level) has bisection bandwidth equal to N/2 x per-port line rate, where N is the total number of network endpoints (one IB port per GPU on a typical AI cluster). The intuition is that in a non-blocking topology, every endpoint can saturate its own port, and half of them are sending to the other half, so the cross-cut total equals N/2 ports times line rate.

For a 1024-GPU H100 cluster on NDR (50 GB/s per port, 8 IB ports per node, rail-optimized fat-tree): each of the 8 rails is a 128-port subgraph, so the bisection of one rail is 64 x 50 GB/s = 3.2 TB/s. With 8 parallel rails, the aggregate cross-cut bandwidth is 8 x 3.2 TB/s = 25.6 TB/s, which matches the endpoint bound of N/2 x line rate = 512 x 50 GB/s. The true bisection is the smaller of the endpoint bound and the spine capacity; for a clean rail-optimized design with 1:1 oversubscription the two coincide, and the example cluster comes out to roughly 25.6 TB/s.
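The same number falls out whichever way you slice it; a quick check with the figures above:

    n_ports, line_rate = 1024, 50    # one NDR port per GPU, GB/s per port
    n_rails = 8                      # rail-optimized: 8 parallel fat-trees

    ideal = n_ports // 2 * line_rate                  # N/2 x line rate
    per_rail = (n_ports // n_rails) // 2 * line_rate  # 64 x 50 GB/s per rail
    print(ideal, n_rails * per_rail)                  # 25600 25600 -> 25.6 TB/s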

Why oversubscription cuts directly into this

Many real clusters compromise to 2:1 oversubscription leaf-to-spine to save on spine switch costs. That halves the bisection bandwidth, since the spine uplinks become the binding constraint: half the uplinks means half the cross-cut sum, 12.8 TB/s instead of 25.6 TB/s for the example cluster. The fabric still carries data-parallel all-reduce at full per-rail speed (because DP collectives travel within a rail and never cross the spine), but any all-to-all that does cross the spine is now capped at 12.8 TB/s. Whether that matters depends entirely on what fraction of your workload runs all-to-all collectives versus rail-local collectives.
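As a sketch, the back-of-envelope formula from the practice notes below, parameterized by oversubscription:

    def bisection_tbps(n_gpus, line_rate_gbps=50, oversub=1.0):
        """(N/2) x line rate / oversubscription ratio, in TB/s."""
        return n_gpus / 2 * line_rate_gbps / oversub / 1000

    print(bisection_tbps(1024, oversub=1))  # 25.6 TB/s non-blocking
    print(bisection_tbps(1024, oversub=2))  # 12.8 TB/s at 2:1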

What workloads bind on bisection

The bisection ceiling matters most for:

  • All-to-all collectives. MoE expert routing (every token to its expert, expert results back to the original GPU) is bisection-bound. So is dist.all_to_all in PyTorch when the data has to redistribute across all GPUs (a minimal probe for this pattern is sketched after this list).
  • Tensor parallelism that spans nodes. When TP forces an all-gather or reduce-scatter across IB (because the TP group is bigger than what fits in one NVL72), the collective needs cross-cut bandwidth proportional to the group size.
  • Resharding operations. FSDP's all-gather of parameters before forward, when it spans nodes, is essentially an all-to-all-like pattern across the bisection.
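A minimal sketch of such a probe with torch.distributed, assuming a torchrun launch with the NCCL backend and one GPU per rank; the buffer size is an arbitrary choice:

    import os, time
    import torch
    import torch.distributed as dist

    dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    chunk = 256 * 1024                   # fp32 elements per peer (1 MiB each)
    send = torch.randn(world * chunk, device="cuda")
    recv = torch.empty_like(send)

    dist.barrier()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    dist.all_to_all_single(recv, send)   # each rank trades a chunk with every peer
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0

    sent = (world - 1) * chunk * 4       # bytes leaving this rank
    if rank == 0:
        print(f"per-rank all-to-all bandwidth ~ {sent / dt / 1e9:.1f} GB/s")
    dist.destroy_process_group()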

For pure data-parallel training where DP groups stay within a rail, bisection is not the binding constraint; per-rail bandwidth is. This is why rail-optimization is so popular: it pushes the binding constraint into the workload-shape question rather than into a per-link bandwidth ceiling.

What this means in practice

  • Quote bisection in TB/s, not in port count or "non-blocking yes/no". The latter loses information. "12.8 TB/s bisection" is unambiguous; "non-blocking" can mean 1:1 or 2:1 depending on whose marketing slide it came from.
  • For a back-of-envelope: bisection = (N_GPUs / 2) x line_rate / oversubscription_ratio, where oversubscription_ratio is 1, 2, or 4 depending on the design.
  • For procurement: ask what fraction of workloads are all-to-all-bound. If under 20%, 2:1 oversubscription is fine. If over 40%, 1:1 is required and 25.6 TB/s for 1024 GPUs is the H100-NDR floor.
  • For framework choice: collective libraries such as NCCL tune their algorithm choice to the bandwidth the fabric actually exposes. Larger bisection lets you run more aggressive parallelism strategies (tensor parallel across nodes, expert parallel) without falling off a bandwidth cliff.
  • For topology-aware placement: the goal is to keep workloads on the same side of the worst-case cut whenever the algorithm permits. Bisection is the budget; placement is how you stay inside it.

Bisection is the one number that summarizes what the fabric can actually deliver for the hardest collectives. Everything else is either an implementation detail or a special case.


Updated 2026-05-10