NVLink and NVSwitch Topology
A pile of NVLink lanes does not by itself make a fabric. To get from "each GPU has 900 GB/s of NVLink" to "any GPU can talk to any other GPU at 900 GB/s", something has to switch the lanes. That something is the NVSwitch, and the way it is wired into the box is the difference between a bag of GPUs and a single shared-memory domain.
The crossbar inside the node
NVSwitch is a non-blocking crossbar ASIC. Inside an HGX H100 baseboard, four NVSwitches sit on the same PCB as the 8 GPUs. Every GPU's 18 NVLink links are split across the four switches: each GPU sends a few links to each switch, and each switch terminates links from all 8 GPUs. The result is that no matter which GPU pair you pick, there is a path through the switch fabric that carries the full 900 GB/s of NVLink bandwidth without contention from any other pair.
This property is what "non-blocking" means in the topology sense. A non-blocking switch can carry every input port at full rate to every output port simultaneously, as long as no two inputs are trying to reach the same output. For an 8-GPU node, this means all 28 unordered GPU pairs (8 choose 2) can be communicating at full rate at the same time, with the switch silicon arbitrating the lane assignments. NVIDIA's spec sheet number, 7.2 TB/s aggregate fabric bandwidth, is just 8 GPUs x 900 GB/s, with the implicit promise that the switches can actually deliver that aggregate without internal blocking.
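The spec-sheet arithmetic is short enough to check in a few lines. A minimal Python sketch, using only the H100 figures quoted above:

```python
from math import comb

# H100 figures quoted above: 900 GB/s of NVLink per GPU, 8 GPUs per HGX baseboard.
NVLINK_BW_PER_GPU_GBS = 900
GPUS_PER_NODE = 8

# Aggregate fabric bandwidth is just the per-GPU number summed over the node;
# the non-blocking switches are the promise that it can all be used at once.
aggregate_tbs = GPUS_PER_NODE * NVLINK_BW_PER_GPU_GBS / 1000
print(f"aggregate fabric bandwidth: {aggregate_tbs:.1f} TB/s")  # 7.2 TB/s

# Number of unordered GPU pairs that can all run at full rate simultaneously.
print(f"concurrent GPU pairs: {comb(GPUS_PER_NODE, 2)}")        # 28
```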
Why this changed how clusters are built
Before NVSwitch (it first shipped in the V100-era DGX-2; the 8-GPU V100 boards before it had none), GPUs talked to each other through a partial NVLink mesh. Some GPU pairs had a direct link, some had two-hop paths through an intermediate GPU, and some had to fall back to PCIe. Bandwidth depended on which pair you picked, so frameworks had to be topology-aware at the per-link level to avoid the slow paths.
NVSwitch erased that. From the framework's perspective, every GPU pair inside an HGX node looks the same. A ring all-reduce does not care whether GPU 0 and GPU 7 are physically far apart on the baseboard; the ring can be ordered any way and hit the same bandwidth. Tensor parallelism does not need to know whether its 4-GPU group sits on switches 0-1 or switches 2-3. This regularity is why a library like NCCL can run a clean ring algorithm on a node and assume it is bandwidth-bound by NVLink, not by topology accidents.
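One way to see this uniformity on a live node is to ask the driver. The sketch below uses pynvml (the nvidia-ml-py package, assumed installed) to count active NVLink links per GPU; on an HGX H100 every GPU should report the same count, and nvidia-smi topo -m tells the same story with an NV18 entry for every GPU pair. The loop bound and error handling are best-effort assumptions about the NVML version in use:

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    active = 0
    # NVML_NVLINK_MAX_LINKS is an upper bound; probe each link and stop on error.
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                active += 1
        except pynvml.NVMLError:
            break
    print(f"GPU {i}: {active} active NVLink links")  # expect 18 per GPU on HGX H100
pynvml.nvmlShutdown()
```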
What scales beyond a single node
A single HGX H100 stops at 8 GPUs. To make a bigger NVLink domain, you have to push NVSwitch outside the box. NVIDIA's first-generation external NVSwitch (the H100-era SuperPOD with the NVLink Switch System) extended the domain to 256 GPUs by treating multiple NVSwitches as a multi-stage Clos network and routing NVLink packets across rack boundaries. The Blackwell generation pushes this further with GB200 NVL72, where an entire rack (72 GPUs across 18 compute trays plus 9 NVSwitch trays) is a single NVLink domain.
The topology trick stays the same as inside a single HGX node: at every layer, every input port has a non-blocking path to every output port. The wires get longer, the latency gets larger by a few microseconds, but the abstraction holds: any GPU sees any other GPU at NVLink rate, with the only cost being a slightly larger alpha in the latency-bandwidth model.
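The latency-bandwidth (alpha-beta) model this leans on is simple enough to write out. In the sketch below, the bandwidth is the H100 NVLink figure from above, while the alpha values are illustrative assumptions, not measured numbers:

```python
def transfer_time_s(nbytes, alpha_s, bw_bytes_per_s):
    # Alpha-beta model: fixed startup latency plus bytes divided by bandwidth.
    return alpha_s + nbytes / bw_bytes_per_s

NVLINK_BW = 900e9        # B/s, H100 per-GPU NVLink figure
ALPHA_INTRA_NODE = 2e-6  # assumed: on-board NVSwitch hop, a couple of microseconds
ALPHA_CROSS_RACK = 5e-6  # assumed: external NVLink switches add a few microseconds

for nbytes in (64 * 1024, 64 * 1024**2):  # 64 KiB vs 64 MiB messages
    t_in = transfer_time_s(nbytes, ALPHA_INTRA_NODE, NVLINK_BW)
    t_out = transfer_time_s(nbytes, ALPHA_CROSS_RACK, NVLINK_BW)
    print(f"{nbytes} bytes: intra-node {t_in * 1e6:.1f} us, cross-rack {t_out * 1e6:.1f} us")
# Small messages feel the larger alpha; large messages are bandwidth-bound either way.
```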
What this means in practice
- For collectives that fit inside one node, NVSwitch makes algorithm choice (ring vs tree, see NCCL ring vs tree) the only thing that matters. Topology accidents do not exist inside the node.
- Tensor parallelism groups should fit inside one NVSwitch domain. A TP=8 group on H100 lives inside one HGX node and runs at 7.2 TB/s aggregate. A TP=8 group split across two HGX nodes drops to InfiniBand bandwidth, roughly 18x slower than NVLink per GPU. This is the placement rule that topology-aware placement enforces (sketched in the example after this list).
- The H100-era external NVLink Switch System (256-GPU domains) and the Blackwell-era GB200 NVL72 (72 GPUs in one rack) are both attempts to push the "any-to-any" abstraction further out. They cost more per GPU, but they change which parallelism strategies are tractable.
- When debugging an unexpectedly slow all-reduce, the question to ask first is whether the GPUs are inside the same NVSwitch domain. If not, the bandwidth ceiling is set by the next tier (IB), not by NCCL.
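A sketch of the placement rule from the bullets above, assuming ranks are assigned contiguously so that rank // gpus_per_node gives the node index; the bandwidth numbers are the rough figures quoted earlier, not measurements:

```python
NVLINK_BW_GBS = 900  # per-GPU NVLink on H100
IB_BW_GBS = 50       # ~400 Gb/s InfiniBand per GPU, roughly 18x slower

def bandwidth_ceiling(tp_ranks, gpus_per_node=8):
    """Return the per-GPU bandwidth tier for a tensor-parallel group."""
    nodes = {rank // gpus_per_node for rank in tp_ranks}
    if len(nodes) == 1:
        return "NVLink", NVLINK_BW_GBS  # whole group inside one NVSwitch domain
    return "InfiniBand", IB_BW_GBS      # group straddles nodes; the next tier sets the ceiling

print(bandwidth_ceiling(range(0, 8)))   # ('NVLink', 900) -- fits in one HGX node
print(bandwidth_ceiling(range(4, 12)))  # ('InfiniBand', 50) -- straddles two nodes
```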
NVSwitch is the reason "8 H100s" is a single number on a spec sheet rather than a topology you have to solve for.
Updated 2026-05-10