
NVL72 Domain

GB200 NVL72 turns one rack into one NVLink domain: 72 GPUs, 13.5 TB pooled HBM3e, 130 TB/s aggregate fabric. Bigger than any single node has ever been.
Scale: 72 GPUs, 18 compute trays, 9 NVSwitch trays
Pooled HBM: 13.5 TB visible to any GPU
Fabric: 72 x 1.8 TB/s = 130 TB/s aggregate

For most of the modern GPU era, the box was the scaling unit: 8 GPUs per HGX baseboard, with racks of those boxes plugged together over InfiniBand. The H100-era NVLink Switch System hinted at something different (an external NVLink fabric spanning multiple HGX boxes), but it stayed a bolt-on. NVL72 makes the rack itself a single NVLink domain.

What is in the rack

A GB200 NVL72 is one rack, but the contents are unfamiliar. 18 compute trays, each holding 4 Blackwell GPUs (2 GB200 superchips, each with 2 GPUs and 1 Grace CPU). 9 NVSwitch trays interleaved with the compute trays. 72 GPUs total. Rather than each compute tray being a self-contained server with its own networking, every NVLink port from every GPU lands directly on the NVSwitch trays in the same rack, which together form a single non-blocking NVLink fabric.
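A quick back-of-envelope sketch of that inventory, using only the counts above and the 192 GB per-GPU HBM figure quoted later in this piece:

```python
# Back-of-envelope inventory of a GB200 NVL72 rack, using the counts above.
COMPUTE_TRAYS = 18
SUPERCHIPS_PER_TRAY = 2     # each GB200 superchip: 1 Grace CPU + 2 Blackwell GPUs
GPUS_PER_SUPERCHIP = 2
NVSWITCH_TRAYS = 9
HBM_PER_GPU_GB = 192        # per-GPU HBM3e, as quoted below

gpus = COMPUTE_TRAYS * SUPERCHIPS_PER_TRAY * GPUS_PER_SUPERCHIP   # 72
grace_cpus = COMPUTE_TRAYS * SUPERCHIPS_PER_TRAY                  # 36
pooled_hbm_gb = gpus * HBM_PER_GPU_GB                             # 13,824 GB

print(f"{gpus} GPUs, {grace_cpus} Grace CPUs, {NVSWITCH_TRAYS} NVSwitch trays")
print(f"pooled HBM: {pooled_hbm_gb} GB")
```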

[Rack diagram: 18 compute trays (4 GPUs each) interleaved with 9 NVSwitch trays, forming one NVLink domain. 72 x 1.8 TB/s = 130 TB/s aggregate; 13.5 TB pooled HBM3e visible to any GPU.]

The NVSwitch trays are the load-bearing piece. They terminate every NVLink5 lane from every GPU and switch them into the all-to-all fabric. From a GPU's perspective, every other GPU in the rack is one NVLink hop away, with essentially the same alpha and beta you would see between two GPUs on the same HGX baseboard. The cable run through a switch tray adds a few hundred nanoseconds of latency, but the throughput is unchanged.
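To make that concrete, here is a minimal alpha-beta sketch of a point-to-point transfer across the fabric. The latency and bandwidth values are illustrative assumptions, not measurements; the point is that the extra hop only matters for small messages.

```python
# Alpha-beta cost model: time = alpha + bytes / beta.
# All values below are assumptions for illustration only.
NVLINK5_BW = 900e9          # bytes/s per direction (half of 1.8 TB/s bidirectional)
ALPHA_BASE = 2.0e-6         # s, assumed same-baseboard hop latency
EXTRA_CABLE_HOP = 0.3e-6    # s, the "few hundred extra nanoseconds" for the cable run

def transfer_time(nbytes, alpha):
    return alpha + nbytes / NVLINK5_BW

for nbytes in (64 * 1024, 16 * 1024**2, 1024**3):
    same = transfer_time(nbytes, ALPHA_BASE)
    cross = transfer_time(nbytes, ALPHA_BASE + EXTRA_CABLE_HOP)
    print(f"{nbytes / 1024**2:8.2f} MiB: same-board {same * 1e6:8.1f} us, "
          f"through NVSwitch tray {cross * 1e6:8.1f} us")
```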

What 13.5 TB of pooled HBM lets you do

72 Blackwell GPUs at 192 GB of HBM3e each gives you 13,824 GB (13.5 TiB) of HBM in one NVLink domain. That is enough to hold a 1.8 trillion parameter model in FP8 weights with room for activations, optimizer state, and a generous KV cache for inference. For training, it changes which parallelism strategies are local-bandwidth bound versus inter-node bound: you can run tensor-parallel 72-way (or pipeline-parallel 18-way x tensor-parallel 4-way) entirely on NVLink, leaving InfiniBand to handle data parallelism between racks. For inference, 13.5 TB of fast memory lets you serve enormous context windows and large mixture-of-experts models without paying the IB tax on every token.
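A rough budget for the 1.8-trillion-parameter case. The split between activations, optimizer state, and KV cache is workload-dependent, so the numbers below are only there to show the headroom:

```python
# Rough memory budget for a 1.8T-parameter model in FP8 on the 13.5 TB pool.
PARAMS = 1.8e12
BYTES_PER_PARAM = 1            # FP8 weights
POOLED_HBM = 13.5e12           # bytes, the headline figure
N_GPUS = 72

weights = PARAMS * BYTES_PER_PARAM          # 1.8 TB of weights
headroom = POOLED_HBM - weights             # ~11.7 TB left over

print(f"weights:  {weights / 1e12:.1f} TB")
print(f"headroom: {headroom / 1e12:.1f} TB for activations, optimizer state, KV cache")
print(f"per GPU:  {weights / N_GPUS / 1e9:.0f} GB of weights under naive 72-way sharding")
```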

The 130 TB/s aggregate fabric is 72 x 1.8 TB/s. That is the same arithmetic as the NVSwitch crossbar (N GPUs x per-GPU NVLink BW), just with a different N. The non-blocking property is preserved, which means a 72-way all-reduce inside one rack runs at a beta set by NVLink5, not by IB.
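For the collective that matters most in training, the standard ring all-reduce bandwidth term is 2 x (N - 1) / N x S / B, where B is the per-GPU link bandwidth. A small sketch of what that looks like when B is NVLink5 inside the rack (the bandwidth value is an assumed usable figure, not a measurement):

```python
# Ring all-reduce time estimate: 2 * (N - 1) / N * size / per_gpu_bw.
def ring_allreduce_seconds(size_bytes, n_gpus, per_gpu_bw):
    return 2 * (n_gpus - 1) / n_gpus * size_bytes / per_gpu_bw

NVLINK5_BW = 900e9            # bytes/s per direction, assumed usable bandwidth
bucket = 8 * 1024**3          # an 8 GiB gradient bucket

t = ring_allreduce_seconds(bucket, 72, NVLINK5_BW)
print(f"72-way all-reduce of 8 GiB over NVLink: ~{t * 1e3:.1f} ms")
```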

Why the rack became the unit

Two trends made NVL72 inevitable. First, model sizes outgrew a single 8-GPU node years ago, and the gap between NVLink bandwidth and InfiniBand bandwidth widened with every generation. By the Blackwell era, NVLink5 is roughly 18x faster than per-port NDR IB. Crossing a node boundary in a TP group means paying that 18x penalty, which makes large TP groups untenable on traditional 8-GPU nodes.
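The ~18x figure is just the ratio of the two per-GPU link speeds, both counted bidirectionally at nominal line rate (real goodput is somewhat lower):

```python
# Where the ~18x gap comes from (nominal rates).
nvlink5_bidir = 1.8e12               # bytes/s per GPU, bidirectional
ndr_port_bidir = 2 * 400e9 / 8       # one 400 Gb/s NDR port -> 100 GB/s bidirectional

print(f"NVLink5 / NDR port: {nvlink5_bidir / ndr_port_bidir:.0f}x")   # 18x
```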

Second, liquid cooling. The thermal load of 72 Blackwell GPUs in one rack (~120 kW) is impossible with air cooling, but with direct liquid cooling it fits. Once you have committed to liquid in the rack, the marginal cost of adding more GPUs to the same NVLink fabric is small, and the marginal benefit (a bigger non-blocking domain) is large. NVL72 is the natural endpoint of those two trends: every GPU you can fit in a liquid-cooled rack lives in one NVLink domain.
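Roughly where the ~120 kW comes from, with the per-component numbers treated as assumptions rather than spec-sheet values:

```python
# Rough power accounting for one NVL72 rack. All per-part numbers below are
# assumptions for illustration; real budgets vary by configuration.
SUPERCHIPS = 36               # 18 trays x 2 GB200 superchips
SUPERCHIP_W = 2700            # assumed TDP: 1 Grace + 2 Blackwell GPUs
NVSWITCH_TRAY_W = 1000        # assumed per NVSwitch tray
OVERHEAD_W = 10000            # assumed: NICs, fans, pumps, conversion losses

total_w = SUPERCHIPS * SUPERCHIP_W + 9 * NVSWITCH_TRAY_W + OVERHEAD_W
print(f"~{total_w / 1000:.0f} kW")   # lands near the ~120 kW quoted above
```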

What this means in practice

  • TP and PP groups should fit inside one NVL72 whenever the model size allows. Crossing rack boundaries forces traffic onto the IB tier and cuts the bandwidth ceiling by roughly 18x.
  • For frameworks, the practical change is that world_size for the local NVLink communicator can now be 72 instead of 8 (see the sketch after this list). NCCL's tuner has to deal with longer rings, but the bandwidth term is still set by NVLink, so ring all-reduce stays the dominant algorithm above the latency crossover.
  • The cost of an NVL72 rack is not just the GPUs. It is the 9 NVSwitch trays plus the cooling infrastructure plus the power delivery. That cost concentration is the price you pay to have 72 GPUs see each other at NVLink speed.
  • For operations: a single NVL72 rack failing (PDU, CDU, or NVSwitch tray) is now a 72-GPU blast radius event, not an 8-GPU one. The drain-and-replace mechanics scale up accordingly.
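A minimal sketch of what that framework-level change can look like with torch.distributed's DeviceMesh, assuming the launcher numbers ranks so that each block of 72 consecutive ranks shares a rack. The layout is hypothetical, not a prescribed recipe:

```python
# Hypothetical parallelism layout for NVL72 racks: tensor parallelism stays
# inside one 72-GPU NVLink domain, data parallelism crosses racks over IB.
# Assumes dist.init_process_group() has already been called.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

GPUS_PER_RACK = 72  # the NVLink domain size discussed above

def build_mesh():
    world = dist.get_world_size()
    assert world % GPUS_PER_RACK == 0, "expect a whole number of NVL72 racks"
    n_racks = world // GPUS_PER_RACK
    # Shape (racks, gpus-per-rack): "dp" spans racks, "tp" stays on NVLink.
    mesh = init_device_mesh("cuda", (n_racks, GPUS_PER_RACK),
                            mesh_dim_names=("dp", "tp"))
    tp_group = mesh["tp"].get_group()  # 72-way communicator, NVLink only
    dp_group = mesh["dp"].get_group()  # one rank per rack, rides InfiniBand
    return dp_group, tp_group
```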

NVL72 is the first time a single contiguous NVLink domain has spanned a whole rack in a shipping product. It will not be the last.


Updated 2026-05-10