NVLink Bandwidth Math

NVLink bandwidth = links per GPU × per-link GB/s. H100 NVLink4: 18 × 50 = 900 GB/s. B200 NVLink5: 18 × 100 = 1.8 TB/s. Half is in, half is out.
H100 NVLink4: 18 links × 50 GB/s = 900 GB/s
B200 NVLink5: 18 links × 100 GB/s = 1.8 TB/s
Direction: bidirectional, half each way

When somebody quotes you a GPU's NVLink number, they are quoting one multiplication: links times per-link bandwidth. Everything else (NVSwitch fabric, NVL72 rack pooling, the topology of the all-reduce that rides on top) starts from this product.

The two factors

Every NVLink generation is defined by two numbers: the number of links the GPU exposes, and the bandwidth each link runs at. NVIDIA increments one or the other each generation. A100 (NVLink3) had 12 links per GPU at 50 GB/s each, for 600 GB/s. H100 (NVLink4) kept the per-link rate at 50 GB/s but raised the link count to 18, for 900 GB/s. B200 (NVLink5) held the link count at 18 and doubled per-link to 100 GB/s. The visible jump from H100 to B200 is one multiplication: 18 × 50 = 900 GB/s versus 18 × 100 = 1.8 TB/s.

A100 NVLink3: 12 links × 50 GB/s = 600 GB/s
H100 NVLink4: 18 links × 50 GB/s = 900 GB/s
B200 NVLink5: 18 links × 100 GB/s = 1.8 TB/s
Bidirectional aggregate: half goes out, half comes in.

That number is the bidirectional aggregate per GPU. Half of it goes outbound, half comes inbound. When a vendor data sheet quotes "1.8 TB/s NVLink", that is the sum of both directions. When you are sizing an all-reduce, the relevant number for the bytes per second a GPU can put on the wire is half of that, because all-reduce is a full-duplex operation: every byte sent is matched by a byte received.
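
In code, the whole table is one multiplication and one halving. A minimal sketch (Python; the dictionary and names are mine, the figures are the ones above):

    # Aggregate NVLink bandwidth per GPU = links x per-link GB/s (bidirectional).
    # Per-direction bandwidth, the number that matters for collective sizing,
    # is half of the aggregate.
    generations = {
        "A100 NVLink3": (12, 50),    # (links per GPU, per-link GB/s)
        "H100 NVLink4": (18, 50),
        "B200 NVLink5": (18, 100),
    }
    for name, (links, per_link) in generations.items():
        aggregate = links * per_link    # bidirectional aggregate, GB/s
        print(f"{name}: {aggregate} GB/s aggregate, {aggregate // 2} GB/s per direction")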

Why per-link bandwidth matters more than link count

Per-link bandwidth is the harder factor to grow. It rides on the SerDes (serializer/deserializer) generation, which moves on a multi-year cycle and is gated by signal integrity, not by chip area. NVLink3 built each link from four 50 Gbit/s NRZ lanes. NVLink4 moved to two 100 Gbit/s PAM4 lanes per link, where each symbol carries two bits, holding the link at 50 GB/s on half the pins. NVLink5 doubled the per-lane rate again to 200 Gbit/s, taking each link to 100 GB/s. Each of these upgrades changes the cost and complexity of every link in the fabric.
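
To see where a per-link number comes from, work up from the lane level. A sketch under the signaling rates cited above (the helper name is mine):

    # Per-link bandwidth = lanes per link x per-lane signaling rate, / 8 bits
    # per byte for one direction, x 2 for the bidirectional figure data sheets quote.
    def per_link_gb_s(lanes_per_link, lane_gbit_s):
        per_direction = lanes_per_link * lane_gbit_s / 8    # Gbit/s -> GB/s
        return 2 * per_direction                            # bidirectional

    print(per_link_gb_s(4, 50))     # NVLink3: 4 x 50G NRZ   -> 50.0 GB/s per link
    print(per_link_gb_s(2, 100))    # NVLink4: 2 x 100G PAM4 -> 50.0 GB/s per link
    print(per_link_gb_s(2, 200))    # NVLink5: 2 x 200G PAM4 -> 100.0 GB/s per link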

Link count, by contrast, is a packaging decision. NVIDIA can add more links to a GPU package, and did exactly that from A100 to H100 (12 to 18), but every link consumes die area, package pins, and switch-fabric ports on the other end. Since H100 the link count has been stable at 18, and the H100-to-B200 bandwidth growth came entirely from the per-lane SerDes improvement. This is the same pattern Ethernet and InfiniBand follow: 100G to 200G to 400G to 800G is mostly a per-lane story (25G NRZ lanes to 50G and then 100G PAM4 lanes), not a "more lanes" story.

Per-direction math for collective sizing

When a framework runs an all-reduce, NCCL's tuner models its beta (per-byte cost) against the per-direction bandwidth, not the bidirectional aggregate. On an H100 with 900 GB/s bidirectional NVLink, the per-direction bandwidth is 450 GB/s, and a typical achievable fraction is 80 to 90 percent of that, so NCCL plans against roughly 360 to 400 GB/s of effective per-direction bandwidth. On a B200 with 1.8 TB/s NVLink, the same fraction gives roughly 720 to 800 GB/s effective per-direction bandwidth.

This is why a B200 ring all-reduce is roughly 2x faster than an H100 ring all-reduce in the bandwidth-bound regime. The link count did not change; the per-lane SerDes rate did. The end-user effect, on a 256 MiB gradient bucket across an 8-GPU ring, is the difference between an all-reduce on the order of 1.2 milliseconds and one on the order of 600 microseconds. Multiplied by every step of every training run, that is the entire ROI of a generation hop.
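
A back-of-envelope for those bucket numbers, as a sketch: assume an 8-GPU ring in which each GPU sends and receives 2(n-1)/n times the bucket size (standard ring all-reduce accounting), at 85 percent of the per-direction wire rate, the midpoint of the range above. The function name and the 0.85 default are mine:

    # Time for one bandwidth-bound ring all-reduce of a gradient bucket.
    def ring_allreduce_us(bucket_bytes, n_gpus, aggregate_gb_s, efficiency=0.85):
        per_direction = aggregate_gb_s / 2 * efficiency       # effective GB/s
        wire_bytes = 2 * (n_gpus - 1) / n_gpus * bucket_bytes # sent (and received) per GPU
        return wire_bytes / (per_direction * 1e9) * 1e6       # seconds -> microseconds

    bucket = 256 * 2**20    # 256 MiB
    print(f"H100: {ring_allreduce_us(bucket, 8, 900):.0f} us")     # ~1230 us
    print(f"B200: {ring_allreduce_us(bucket, 8, 1800):.0f} us")    # ~610 us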

What this means in practice

  • The single number to memorize per GPU is the NVLink bidirectional aggregate. Halve it for per-direction sizing. Take 80 to 90 percent of that for what NCCL will actually achieve.
  • Link count has been 18 since H100. If you see a number that suggests otherwise (e.g., a per-package "NVLink count" of 4), it is probably counting a different unit, such as ports on the package or SerDes lanes within a link, not links. NVIDIA's data sheets sometimes mix the units.
  • Per-link bandwidth is the SerDes story. When NVIDIA pre-announces a future NVLink generation, the most useful question is what per-lane signaling rate it targets. That number is what flows downstream into NVSwitch and NVL72 bandwidth, into bisection bandwidth, and into the ring vs tree crossover for collectives.
  • For per-direction sizing in your own back-of-envelope: node aggregate / 2 / 8 GPUs gives the per-GPU per-direction wire rate inside an HGX node. On H100: 7.2 TB/s node aggregate / 2 / 8 = 450 GB/s. That is the number ring all-reduce drives toward (see the sketch after this list).
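
The same chain of divisions, spelled out (a sketch; 8 is the GPU count of an HGX baseboard, and the variable names are mine):

    # Per-GPU, per-direction wire rate inside an 8-GPU HGX H100 node.
    node_aggregate = 8 * 900                        # GB/s: eight GPUs at 900 GB/s each
    per_direction_per_gpu = node_aggregate / 2 / 8  # halve for direction, split across GPUs
    print(per_direction_per_gpu)                    # 450.0 GB/s, the ring target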

The link math is small. The downstream consequences are not.

Updated 2026-05-10