Scale Atlas · Chapter 2 · 89 terms · Updated 2026-05-10
Interconnect
The wires and switches that turn many GPUs into one. NVLink links inside the box, NVSwitch crossbars across the rack, InfiniBand or RoCE between racks. Each tier offers a different bandwidth at a different latency, and topology design is the act of choosing which traffic sees which tier.
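A minimal sketch of the tier picture, with illustrative bandwidth and latency figures (order-of-magnitude placeholders, not vendor specs):

```python
# Illustrative tier table: rough per-GPU bandwidth and one-way latency.
# Numbers are order-of-magnitude placeholders, not vendor specs.
TIERS = {
    # tier                 (GB/s per GPU, latency in microseconds)
    "nvlink_in_the_box":   (900.0, 2.0),   # NVLink to NVSwitch inside the node
    "nvswitch_rack":       (900.0, 3.0),   # NVLink domain stretched across the rack
    "ib_or_roce":          (50.0, 5.0),    # one 400 Gb/s NIC port between racks
}

def transfer_time_us(nbytes: float, tier: str) -> float:
    """Latency plus serialization time for one message on a given tier."""
    bw_gbs, latency_us = TIERS[tier]
    return latency_us + nbytes / (bw_gbs * 1e3)   # GB/s -> bytes per microsecond

if __name__ == "__main__":
    shard = 256 * 1024 * 1024   # a 256 MiB gradient shard
    for tier in TIERS:
        print(f"{tier:18s} {transfer_time_us(shard, tier):9.1f} us")
```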
Bisection Bandwidth
Bisection bandwidth = the worst-case total bandwidth between the two halves of any equal split of the fabric. Figure of merit for all-to-all. A non-blocking fat-tree has bisection ≈ N/2 × line rate.
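Worked example, assuming a non-blocking fat-tree with 1,024 endpoints at NDR line rate (sizes chosen for illustration):

```python
# Bisection of a non-blocking fat-tree: N/2 links cross any equal split,
# each at line rate. Endpoint count and rate are illustrative.
N = 1024              # endpoints (GPU NIC ports)
line_rate_gbs = 50    # GB/s per port (400 Gb/s NDR)

bisection_gbs = (N // 2) * line_rate_gbs
print(f"bisection ~ {bisection_gbs / 1e3:.1f} TB/s")   # ~ 25.6 TB/s

# All-to-all is the stress case: on average (N-1)/N of every endpoint's
# traffic crosses the bisection, so it needs nearly the whole figure.
```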
GPUDirect RDMA
GPUDirect RDMA lets the NIC DMA bytes straight from a remote GPU's HBM to the local GPU's HBM, bypassing both hosts' system memory. Without it, every message stages through a pinned host bounce buffer and IB tops out at PCIe-bounce speed.
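A back-of-envelope sketch of why the bounce matters, assuming a 400 Gb/s NIC, a PCIe Gen5 x16 link, and an assumed host-staging copy rate; all figures illustrative:

```python
# Compare the direct path (GPUDirect RDMA) with a host-bounce path for one
# IB transfer. Illustrative figures, not measurements.
nic_gbs = 50.0        # 400 Gb/s NDR port
pcie_gbs = 63.0       # PCIe Gen5 x16, theoretical per direction
host_copy_gbs = 40.0  # assumed effective rate of the staging copy through host DRAM

# GPUDirect RDMA: remote HBM -> NIC -> wire -> NIC -> local HBM; bounded by the NIC.
direct_gbs = min(nic_gbs, pcie_gbs)

# Without it, each side stages through a pinned host buffer; the extra copy
# becomes a pipeline stage and the slowest stage sets the rate.
staged_gbs = min(nic_gbs, pcie_gbs, host_copy_gbs)

print(f"direct ~ {direct_gbs:.0f} GB/s, staged ~ {staged_gbs:.0f} GB/s")
```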
InfiniBand NDR vs HDR
NDR (400 Gb/s, 50 GB/s per port) doubles HDR (200 Gb/s, 25 GB/s per port). With one NIC per GPU on an 8-GPU node, per-node aggregate goes from 200 GB/s to 400 GB/s. Switch radix and rack power both shift.
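The per-port and per-node arithmetic, assuming eight NICs per node (one rail per GPU):

```python
# Per-port and per-node arithmetic, assuming eight NICs per node.
def per_node_gbs(gbits_per_port: float, ports: int = 8) -> float:
    return gbits_per_port / 8 * ports    # Gb/s -> GB/s, summed over ports

print(per_node_gbs(200))   # HDR: 25 GB/s per port -> 200.0 GB/s per node
print(per_node_gbs(400))   # NDR: 50 GB/s per port -> 400.0 GB/s per node
```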
NVL72 Domain
GB200 NVL72 turns one rack into one NVLink domain: 72 GPUs, 13.5 TB pooled HBM3e, 130 TB/s aggregate fabric. Bigger than any single node has ever been.
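Dividing the quoted rack totals back out per GPU recovers the per-GPU NVLink5 figure used below:

```python
# Sanity-check the rack-level NVL72 figures by dividing them back out per GPU.
gpus = 72
pooled_hbm_tb = 13.5   # quoted rack total
fabric_tbs = 130.0     # quoted aggregate NVLink bandwidth

print(f"HBM per GPU    ~ {pooled_hbm_tb * 1e3 / gpus:.0f} GB")    # ~ 188 GB
print(f"NVLink per GPU ~ {fabric_tbs * 1e3 / gpus:.0f} GB/s")     # ~ 1806 GB/s, i.e. ~1.8 TB/s
```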
NVLink and NVSwitch Topology
NVSwitch is a non-blocking crossbar that gives every GPU a full-rate path to every other GPU inside the node. HGX H100: 4 NVSwitches, 8 GPUs, 7.2 TB/s aggregate.
NVLink Bandwidth Math
NVLink bandwidth = links per GPU × per-link GB/s. H100 NVLink4: 18 × 50 = 900 GB/s. B200 NVLink5: 18 × 100 = 1.8 TB/s. These are bidirectional totals: half is in, half is out.
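The same math in code, tying the per-GPU numbers to the node aggregate quoted in the NVSwitch entry above:

```python
# NVLink bandwidth math: links per GPU x per-link rate (bidirectional totals,
# so half of each figure is transmit and half receive).
def gpu_nvlink_gbs(links: int, per_link_gbs: int) -> int:
    return links * per_link_gbs

h100 = gpu_nvlink_gbs(18, 50)    # NVLink4: 900 GB/s per GPU
b200 = gpu_nvlink_gbs(18, 100)   # NVLink5: 1800 GB/s per GPU

# Node aggregate on an 8-GPU HGX H100 baseboard matches the NVSwitch entry above.
print(h100, b200, 8 * h100)      # 900 1800 7200 (GB/s) -> 7.2 TB/s aggregate
```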
Rail-Optimized Fat-Tree
Eight GPU rails per node land on eight dedicated leaf switches: NIC i on every node connects to leaf i, so an all-reduce on rail 0 sees zero contention from an all-reduce on rail 1. Leaf-to-spine is wired 1:1 (no oversubscription).
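A minimal sketch of the rail mapping, assuming eight rails and one NIC per GPU (switch names hypothetical):

```python
# Rail-optimized wiring sketch: NIC i of every node plugs into leaf switch i,
# so rail i traffic between any two nodes stays on one leaf and never
# competes with the other rails. Names and counts are illustrative.
RAILS = 8   # one NIC (rail) per GPU on an 8-GPU node

def leaf_for(node: int, gpu: int) -> str:
    rail = gpu % RAILS
    return f"leaf-{rail}"   # same leaf for this rail on every node

assert leaf_for(node=0, gpu=3) == leaf_for(node=41, gpu=3)   # rail 3 stays on leaf-3
```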
RoCE vs InfiniBand
RoCE v2 is RDMA over UDP/IP, relying on PFC and ECN to keep the fabric lossless. Native IB runs link-level credit flow control instead. IB behaves like a deterministic appliance; RoCE puts your Ethernet team on the hook for PFC thresholds and ECN tuning.
Topology-Aware Placement
Place TP groups inside one NVLink domain and DP groups across IB. Wrong placement (TP across IB) makes the TP collectives roughly 50x slower. Slurm topology.conf and k8s topology hints make the mapping automatic.
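A sketch of the placement rule, assuming 8-GPU nodes; the group sizes and helper function are illustrative, not a Slurm or Kubernetes API:

```python
# Placement sketch, assuming 8-GPU nodes: TP peers share a node (NVLink),
# DP peers sit on different nodes (InfiniBand). Sizes and helper are
# illustrative, not a Slurm or Kubernetes API.
GPUS_PER_NODE = 8   # tensor-parallel group size == one NVLink domain

def placement(global_rank: int) -> tuple[int, int, int, int]:
    """Return (node, local_gpu, tp_rank, dp_rank) for a global rank."""
    node, local_gpu = divmod(global_rank, GPUS_PER_NODE)
    tp_rank = local_gpu   # TP collective stays on NVLink
    dp_rank = node        # DP all-reduce crosses the IB fabric
    return node, local_gpu, tp_rank, dp_rank

print(placement(11))   # (1, 3, 3, 1): node 1, GPU 3
```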