Scale Atlas · Chapter 2 · 89 terms · Updated 2026-05-10
Interconnect
The wires and switches that turn many GPUs into one. NVLink links inside the box, NVSwitch crossbars across the rack, InfiniBand or RoCE between racks. Each tier offers a different bandwidth at a different latency, and topology design is the act of choosing which traffic sees which tier.
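A minimal sketch of the tier picture, with illustrative bandwidth and latency figures (order-of-magnitude placeholders, not vendor specs):

```python
# Illustrative tier table: rough per-GPU bandwidth and one-way latency.
# Numbers are order-of-magnitude placeholders, not vendor specs.
TIERS = {
    # tier                 (GB/s per GPU, latency in microseconds)
    "nvlink_in_the_box":   (900.0, 2.0),   # NVLink to NVSwitch inside the node
    "nvswitch_rack":       (900.0, 3.0),   # NVLink domain stretched across the rack
    "ib_or_roce":          (50.0, 5.0),    # one 400 Gb/s NIC port between racks
}

def transfer_time_us(nbytes: float, tier: str) -> float:
    """Latency plus serialization time for one message on a given tier."""
    bw_gbs, latency_us = TIERS[tier]
    return latency_us + nbytes / (bw_gbs * 1e3)   # GB/s -> bytes per microsecond

if __name__ == "__main__":
    shard = 256 * 1024 * 1024   # a 256 MiB gradient shard
    for tier in TIERS:
        print(f"{tier:18s} {transfer_time_us(shard, tier):9.1f} us")
```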
Bisection Bandwidth
Bisection bandwidth = the worst-case total bandwidth between the two halves of any equal split of the fabric. Figure of merit for all-to-all. A non-blocking fat-tree has bisection ≈ N/2 × line rate.
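Worked example, assuming a non-blocking fat-tree with 1,024 endpoints at NDR line rate (sizes chosen for illustration):

```python
# Bisection of a non-blocking fat-tree: N/2 links cross any equal split,
# each at line rate. Endpoint count and rate are illustrative.
N = 1024              # endpoints (GPU NIC ports)
line_rate_gbs = 50    # GB/s per port (400 Gb/s NDR)

bisection_gbs = (N // 2) * line_rate_gbs
print(f"bisection ~ {bisection_gbs / 1e3:.1f} TB/s")   # ~ 25.6 TB/s

# All-to-all is the stress case: on average (N-1)/N of every endpoint's
# traffic crosses the bisection, so it needs nearly the whole figure.
```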
GPUDirect RDMA
GPUDirect RDMA lets the NIC DMA bytes straight from a remote GPU's HBM to the local GPU's HBM, bypassing both hosts' system memory. Without it, every message stages through a pinned host bounce buffer and IB tops out at PCIe-bounce speed.
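A back-of-envelope sketch of why the bounce matters, assuming a 400 Gb/s NIC, a PCIe Gen5 x16 link, and an assumed host-staging copy rate; all figures illustrative:

```python
# Compare the direct path (GPUDirect RDMA) with a host-bounce path for one
# IB transfer. Illustrative figures, not measurements.
nic_gbs = 50.0        # 400 Gb/s NDR port
pcie_gbs = 63.0       # PCIe Gen5 x16, theoretical per direction
host_copy_gbs = 40.0  # assumed effective rate of the staging copy through host DRAM

# GPUDirect RDMA: remote HBM -> NIC -> wire -> NIC -> local HBM; bounded by the NIC.
direct_gbs = min(nic_gbs, pcie_gbs)

# Without it, each side stages through a pinned host buffer; the extra copy
# becomes a pipeline stage and the slowest stage sets the rate.
staged_gbs = min(nic_gbs, pcie_gbs, host_copy_gbs)

print(f"direct ~ {direct_gbs:.0f} GB/s, staged ~ {staged_gbs:.0f} GB/s")
```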
InfiniBand NDR vs HDR
NDR (400 Gb/s, 50 GB/s per port) doubles HDR (200 Gb/s, 25 GB/s per port). With one NIC per GPU on an 8-GPU node, per-node aggregate goes from 200 GB/s to 400 GB/s. Switch radix and rack power both shift.
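The per-port and per-node arithmetic, assuming eight NICs per node (one rail per GPU):

```python
# Per-port and per-node arithmetic, assuming eight NICs per node.
def per_node_gbs(gbits_per_port: float, ports: int = 8) -> float:
    return gbits_per_port / 8 * ports    # Gb/s -> GB/s, summed over ports

print(per_node_gbs(200))   # HDR: 25 GB/s per port -> 200.0 GB/s per node
print(per_node_gbs(400))   # NDR: 50 GB/s per port -> 400.0 GB/s per node
```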
NVL72 Domain
GB200 NVL72 turns one rack into one NVLink domain: 72 GPUs, 13.5 TB pooled HBM3e, 130 TB/s aggregate fabric. Bigger than any single node has ever been.
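Dividing the quoted rack totals back out per GPU recovers the per-GPU NVLink5 figure used below:

```python
# Sanity-check the rack-level NVL72 figures by dividing them back out per GPU.
gpus = 72
pooled_hbm_tb = 13.5   # quoted rack total
fabric_tbs = 130.0     # quoted aggregate NVLink bandwidth

print(f"HBM per GPU    ~ {pooled_hbm_tb * 1e3 / gpus:.0f} GB")    # ~ 188 GB
print(f"NVLink per GPU ~ {fabric_tbs * 1e3 / gpus:.0f} GB/s")     # ~ 1806 GB/s, i.e. ~1.8 TB/s
```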
NVLink and NVSwitch Topology
NVSwitch is a non-blocking crossbar that gives every GPU a full-rate path to every other GPU inside the node. HGX H100: 4 NVSwitches, 8 GPUs, 7.2 TB/s aggregate.
NVLink Bandwidth Math
NVLink bandwidth = links per GPU × per-link GB/s. H100 NVLink4: 18 × 50 = 900 GB/s. B200 NVLink5: 18 × 100 = 1.8 TB/s. These are bidirectional totals: half is in, half is out.
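The same math in code, tying the per-GPU numbers to the node aggregate quoted in the NVSwitch entry above:

```python
# NVLink bandwidth math: links per GPU x per-link rate (bidirectional totals,
# so half of each figure is transmit and half receive).
def gpu_nvlink_gbs(links: int, per_link_gbs: int) -> int:
    return links * per_link_gbs

h100 = gpu_nvlink_gbs(18, 50)    # NVLink4: 900 GB/s per GPU
b200 = gpu_nvlink_gbs(18, 100)   # NVLink5: 1800 GB/s per GPU

# Node aggregate on an 8-GPU HGX H100 baseboard matches the NVSwitch entry above.
print(h100, b200, 8 * h100)      # 900 1800 7200 (GB/s) -> 7.2 TB/s aggregate
```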
Rail-Optimized Fat-Tree
Eight GPU rails per node land on eight dedicated leaf switches: NIC i on every node connects to leaf i, so an all-reduce on rail 0 sees zero contention from an all-reduce on rail 1. Leaf-to-spine is wired 1:1 (no oversubscription).
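A minimal sketch of the rail mapping, assuming eight rails and one NIC per GPU (switch names hypothetical):

```python
# Rail-optimized wiring sketch: NIC i of every node plugs into leaf switch i,
# so rail i traffic between any two nodes stays on one leaf and never
# competes with the other rails. Names and counts are illustrative.
RAILS = 8   # one NIC (rail) per GPU on an 8-GPU node

def leaf_for(node: int, gpu: int) -> str:
    rail = gpu % RAILS
    return f"leaf-{rail}"   # same leaf for this rail on every node

assert leaf_for(node=0, gpu=3) == leaf_for(node=41, gpu=3)   # rail 3 stays on leaf-3
```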
RoCE vs InfiniBand
RoCE v2 is RDMA over UDP/IP, relying on PFC and ECN to keep the fabric lossless. Native IB runs link-level credit flow control instead. IB behaves like a deterministic appliance; RoCE puts your Ethernet team on the hook for PFC thresholds and ECN tuning.
Topology-Aware Placement
Place TP groups inside one NVLink domain and DP groups across IB. Wrong placement (TP across IB) makes the TP collectives roughly 50x slower. Slurm topology.conf and k8s topology hints make the mapping automatic.
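A sketch of the placement rule, assuming 8-GPU nodes; the group sizes and helper function are illustrative, not a Slurm or Kubernetes API:

```python
# Placement sketch, assuming 8-GPU nodes: TP peers share a node (NVLink),
# DP peers sit on different nodes (InfiniBand). Sizes and helper are
# illustrative, not a Slurm or Kubernetes API.
GPUS_PER_NODE = 8   # tensor-parallel group size == one NVLink domain

def placement(global_rank: int) -> tuple[int, int, int, int]:
    """Return (node, local_gpu, tp_rank, dp_rank) for a global rank."""
    node, local_gpu = divmod(global_rank, GPUS_PER_NODE)
    tp_rank = local_gpu   # TP collective stays on NVLink
    dp_rank = node        # DP all-reduce crosses the IB fabric
    return node, local_gpu, tp_rank, dp_rank

print(placement(11))   # (1, 3, 3, 1): node 1, GPU 3
```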