GPUDirect RDMA
A GPU has on the order of 1.8 TB/s of NVLink bandwidth, while an NDR InfiniBand NIC runs at 50 GB/s per port. Even if you fully provision the NIC ports, the inter-node bandwidth ceiling is roughly 30x lower than NVLink. What determines whether you get close to that ceiling, or settle for a third of it, is whether the bytes ever land in CPU memory on their way through.
The path that does the work
GPUDirect RDMA (GDR) is a kernel feature plus driver glue that lets a Mellanox/NVIDIA NIC perform DMA transfers directly to and from a GPU's HBM. The NIC issues PCIe transactions against the GPU's BAR1 (a memory-mapped window onto HBM exposed to the PCIe root complex). The kernel module nvidia-peermem (formerly nv_peer_mem) registers GPU memory with the RDMA stack so that ibverbs can create memory regions over HBM, and work requests can then use HBM as either the source or the sink of an RDMA write or read.
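As a concrete illustration, here is a minimal sketch of that registration path: a plain cudaMalloc'd buffer handed straight to ibverbs. It assumes nvidia-peermem is loaded, a build that links libibverbs and the CUDA runtime, and it simply takes the first RDMA device it finds; the buffer size and error handling are illustrative, not prescribed by anything above.

```c
/* Sketch: register GPU memory with the RDMA stack so the NIC can DMA
 * straight into HBM. Assumes nvidia-peermem is loaded; device choice,
 * buffer size, and error handling are illustrative only. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t len = 1 << 20;   /* 1 MiB illustrative buffer */
    void *gpu_buf = NULL;

    /* Plain device allocation in HBM -- no host staging buffer involved. */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) return 1;

    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* With nvidia-peermem loaded, ibv_reg_mr accepts the device pointer
     * directly; without it, this call fails and the application must fall
     * back to a pinned host bounce buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr on GPU memory failed: no GDR path\n");
        return 1;
    }
    printf("registered %zu bytes of HBM, rkey=0x%x\n", len, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```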
Without GDR, the same transfer bounces through host memory on both ends. The source GPU first DMAs the buffer into pinned host memory (one PCIe hop, GPU to CPU), the NIC then DMAs the buffer from host memory onto the wire (a second hop, CPU to NIC), the wire delivers the bytes to the remote NIC, and the remote NIC DMAs into remote host memory before a final remote PCIe hop into the destination GPU's HBM. That is four through-CPU PCIe hops; with GDR there are zero. Each side still crosses PCIe once, NIC to GPU, but that hop is direct and never touches CPU memory.
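For contrast, a hedged sketch of what the sender's side of that bounce path looks like: stage through a pinned host buffer and register that instead of HBM. The helper name is hypothetical and the queue-pair plumbing and ibv_post_send are omitted; this is only meant to make the extra hops visible.

```c
/* Sketch of the no-GDR bounce path from the sender's side: the bytes must
 * be staged through pinned host memory before the NIC can see them.
 * Function name is illustrative; QP setup and the actual send are omitted. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

void send_without_gdr(struct ibv_pd *pd, const void *gpu_buf, size_t len) {
    void *host_bounce = NULL;
    cudaMallocHost(&host_bounce, len);             /* pinned host memory */

    /* PCIe hop 1: GPU -> CPU memory. */
    cudaMemcpy(host_bounce, gpu_buf, len, cudaMemcpyDeviceToHost);

    /* The NIC can only be handed host memory, so that is what gets
     * registered; PCIe hop 2 (CPU -> NIC) happens when a work request
     * referencing this MR is posted and the NIC DMAs it onto the wire. */
    struct ibv_mr *mr = ibv_reg_mr(pd, host_bounce, len,
                                   IBV_ACCESS_LOCAL_WRITE);
    /* ... build an ibv_send_wr over mr and ibv_post_send it ... */

    ibv_dereg_mr(mr);
    cudaFreeHost(host_bounce);
}
```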
Why the latency floor changes
Each CPU bounce costs hundreds of nanoseconds of PCIe protocol overhead, plus a trip through the kernel driver to handle the DMA descriptor, plus contention against whatever else is using CPU memory bandwidth. In practice, a small RDMA write on a tuned cluster lands around 1.5 microseconds with GDR versus roughly 5 microseconds without it. The bandwidth difference is even bigger: without GDR, host memory bandwidth (typically 200-400 GB/s on a modern Sapphire Rapids server) becomes the ceiling, and you cannot fully drive an 8-port NDR setup. With GDR, the ceiling is the PCIe Gen5 x16 link between the GPU and the NIC (~64 GB/s per direction), which is enough to feed one NDR port per GPU at line rate with headroom to spare.
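The bandwidth claims reduce to simple link arithmetic. The sketch below works it through with nominal signalling rates (no encoding or protocol overhead, so real throughput sits somewhat below these figures); the numbers are assumptions stated in the comments, not measurements.

```c
/* Back-of-envelope arithmetic behind the ceilings quoted above, using
 * nominal link rates and ignoring encoding/protocol overhead. */
#include <stdio.h>

int main(void) {
    double pcie_gen5_x16_GBs = 32.0 * 16 / 8;  /* 32 GT/s per lane, 16 lanes -> ~64 GB/s per direction */
    double ndr_port_GBs      = 400.0 / 8;      /* NDR 400 Gb/s -> 50 GB/s per port */

    printf("PCIe Gen5 x16: ~%.0f GB/s per direction\n", pcie_gen5_x16_GBs);
    printf("One NDR port:  %.0f GB/s\n", ndr_port_GBs);
    printf("NDR ports one GPU<->NIC link can feed at line rate: %.2f\n",
           pcie_gen5_x16_GBs / ndr_port_GBs);  /* ~1.3 -> one port, with headroom */

    /* Bouncing through the CPU makes host DRAM absorb a write (from the GPU)
     * and a read (from the NIC) for every byte on the wire. */
    printf("Host memory traffic to bounce 8 NDR ports: %.0f GB/s\n",
           2 * 8 * ndr_port_GBs);              /* 800 GB/s, above a typical socket */
    return 0;
}
```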
What it requires
GDR is not free to set up. The NIC and GPU must be on the same PCIe root complex (or behind a PCIe switch with peer-to-peer enabled), the kernel must have IOMMU configured in passthrough mode (or with explicit allow rules for the BAR1 region), the BAR1 size must be configured large enough in the GPU's firmware to expose the HBM region to PCIe, and the nvidia-peermem kernel module must be loaded before any RDMA application opens the device. On a misconfigured cluster, GDR silently falls back to the bounce path and the only symptom is a 3x bandwidth gap that operators chase for weeks. NCCL emits a log line at startup indicating whether it has detected GDR-capable transports; checking that log line is the fastest way to confirm.
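One cheap check from the application side, assuming a CUDA 11.3 or newer toolkit where the cudaDevAttrGPUDirectRDMASupported attribute is available: ask the driver whether the GPU reports GPUDirect RDMA support at all. A positive answer does not prove the NIC path is configured; it only rules out the GPU as the blocker, so the module, IOMMU, and BAR1 checks still apply.

```c
/* Sketch: query whether each GPU reports GPUDirect RDMA support.
 * Assumes CUDA 11.3+; a "yes" here says nothing about nvidia-peermem,
 * IOMMU mode, or BAR1 sizing. Error handling omitted for brevity. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int dev_count = 0;
    cudaGetDeviceCount(&dev_count);
    for (int d = 0; d < dev_count; ++d) {
        int gdr = 0;
        cudaDeviceGetAttribute(&gdr, cudaDevAttrGPUDirectRDMASupported, d);
        printf("GPU %d: GPUDirect RDMA %s by the device\n",
               d, gdr ? "supported" : "not supported");
    }
    return 0;
}
```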
What this means in practice
- For any inter-node training or inference traffic, GDR is the difference between IB being a real transport and being a memory-copy bottleneck. It is the assumption that frameworks like NCCL, NIXL, and UCX bake in.
- For storage, the analog is GPUDirect Storage, which extends the same idea to NVMe targets: the NVMe controller DMAs directly into HBM, skipping the CPU bounce buffer.
- For RDMA fabric choice, GDR works on both InfiniBand and RoCE when paired with a Mellanox/NVIDIA NIC. The GDR feature itself is fabric-agnostic; what matters is that the NIC's driver supports peer memory mapping.
- For debugging: if NCCL's startup log shows that the network plugin is IB but bandwidth is well below NDR line rate, check that nvidia-peermem is loaded (lsmod | grep nvidia_peermem), check the IOMMU mode, and check BAR1 sizing in nvidia-smi -q | grep BAR1. The fix is almost always at the kernel/firmware layer, not the framework layer.
GDR is one of the rare features where you do not see what it is doing until you look for it; the only visible symptom of its absence is a slow training run.
Updated 2026-05-10