GPUDirect RDMA
Direct GPU memory access across the network, bypassing CPU copies.
What it is
GPUDirect RDMA is a technology that lets RDMA-capable network adapters (InfiniBand HCAs or RoCE NICs) read from and write to GPU memory directly, without staging data through host CPU memory, eliminating two host-memory copies per transfer. It requires a peer-memory kernel module (nvidia-peermem on current driver stacks), performs best when the GPU and NIC sit under the same PCIe switch or root complex, and benefits from NUMA-aware process placement.
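The NUMA-aware placement above can be sketched with the standard Linux numactl tool. The node index and training command below are placeholders, not values from this document; read the real GPU/NIC affinity from nvidia-smi topo -m before pinning.

```shell
# Sketch: pin the training process to the NUMA node local to GPU 0 and its NIC.
# Node index 0 is an assumption; ./train is a hypothetical training command.
numa_node=0
launch_cmd="numactl --cpunodebind=$numa_node --membind=$numa_node ./train"
echo "would launch: $launch_cmd"
```

Binding both CPUs and memory to the NIC-local node keeps any residual host-memory traffic off the inter-socket link.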
Why it matters
GPUDirect RDMA reduces AllReduce latency by 30-50% compared to CPU-staged transfers, and is essential for achieving peak inter-node bandwidth in distributed training. Without it, every inter-node NCCL collective incurs unnecessary CPU memory staging that saturates the host memory bus. Misconfigurations that silently fall back to CPU-staged transfers appear as unexpected inter-node bandwidth degradation.
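One way to catch a silent fallback in a training job is to make NCCL log its transport selection. The environment variables below are real NCCL knobs; the exact log wording varies by NCCL version, so treat the expected strings as a rough guide rather than a contract.

```shell
# Sketch: surface GPUDirect RDMA status in NCCL init logs before launching a job.
export NCCL_DEBUG=INFO              # print transport selection at init
export NCCL_DEBUG_SUBSYS=INIT,NET   # limit log noise to init and network paths
export NCCL_NET_GDR_LEVEL=PXB       # allow GPUDirect RDMA when GPU and NIC share a PCIe switch
# With GPUDirect RDMA active, inter-node channels typically log a GDRDMA-suffixed
# NET/IB path; a plain NET/IB path with no GDRDMA marker suggests CPU staging.
```

Setting NCCL_NET_GDR_LEVEL too permissively (e.g. SYS) can force GPUDirect RDMA across the inter-socket link, which often performs worse than staged copies.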
How to monitor
Verify GPUDirect RDMA is active by checking for the nvidia-peermem kernel module (lsmod | grep nvidia_peermem) and confirming GPU-NIC PCIe affinity via nvidia-smi topo -m. Track DCGM_FI_DEV_PCIE_TX_THROUGHPUT alongside inter-node NCCL bandwidth to detect silent fallback to CPU-staged transfers. Factryze validates GPUDirect RDMA configuration as part of node health checks.
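The checks above can be combined into a small script. This is a sketch assuming a Linux host with the NVIDIA driver stack; it only reports state and degrades gracefully when the tools are absent.

```shell
#!/bin/sh
# Sketch of a GPUDirect RDMA health check: module presence plus PCIe affinity.

# 1. Peer-memory module: without it, transfers silently stage through host memory.
if lsmod 2>/dev/null | grep -q nvidia_peermem; then
  peermem_status="loaded"
else
  peermem_status="missing"
fi
echo "nvidia_peermem: $peermem_status"

# 2. GPU-NIC PCIe affinity: in the topology matrix, PIX/PXB entries mean the
#    GPU and NIC share a PCIe switch; NODE/SYS entries mean traffic crosses
#    the CPU or the inter-socket link, where GPUDirect RDMA is disabled or slow.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi topo -m
else
  echo "nvidia-smi not found; skipping topology check"
fi
```

Running this in a node health check turns the "silent" fallback into an explicit, alertable signal.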