Networking
GPU cluster performance is ultimately bounded by the interconnect fabric. Intra-node, NVLink and NVSwitch provide the high-bandwidth, low-latency links that allow GPUs to exchange gradients and activations at hundreds of gigabytes per second. Inter-node, InfiniBand and RoCE fabrics carry collective communication traffic across the cluster, and a single degraded port or misconfigured switch can silently reduce distributed training throughput by 30% or more without triggering any application-level error. This section covers the networking primitives essential to GPU infrastructure — from NVLink lane counts and bandwidth specifications, to GPUDirect RDMA for bypassing the CPU in data transfers, to SHARP in-network aggregation that offloads collective operations to the switch fabric. Each term includes monitoring guidance, expected bandwidth baselines, and the degradation signals that Factryze tracks to catch network issues before they impact training jobs.
Adaptive Routing
Dynamic path selection in network switches to avoid congestion.
GPUDirect RDMA
Direct GPU memory access across the network, bypassing CPU copies.
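As a quick host-level check, the sketch below verifies that a GPUDirect RDMA kernel module is present. It assumes a Linux host and that the module is named nvidia_peermem (recent drivers) or nv_peer_mem (older deployments); adjust the names if your stack differs.

```python
# Minimal sketch: verify that a GPUDirect RDMA kernel module is loaded.
# Recent NVIDIA drivers ship nvidia_peermem; older setups used nv_peer_mem.
def gpudirect_rdma_module_loaded(proc_modules="/proc/modules"):
    wanted = {"nvidia_peermem", "nv_peer_mem"}
    with open(proc_modules) as f:
        loaded = {line.split()[0] for line in f if line.strip()}
    return bool(wanted & loaded)

if __name__ == "__main__":
    if gpudirect_rdma_module_loaded():
        print("GPUDirect RDMA kernel module is loaded")
    else:
        print("No GPUDirect RDMA module found; transfers will fall back to CPU bounce buffers")
```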
InfiniBand
High-bandwidth, low-latency network fabric for GPU clusters.
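A minimal way to baseline an HCA is to read its port counters from sysfs, as in the sketch below. The paths follow the standard Linux RDMA layout; port_xmit_data and port_rcv_data are assumed to be reported in 4-byte units, which is worth confirming for your driver, and rising symbol_error or link_downed counts are the degradation signals to alert on.

```python
import glob, os

# Sketch: read InfiniBand port counters exposed by the kernel under sysfs.
# port_xmit_data / port_rcv_data are assumed to be in units of 4 bytes;
# symbol_error and link_downed rising over time are classic degradation signals.
def read_ib_counters():
    results = {}
    for counters_dir in glob.glob("/sys/class/infiniband/*/ports/*/counters"):
        port = counters_dir.split("/sys/class/infiniband/")[1].replace("/counters", "")
        results[port] = {}
        for name in ("port_xmit_data", "port_rcv_data", "symbol_error", "link_downed"):
            path = os.path.join(counters_dir, name)
            if os.path.exists(path):
                with open(path) as f:
                    results[port][name] = int(f.read().strip())
    return results

if __name__ == "__main__":
    for port, counters in read_ib_counters().items():
        tx_bytes = counters.get("port_xmit_data", 0) * 4
        rx_bytes = counters.get("port_rcv_data", 0) * 4
        print(f"{port}: tx={tx_bytes} B rx={rx_bytes} B "
              f"symbol_error={counters.get('symbol_error', 0)} "
              f"link_downed={counters.get('link_downed', 0)}")
```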
Network Fabric
The physical interconnect topology connecting all nodes in a cluster.
NVLink
NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
Key DCGM metric: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
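The sketch below shows one way to pull that metric from a Prometheus server scraping dcgm-exporter and compare it against a per-GPU baseline. The endpoint URL, the baseline value, and the assumption that the exporter is configured to publish this field are illustrative, not prescriptive; check the field's units and semantics against the DCGM documentation for your version.

```python
import json
import urllib.parse
import urllib.request

# Sketch: query DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL from a Prometheus server that
# scrapes dcgm-exporter. The URL and baseline are illustrative assumptions; the
# field must be enabled in the exporter config, and its units/semantics should be
# confirmed against the DCGM field documentation.
PROMETHEUS_URL = "http://prometheus.internal:9090"  # hypothetical endpoint
PER_GPU_BASELINE = 100_000  # example threshold in the metric's native units

def prom_query(promql):
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

for sample in prom_query("DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL"):
    gpu = sample["metric"].get("gpu", "?")
    value = float(sample["value"][1])
    flag = "" if value >= PER_GPU_BASELINE else "  <-- below baseline"
    print(f"GPU {gpu}: NVLink bandwidth total = {value:.0f}{flag}")
```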
NVSwitch
NVIDIA's NVLink switch enabling all-to-all GPU communication.
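One quick way to confirm that NVSwitch is actually providing all-to-all connectivity is to parse the output of `nvidia-smi topo -m` and flag GPU pairs that are not reported as NVLink-connected. The parsing below is approximate, since the matrix layout can vary across driver versions; treat any flagged pair as a prompt for manual inspection.

```python
import subprocess

# Sketch: parse `nvidia-smi topo -m` and flag GPU pairs whose connection is not
# an NVLink entry ("NV<n>"). Parsing is approximate; the matrix layout can vary
# between driver versions.
def non_nvlink_gpu_pairs():
    out = subprocess.run(["nvidia-smi", "topo", "-m"],
                         capture_output=True, text=True, check=True).stdout
    rows = [line.split() for line in out.splitlines() if line.startswith("GPU")]
    num_gpus = len(rows)
    flagged = []
    for row in rows:
        src = row[0]
        # Only the first num_gpus columns after the row label are GPU-to-GPU cells.
        for col, cell in enumerate(row[1:1 + num_gpus]):
            dst = f"GPU{col}"
            if cell != "X" and not cell.startswith("NV"):
                flagged.append((src, dst, cell))
    return flagged

if __name__ == "__main__":
    for src, dst, link in non_nvlink_gpu_pairs():
        print(f"{src} <-> {dst} traverses {link}, not NVLink")
```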
Packet Drops
Lost network packets indicating congestion or hardware errors.
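The sketch below samples per-interface drop counters from /sys/class/net twice and reports any increase over the window. Which interfaces actually carry fabric traffic, and the sampling interval, are environment-specific assumptions.

```python
import glob, os, time

# Sketch: sample rx_dropped / tx_dropped for every interface twice and report
# any increase. Which interfaces face the GPU fabric is environment-specific.
def read_drops():
    drops = {}
    for stats_dir in glob.glob("/sys/class/net/*/statistics"):
        iface = stats_dir.split("/")[4]
        counts = {}
        for name in ("rx_dropped", "tx_dropped"):
            with open(os.path.join(stats_dir, name)) as f:
                counts[name] = int(f.read().strip())
        drops[iface] = counts
    return drops

if __name__ == "__main__":
    before = read_drops()
    time.sleep(10)
    after = read_drops()
    for iface, counts in after.items():
        for name, value in counts.items():
            delta = value - before.get(iface, {}).get(name, value)
            if delta > 0:
                print(f"{iface}: {name} increased by {delta} in 10 s")
```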
PCIe (PCI Express)
The host bus connecting GPUs to CPUs and other system devices.
Key DCGM metrics: DCGM_FI_DEV_PCIE_TX_THROUGHPUT / DCGM_FI_DEV_PCIE_RX_THROUGHPUT
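A common silent failure is a GPU that renegotiates to a narrower or slower PCIe link. The sketch below compares each NVIDIA device's negotiated link speed and width against its maximum using standard sysfs attributes; note that GPUs may downclock the link at idle, so run it while the devices are under load to avoid false positives.

```python
import glob, os

# Sketch: flag NVIDIA PCIe devices whose negotiated link speed or width is
# below the maximum the slot and device support (e.g. a GPU stuck at x8).
# GPUs can downclock the link at idle, so sample under load.
NVIDIA_VENDOR_ID = "0x10de"

def degraded_pcie_links():
    degraded = []
    for dev in glob.glob("/sys/bus/pci/devices/*"):
        try:
            with open(os.path.join(dev, "vendor")) as f:
                if f.read().strip() != NVIDIA_VENDOR_ID:
                    continue
            attrs = {}
            for name in ("current_link_speed", "max_link_speed",
                         "current_link_width", "max_link_width"):
                with open(os.path.join(dev, name)) as f:
                    attrs[name] = f.read().strip()
        except OSError:
            continue  # attribute missing (e.g. functions without a link)
        if (attrs["current_link_speed"] != attrs["max_link_speed"]
                or attrs["current_link_width"] != attrs["max_link_width"]):
            degraded.append((os.path.basename(dev), attrs))
    return degraded

if __name__ == "__main__":
    for bdf, attrs in degraded_pcie_links():
        print(f"{bdf}: running x{attrs['current_link_width']} @ {attrs['current_link_speed']} "
              f"(max x{attrs['max_link_width']} @ {attrs['max_link_speed']})")
```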
RoCE (RDMA over Converged Ethernet)
RDMA networking over Ethernet for GPU cluster communication.
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)
In-network compute for accelerating collective operations.
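In NCCL-based stacks, SHARP offload is typically requested through NCCL's CollNet path; the sketch below shows one way to launch a job with that request plus the debug logging needed to confirm it engaged. The variable values, the placeholder train.sh launcher, and plugin availability all depend on the installed NCCL SHARP plugin and fabric configuration, so treat this as an assumption-laden starting point rather than a recipe.

```python
import os
import subprocess

# Sketch: launch a distributed training command with SHARP-backed collectives
# requested through NCCL's CollNet path. Whether offload actually engages depends
# on the installed NCCL SHARP plugin and switch-side configuration; verify via
# the NCCL debug output that CollNet was selected.
env = dict(os.environ)
env.update({
    "NCCL_COLLNET_ENABLE": "1",   # ask NCCL to use its CollNet (SHARP) transport
    "NCCL_DEBUG": "INFO",         # log whether CollNet/SHARP was selected
})

# train.sh is a placeholder for your actual launcher (torchrun, mpirun, ...).
subprocess.run(["bash", "train.sh"], env=env, check=True)
```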