Networking

10 terms

GPU cluster performance is ultimately bounded by the interconnect fabric. Within a node, NVLink and NVSwitch provide the high-bandwidth, low-latency links that let GPUs exchange gradients and activations at hundreds of gigabytes per second. Between nodes, InfiniBand and RoCE fabrics carry collective-communication traffic, and a single degraded port or misconfigured switch can silently cut distributed training throughput by 30% or more without raising any application-level error. This section covers the networking primitives essential to GPU infrastructure: NVLink lane counts and bandwidth specifications, GPUDirect RDMA for bypassing the CPU on the data path, and SHARP in-network aggregation that offloads collective operations to the switch fabric. Each term includes monitoring guidance, expected bandwidth baselines, and the degradation signals that Factryze tracks to catch network issues before they impact training jobs.
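
To make the idea of a degradation signal concrete, the sketch below polls per-link NVLink state and error counters through NVML using the pynvml bindings. It is a minimal illustration, not Factryze's tooling; the specific counters shown (data-link replay and CRC-flit errors) are assumptions, and their availability varies by GPU generation and NVML version.

```python
# Minimal sketch: enumerate NVLink links on each GPU and report error counters
# that tend to grow on a flaky link. Requires the NVIDIA driver and `pip install pynvml`.
import pynvml

pynvml.nvmlInit()
try:
    for gpu in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu)
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
            except pynvml.NVMLError:
                continue  # link not present on this GPU/topology
            if state != pynvml.NVML_FEATURE_ENABLED:
                print(f"gpu{gpu} link{link}: DOWN")
                continue
            # Replay and CRC counters that keep climbing suggest a degraded link
            # that can silently throttle collective throughput.
            replay = pynvml.nvmlDeviceGetNvLinkErrorCounter(
                handle, link, pynvml.NVML_NVLINK_ERROR_DL_REPLAY)
            crc = pynvml.nvmlDeviceGetNvLinkErrorCounter(
                handle, link, pynvml.NVML_NVLINK_ERROR_DL_CRC_FLIT)
            print(f"gpu{gpu} link{link}: replay={replay} crc_flit={crc}")
finally:
    pynvml.nvmlShutdown()
```

In practice a monitoring agent would sample these counters on an interval and alert on their rate of increase rather than their absolute value, since small historical counts are common on otherwise healthy links.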