NVLink
NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
What it is
NVLink is NVIDIA's proprietary high-bandwidth, low-latency interconnect for direct GPU-to-GPU communication, providing up to 900 GB/s bidirectional bandwidth on H100 GPUs (NVLink 4.0). It eliminates the PCIe bottleneck for multi-GPU workloads by enabling GPUs to access each other's memory directly. In large-scale training clusters, NVLink is combined with NVSwitch to create a fully connected GPU fabric within a node.
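To make the bandwidth gap concrete, here is a minimal back-of-the-envelope sketch comparing idealized transfer times over PCIe and NVLink. The bandwidth figures are nominal per-direction peaks assumed for illustration (PCIe Gen5 x16 at roughly 64 GB/s, H100 NVLink 4.0 at roughly 450 GB/s per direction), not measurements:

```python
# Sketch: why NVLink matters for GPU-to-GPU transfers.
# Bandwidth figures below are assumed nominal peaks, not measurements.

def transfer_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Idealized time in ms to move size_gb at bandwidth_gbps (GB/s)."""
    return size_gb / bandwidth_gbps * 1000.0

PCIE_GEN5_X16 = 64.0   # GB/s, one direction (assumed peak)
NVLINK_4 = 450.0       # GB/s, one direction (assumed peak)

size_gb = 10.0  # e.g. a 10 GB gradient buffer
pcie = transfer_ms(size_gb, PCIE_GEN5_X16)
nvl = transfer_ms(size_gb, NVLINK_4)
print(f"PCIe:   {pcie:.1f} ms")            # → PCIe:   156.2 ms
print(f"NVLink: {nvl:.1f} ms ({pcie / nvl:.1f}x faster)")
```

Real transfers never hit these peaks, but the ratio (roughly 7x here) is why NCCL prefers NVLink paths whenever the topology allows it.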
Why it matters
NVLink bandwidth determines the speed of AllReduce and other collective operations within a node. A single degraded or failed NVLink reduces intra-node collective throughput and forces NCCL to use slower paths, making that node the straggler in every synchronization step. Degraded NVLink bandwidth is invisible to GPU utilization metrics but directly cuts training throughput.
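The straggler effect can be sketched with the standard ring-AllReduce cost model (an assumption of this example; NCCL may choose other algorithms). Each GPU transfers about 2(N-1)/N of the buffer, paced by the slowest link in the ring, so one degraded NVLink gates every GPU:

```python
# Sketch of the ring-AllReduce bandwidth term (illustrative model only;
# NCCL may select tree or other algorithms in practice).
# Step time ~ 2*(N-1)/N * S / B, where B is the bandwidth of the
# *slowest* link in the ring -- one degraded link slows everyone.

def ring_allreduce_s(size_gb: float, n_gpus: int, link_bw_gbps: list[float]) -> float:
    """Bandwidth-only estimate of ring AllReduce time in seconds."""
    bottleneck = min(link_bw_gbps)  # slowest link paces the whole ring
    return 2 * (n_gpus - 1) / n_gpus * size_gb / bottleneck

healthy = [450.0] * 8              # 8 GPUs, nominal per-direction NVLink bw
degraded = [450.0] * 7 + [150.0]   # one link running at a third of nominal

size = 10.0  # GB
print(f"healthy:  {ring_allreduce_s(size, 8, healthy):.3f} s")
print(f"degraded: {ring_allreduce_s(size, 8, degraded):.3f} s")  # 3x slower
```

Note that GPU utilization can look normal in the degraded case: the GPUs are busy, just waiting longer in each collective.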
How to monitor
Track DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL per link and compare across all links on the same GPU to spot asymmetric degradation. Correlate with NVLink CRC and replay counters for error-driven bandwidth loss. Factryze maintains a continuously updated NVLink topology model and flags bandwidth anomalies that indicate physical link degradation before they cause NCCL timeouts.
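The per-link comparison above can be sketched as a simple peer check: flag any link whose measured bandwidth falls well below the median of the other links on the same GPU. The sample readings and the 20% threshold here are illustrative assumptions; in practice the per-link values would come from DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL samples:

```python
# Sketch: flag asymmetric NVLink degradation by comparing each link's
# bandwidth against the median of its peers on the same GPU.
# Sample values and the 20% threshold are illustrative assumptions.
from statistics import median

def degraded_links(per_link_gbps: dict[int, float], tol: float = 0.20) -> list[int]:
    """Return link IDs whose bandwidth falls more than tol below the median."""
    med = median(per_link_gbps.values())
    return [lid for lid, bw in per_link_gbps.items() if bw < (1 - tol) * med]

# Hypothetical per-link readings for one H100 GPU (18 NVLink 4.0 links)
links = {i: 48.0 for i in range(18)}  # ~50 GB/s-class links, in GB/s
links[11] = 22.0                      # link 11 running well below its peers

print(degraded_links(links))  # → [11]
```

Comparing against the median of sibling links, rather than a fixed threshold, keeps the check robust to workload-dependent swings in absolute bandwidth.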