NVLink
NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
What it is
NVLink is NVIDIA's proprietary high-bandwidth, low-latency interconnect for direct GPU-to-GPU communication, providing up to 900 GB/s bidirectional bandwidth on H100 GPUs (NVLink 4.0). It eliminates the PCIe bottleneck for multi-GPU workloads by enabling GPUs to access each other's memory directly. In large-scale training clusters, NVLink is combined with NVSwitch to create a fully connected GPU fabric within a node.
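To make the bandwidth gap concrete, here is a minimal back-of-the-envelope sketch comparing idealized transfer times over PCIe and NVLink. The bandwidth figures are nominal per-direction peaks assumed for illustration (PCIe Gen5 x16 at roughly 64 GB/s, H100 NVLink 4.0 at roughly 450 GB/s per direction), not measurements:

```python
# Sketch: why NVLink matters for GPU-to-GPU transfers.
# Bandwidth figures below are assumed nominal peaks, not measurements.

def transfer_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Idealized time in ms to move size_gb at bandwidth_gbps (GB/s)."""
    return size_gb / bandwidth_gbps * 1000.0

PCIE_GEN5_X16 = 64.0   # GB/s, one direction (assumed peak)
NVLINK_4 = 450.0       # GB/s, one direction (assumed peak)

size_gb = 10.0  # e.g. a 10 GB gradient buffer
pcie = transfer_ms(size_gb, PCIE_GEN5_X16)
nvl = transfer_ms(size_gb, NVLINK_4)
print(f"PCIe:   {pcie:.1f} ms")            # → PCIe:   156.2 ms
print(f"NVLink: {nvl:.1f} ms ({pcie / nvl:.1f}x faster)")
```

Real transfers never hit these peaks, but the ratio (roughly 7x here) is why NCCL prefers NVLink paths whenever the topology allows it.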
Why it matters
NVLink bandwidth determines the speed of AllReduce and other collective operations within a node. A single degraded or failed NVLink reduces intra-node collective throughput and forces NCCL to use slower paths, making that node the straggler in every synchronization step. Degraded NVLink bandwidth is invisible to GPU utilization metrics but directly cuts training throughput.
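The straggler effect can be sketched with the standard ring-AllReduce cost model (an assumption of this example; NCCL may choose other algorithms). Each GPU transfers about 2(N-1)/N of the buffer, paced by the slowest link in the ring, so one degraded NVLink gates every GPU:

```python
# Sketch of the ring-AllReduce bandwidth term (illustrative model only;
# NCCL may select tree or other algorithms in practice).
# Step time ~ 2*(N-1)/N * S / B, where B is the bandwidth of the
# *slowest* link in the ring -- one degraded link slows everyone.

def ring_allreduce_s(size_gb: float, n_gpus: int, link_bw_gbps: list[float]) -> float:
    """Bandwidth-only estimate of ring AllReduce time in seconds."""
    bottleneck = min(link_bw_gbps)  # slowest link paces the whole ring
    return 2 * (n_gpus - 1) / n_gpus * size_gb / bottleneck

healthy = [450.0] * 8              # 8 GPUs, nominal per-direction NVLink bw
degraded = [450.0] * 7 + [150.0]   # one link running at a third of nominal

size = 10.0  # GB
print(f"healthy:  {ring_allreduce_s(size, 8, healthy):.3f} s")
print(f"degraded: {ring_allreduce_s(size, 8, degraded):.3f} s")  # 3x slower
```

Note that GPU utilization can look normal in the degraded case: the GPUs are busy, just waiting longer in each collective.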
How to monitor
Track DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL per link and compare across all links on the same GPU to spot asymmetric degradation. Correlate with NVLink CRC and replay counters for error-driven bandwidth loss. Factryze maintains a continuously updated NVLink topology model and flags bandwidth anomalies that indicate physical link degradation before they cause NCCL timeouts.
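The per-link comparison above can be sketched as a simple peer check: flag any link whose measured bandwidth falls well below the median of the other links on the same GPU. The sample readings and the 20% threshold here are illustrative assumptions; in practice the per-link values would come from DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL samples:

```python
# Sketch: flag asymmetric NVLink degradation by comparing each link's
# bandwidth against the median of its peers on the same GPU.
# Sample values and the 20% threshold are illustrative assumptions.
from statistics import median

def degraded_links(per_link_gbps: dict[int, float], tol: float = 0.20) -> list[int]:
    """Return link IDs whose bandwidth falls more than tol below the median."""
    med = median(per_link_gbps.values())
    return [lid for lid, bw in per_link_gbps.items() if bw < (1 - tol) * med]

# Hypothetical per-link readings for one H100 GPU (18 NVLink 4.0 links)
links = {i: 48.0 for i in range(18)}  # ~50 GB/s-class links, in GB/s
links[11] = 22.0                      # link 11 running well below its peers

print(degraded_links(links))  # → [11]
```

Comparing against the median of sibling links, rather than a fixed threshold, keeps the check robust to workload-dependent swings in absolute bandwidth.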