Networking

NVLink

NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.

What it is

NVLink is NVIDIA's proprietary high-bandwidth, low-latency interconnect for direct GPU-to-GPU communication, providing up to 900 GB/s of total bidirectional bandwidth per GPU on H100 (NVLink 4.0). It eliminates the PCIe bottleneck for multi-GPU workloads by enabling GPUs to access each other's memory directly. In large-scale training clusters, NVLink is combined with NVSwitch to create a fully connected GPU fabric within a node.
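The headline per-GPU figure is just the link count times the per-link rate. A quick sanity check, using NVIDIA's published NVLink specs (12 links on A100/NVLink 3.0, 18 links on H100/NVLink 4.0, each 50 GB/s bidirectional):

```python
def nvlink_total_bw(links: int, per_link_bidir_gbps: float) -> float:
    """Total bidirectional NVLink bandwidth per GPU, in GB/s:
    number of links x per-link bidirectional rate."""
    return links * per_link_bidir_gbps

# A100, NVLink 3.0: 12 links x 50 GB/s = 600 GB/s
print(nvlink_total_bw(12, 50))  # -> 600
# H100, NVLink 4.0: 18 links x 50 GB/s = 900 GB/s
print(nvlink_total_bw(18, 50))  # -> 900
```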

Why it matters

NVLink bandwidth determines the speed of AllReduce and other collective operations within a node. A single degraded or failed NVLink reduces intra-node collective throughput and forces NCCL to use slower paths, making that node the straggler in every synchronization step. Degraded NVLink bandwidth is invisible to GPU utilization metrics but directly cuts training throughput.

How to monitor

Track DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL per link and compare across all links on the same GPU to spot asymmetric degradation. Correlate with NVLink CRC and replay counters for error-driven bandwidth loss. Factryze maintains a continuously updated NVLink topology model and flags bandwidth anomalies that indicate physical link degradation before they cause NCCL timeouts.
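One way to sketch the cross-link comparison: flag any link whose sampled bandwidth falls well below the median of its peers on the same GPU. The function name, input shape, and 10% tolerance are illustrative assumptions, not a DCGM API; the samples would come from polling DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL per link:

```python
from statistics import median

def flag_degraded_links(link_bw_gbps: dict[int, float],
                        tolerance: float = 0.10) -> list[int]:
    """Return link indices whose measured bandwidth is more than
    `tolerance` below the median of all links on the same GPU.
    `link_bw_gbps` maps link index -> a bandwidth sample in GB/s
    (hypothetical input shape; e.g. parsed from a DCGM poll)."""
    m = median(link_bw_gbps.values())
    return sorted(
        link for link, bw in link_bw_gbps.items()
        if bw < (1 - tolerance) * m
    )

# Link 3 running at half the bandwidth of its peers:
samples = {0: 49.8, 1: 50.1, 2: 49.9, 3: 25.0}
print(flag_degraded_links(samples))  # -> [3]
```

Comparing against the median of sibling links, rather than an absolute threshold, makes the check robust to workload phases where all links are legitimately idle or busy.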

[Figure: NVLink - Direct GPU-to-GPU Communication]
DCGM metric field: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
