NVLink Errors
CRC errors and replay events on NVLink GPU-to-GPU connections.
What it is
NVLink errors are link-level faults on NVIDIA NVLink interconnects, including CRC errors (data corruption in transit), replay events (retransmission of corrupted packets), and recovery events (link retraining after persistent errors). Low-rate CRC errors are normal in high-speed serial links, but sustained elevated rates indicate a degrading physical link.
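The difference between a normal low error rate and a sustained elevated one can be sketched in a few lines. The example below is a minimal illustration, not a recommended policy. It assumes you already collect cumulative counter samples as (timestamp in seconds, count) pairs. The function names, the threshold of 1 error per second, and the three-interval window are hypothetical values chosen for the example.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, cumulative error count)


def interval_rates(samples: Sequence[Sample]) -> List[float]:
    """Convert cumulative counter samples into per-interval error rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = c1 - c0
        if delta < 0:
            # Counter went backwards: assume it was reset and restarted from 0.
            delta = c1
        rates.append(delta / dt)
    return rates


def is_sustained(samples: Sequence[Sample],
                 threshold_per_s: float = 1.0,   # hypothetical threshold
                 min_consecutive: int = 3) -> bool:  # hypothetical window
    """True if the rate exceeds the threshold for several intervals in a row."""
    run = 0
    for rate in interval_rates(samples):
        run = run + 1 if rate > threshold_per_s else 0
        if run >= min_consecutive:
            return True
    return False


healthy = [(0, 0), (10, 1), (20, 1), (30, 2), (40, 2)]
burst = [(0, 0), (10, 100), (20, 100), (30, 100)]
degrading = [(0, 0), (10, 50), (20, 120), (30, 200), (40, 310)]

print(is_sustained(healthy), is_sustained(burst), is_sustained(degrading))
```

Requiring several consecutive intervals above the threshold keeps a single short burst from being flagged, while a link whose error rate stays high is caught. Suitable thresholds depend on the hardware and sampling interval.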
Why it matters
Sustained elevated CRC or replay rates degrade collective communication throughput and can cause NCCL timeouts that stall entire training jobs. A rapidly increasing replay count often precedes a full link failure reported as Xid 74, which takes the link offline and forces NCCL to reroute or abort. A single degraded NVLink can make the affected GPU the straggler in every AllReduce operation.
How to monitor
Track DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL and DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL for each GPU. These _TOTAL fields sum across all of a GPU's links, so use the per-link variants (suffixed _L0, _L1, and so on) to isolate the faulty link. Correlate Xid 74 events in dmesg with counter spikes to confirm a link-level origin. Factryze monitors per-link NVLink counters continuously and correlates them with NCCL timeout events to identify the root-cause link before a job abort occurs.
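As a rough sketch of the correlation step, the example below pulls Xid 74 events out of kernel log lines and checks whether each one was preceded by a counter spike on the same GPU. Several things here are assumptions: the sample log lines are made up in the general shape of NVRM Xid messages (the exact text varies by driver version), the spike timestamps are assumed to be on the same clock as the log timestamps, and the helper names and the 300-second window are hypothetical.

```python
import re
from typing import Dict, List, NamedTuple, Sequence

# Matches lines shaped like: "[ 1042.118223] NVRM: Xid (PCI:0000:3b:00): 74, ..."
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+),"
)


class XidEvent(NamedTuple):
    ts: float   # seconds, as printed in the log line
    pci: str    # PCI address of the reporting GPU
    xid: int


def parse_xid_events(lines: Sequence[str], xid: int = 74) -> List[XidEvent]:
    """Return events with the requested Xid code found in the log lines."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m and int(m.group("xid")) == xid:
            events.append(XidEvent(float(m.group("ts")), m.group("pci"), xid))
    return events


def preceded_by_spike(events: Sequence[XidEvent],
                      spikes: Dict[str, List[float]],
                      window_s: float = 300.0) -> List[XidEvent]:
    """Keep events that had a counter spike on the same GPU within window_s before."""
    return [
        e for e in events
        if any(0 <= e.ts - s <= window_s for s in spikes.get(e.pci, []))
    ]


# Illustrative lines only, not captured from a real system.
sample_log = [
    "[ 1042.118223] NVRM: Xid (PCI:0000:3b:00): 74, NVLink error on link 2",
    "[ 1100.000001] NVRM: Xid (PCI:0000:5e:00): 79, GPU has fallen off the bus",
]
# Hypothetical spike times per GPU, e.g. from a rate check on the replay counter.
spike_times = {"0000:3b:00": [900.0]}

events = parse_xid_events(sample_log)
confirmed = preceded_by_spike(events, spike_times)
print(confirmed)
```

An Xid 74 event with a recent spike on the same GPU supports a link-level origin. One without a matching spike may need a wider window or a closer look at other causes.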
Related terms
NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
Collective communication failures in NVIDIA NCCL stalling distributed training.
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.