Errors & Failures

NVLink Errors

CRC errors and replay events on NVLink GPU-to-GPU connections.

What it is

NVLink errors are link-level faults on NVIDIA NVLink interconnects, including CRC errors (data corruption in transit), replay events (retransmission of corrupted packets), and recovery events (link retraining after persistent errors). Low-rate CRC errors are normal in high-speed serial links, but sustained elevated rates indicate a degrading physical link.

Why it matters

Sustained elevated CRC or replay rates degrade collective communication throughput and can cause NCCL timeouts that stall entire training jobs. A rapidly increasing replay count often precedes a full link failure reported as Xid 74, which takes the link offline and forces NCCL to reroute or abort. A single degraded NVLink can make the affected GPU the straggler in every AllReduce operation.
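One way to make "rapidly increasing" concrete is to turn cumulative counter samples into per-second rates and check whether the rate itself keeps climbing. The sketch below assumes the samples are (timestamp in seconds, cumulative count) pairs for a single link. The function names, the 1 event/s floor, and the 3-interval window are illustrative choices, not values from NVIDIA or DCGM documentation.

```python
from typing import List, Tuple


def counter_rates(samples: List[Tuple[float, int]]) -> List[float]:
    """Convert (timestamp_seconds, cumulative_count) samples into per-second rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = c1 - c0
        if delta < 0:
            # Assumption: a drop means the counter was reset (e.g. driver reload),
            # so count the new value from zero.
            delta = c1
        rates.append(delta / dt)
    return rates


def is_accelerating(rates: List[float], min_rate: float = 1.0, window: int = 3) -> bool:
    """True if the last `window` rates strictly increase and the latest is >= min_rate.

    min_rate and window are hypothetical thresholds; tune them per cluster.
    """
    if len(rates) < window:
        return False
    recent = rates[-window:]
    rising = all(a < b for a, b in zip(recent, recent[1:]))
    return rising and recent[-1] >= min_rate


# Illustrative replay-counter samples taken once per minute (made-up numbers).
samples = [(0, 10), (60, 12), (120, 20), (180, 80), (240, 400)]
print(is_accelerating(counter_rates(samples)))
```

A steady low rate does not trip this check, which matches the point that low-rate CRC errors are normal; only a rate that keeps rising past the floor is flagged.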

How to monitor

Track DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL and DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL per link. Both fields are cumulative totals, so alert on their rate of change rather than on the absolute value. Correlate Xid 74 events in dmesg with counter spikes to confirm that a failure originated at the link level. Factryze monitors per-link NVLink counters continuously and correlates them with NCCL timeout events to identify the root-cause link before a job abort occurs.
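The Xid correlation step can be sketched as a small log parser plus a time-window match. This is a minimal sketch under several assumptions: the dmesg line carries a seconds-since-boot timestamp in square brackets, the Xid entry follows the common "NVRM: Xid (PCI:<bus id>): <code>," prefix, and counter-spike timestamps have already been converted to the same clock. The sample line is illustrative only (the text after the Xid code varies by driver version), and the 300-second window is a hypothetical default.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. "[12345.678901] NVRM: Xid (PCI:0000:3b:00): 74, ..."
XID_RE = re.compile(
    r"\[\s*(\d+\.\d+)\].*?NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),"
)


def parse_xid_line(line: str) -> Optional[Tuple[float, str, int]]:
    """Return (timestamp_seconds, pci_bus_id, xid_code), or None if not an Xid line."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), m.group(2), int(m.group(3))


def spikes_before(xid_time: float, spike_times: List[float],
                  window_s: float = 300.0) -> List[float]:
    """Counter-spike timestamps that fall within window_s seconds before the Xid."""
    return [t for t in spike_times if 0 <= xid_time - t <= window_s]


# Illustrative log line, not captured from a real system.
sample = ("[12345.678901] NVRM: Xid (PCI:0000:3b:00): 74, "
          "pid='<unknown>', name=<unknown>, NVLink: fatal error detected")

event = parse_xid_line(sample)
if event is not None and event[2] == 74:
    ts, bus_id, _ = event
    # Hypothetical spike times for the links on this GPU, same clock as dmesg.
    preceding = spikes_before(ts, [11000.0, 12100.0, 12300.0])
    print(bus_id, preceding)
```

If an Xid 74 has one or more counter spikes shortly before it on the same GPU, that supports a link-level origin; an Xid 74 with no preceding spikes deserves a closer look at other causes.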

Figure: NVLink Errors - CRC, Replay, and Recovery
DCGM Metric Field
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
