NCCL Errors
Collective communication failures in NVIDIA NCCL stalling distributed training.
What it is
NCCL errors are failures in NVIDIA's Collective Communications Library, which implements AllReduce, AllGather, Broadcast, and the other multi-GPU collective operations used by every major distributed training framework. The most common NCCL error is the watchdog timeout, which fires when any rank fails to complete its portion of a collective within the configured process-group timeout (the default and the knob vary by framework; PyTorch sets it via the timeout argument to torch.distributed.init_process_group). Other errors include unhandled system error from failed InfiniBand verbs, remote peer not found from network topology changes, and invalid usage from mismatched collective calls across ranks.
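The watchdog pattern behind these timeouts can be sketched in pure Python. This is a toy model, not NCCL's implementation: run_with_watchdog and the 2-second timeout are illustrative names and values, and the "collective" is just a callable.

```python
import threading

WATCHDOG_TIMEOUT_S = 2.0  # illustrative; real framework defaults are minutes, not seconds

def run_with_watchdog(collective, timeout=WATCHDOG_TIMEOUT_S):
    """Run a collective op in a worker thread; raise TimeoutError if it does
    not finish in time, mimicking how a watchdog surfaces a stuck rank."""
    done = threading.Event()
    result = {}

    def worker():
        result["value"] = collective()
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(timeout):
        raise TimeoutError("watchdog fired: collective did not complete in time")
    return result["value"]

# A healthy collective completes; a stalled one trips the watchdog.
print(run_with_watchdog(lambda: sum([1, 2, 3])))  # prints 6
try:
    run_with_watchdog(lambda: threading.Event().wait(), timeout=0.1)
except TimeoutError as e:
    print(e)
```

The key property the toy shares with the real watchdog: the error is raised by the timer, not by the operation itself, which is why the stack trace rarely points at the true cause.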
Why it matters
NCCL errors are notoriously difficult to diagnose because the timeout often surfaces on a different rank than the root cause: a single NVLink degraded to 75% of its bandwidth on one GPU can time out all 256 GPUs in a training job. Because every collective waits for its slowest participant, a single straggler rank stalls the entire communicator and destroys throughput across the whole job. Misdiagnosis leads to full-job restarts that waste GPU-hours across the entire cluster when only one node required remediation.
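The straggler effect follows from a simple property of synchronous collectives: step time is the maximum over ranks, not the mean. A minimal sketch with illustrative numbers (255 healthy ranks at 1 s per step, one degraded rank at 40 s):

```python
# Toy model: a collective completes only when the slowest rank arrives,
# so per-step time is max(per-rank time), not mean(per-rank time).
per_rank_step_seconds = [1.0] * 255 + [40.0]  # one degraded rank among 256

healthy_step = max(per_rank_step_seconds[:255])  # 1.0 s without the straggler
actual_step = max(per_rank_step_seconds)         # 40.0 s with it
slowdown = actual_step / healthy_step
print(f"one straggler slows all 256 ranks by {slowdown:.0f}x")  # prints 40x
```

Averaging utilization across the job hides this entirely; only per-rank timing exposes the one slow participant.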
How to monitor
Correlate NCCL_DEBUG=INFO log output with per-link NVLink bandwidth counters (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL), InfiniBand port error counters, and Xid events across all nodes in the job. Match the timeline of degradation to identify the first straggler. Factryze automates this cross-node correlation by ingesting NCCL debug logs alongside DCGM telemetry and Xid events, pinpointing the root-cause node and triggering targeted remediation rather than a blind full-job restart.
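The cross-node correlation step reduces to ordering degradation events by timestamp and attributing the failure to the earliest one. A minimal sketch, using synthetic events in a hypothetical format (not verbatim NCCL or DCGM output):

```python
from datetime import datetime

# Synthetic per-node health events gathered from logs and telemetry.
# The node whose degradation appears FIRST is the likely root cause, even
# though the NCCL timeouts themselves fire later on other ranks.
events = [
    ("node07", "2024-05-01T10:04:31", "NCCL watchdog collective timeout"),
    ("node03", "2024-05-01T10:01:02", "NVLink bandwidth below threshold"),
    ("node12", "2024-05-01T10:04:33", "NCCL watchdog collective timeout"),
]

first = min(events, key=lambda e: datetime.fromisoformat(e[1]))
print(f"first degradation on {first[0]}: {first[2]}")
```

Here the timeouts on node07 and node12 are symptoms; sorting by time points remediation at node03, whose NVLink counters degraded three minutes earlier.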