NCCL Errors
Collective communication failures in NVIDIA NCCL stalling distributed training.
What it is
NCCL errors are failures in NVIDIA's Collective Communications Library, which implements AllReduce, AllGather, Broadcast, and the other multi-GPU collective operations used by every major distributed training framework. The most common NCCL error is the watchdog timeout, which fires when any rank fails to complete its portion of a collective within the configured process-group timeout (the default and the knob vary by framework; PyTorch sets it via the timeout argument to torch.distributed.init_process_group). Other errors include unhandled system error from failed InfiniBand verbs, remote peer not found from network topology changes, and invalid usage from mismatched collective calls across ranks.
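The watchdog pattern behind these timeouts can be sketched in pure Python. This is a toy model, not NCCL's implementation: run_with_watchdog and the 2-second timeout are illustrative names and values, and the "collective" is just a callable.

```python
import threading

WATCHDOG_TIMEOUT_S = 2.0  # illustrative; real framework defaults are minutes, not seconds

def run_with_watchdog(collective, timeout=WATCHDOG_TIMEOUT_S):
    """Run a collective op in a worker thread; raise TimeoutError if it does
    not finish in time, mimicking how a watchdog surfaces a stuck rank."""
    done = threading.Event()
    result = {}

    def worker():
        result["value"] = collective()
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(timeout):
        raise TimeoutError("watchdog fired: collective did not complete in time")
    return result["value"]

# A healthy collective completes; a stalled one trips the watchdog.
print(run_with_watchdog(lambda: sum([1, 2, 3])))  # prints 6
try:
    run_with_watchdog(lambda: threading.Event().wait(), timeout=0.1)
except TimeoutError as e:
    print(e)
```

The key property the toy shares with the real watchdog: the error is raised by the timer, not by the operation itself, which is why the stack trace rarely points at the true cause.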
Why it matters
NCCL errors are notoriously difficult to diagnose because the timeout often surfaces on a different rank than the root cause: a single NVLink degraded to 75% of its bandwidth on one GPU can time out all 256 GPUs in a training job. Because every collective waits for its slowest participant, a single straggler rank stalls the entire communicator and destroys throughput across the whole job. Misdiagnosis leads to full-job restarts that waste GPU-hours across the entire cluster when only one node required remediation.
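The straggler effect follows from a simple property of synchronous collectives: step time is the maximum over ranks, not the mean. A minimal sketch with illustrative numbers (255 healthy ranks at 1 s per step, one degraded rank at 40 s):

```python
# Toy model: a collective completes only when the slowest rank arrives,
# so per-step time is max(per-rank time), not mean(per-rank time).
per_rank_step_seconds = [1.0] * 255 + [40.0]  # one degraded rank among 256

healthy_step = max(per_rank_step_seconds[:255])  # 1.0 s without the straggler
actual_step = max(per_rank_step_seconds)         # 40.0 s with it
slowdown = actual_step / healthy_step
print(f"one straggler slows all 256 ranks by {slowdown:.0f}x")  # prints 40x
```

Averaging utilization across the job hides this entirely; only per-rank timing exposes the one slow participant.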
How to monitor
Correlate NCCL_DEBUG=INFO log output with per-link NVLink bandwidth counters (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL), InfiniBand port error counters, and Xid events across all nodes in the job. Match the timeline of degradation to identify the first straggler. Factryze automates this cross-node correlation by ingesting NCCL debug logs alongside DCGM telemetry and Xid events, pinpointing the root-cause node and triggering targeted remediation rather than a blind full-job restart.
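The cross-node correlation step reduces to ordering degradation events by timestamp and attributing the failure to the earliest one. A minimal sketch, using synthetic events in a hypothetical format (not verbatim NCCL or DCGM output):

```python
from datetime import datetime

# Synthetic per-node health events gathered from logs and telemetry.
# The node whose degradation appears FIRST is the likely root cause, even
# though the NCCL timeouts themselves fire later on other ranks.
events = [
    ("node07", "2024-05-01T10:04:31", "NCCL watchdog collective timeout"),
    ("node03", "2024-05-01T10:01:02", "NVLink bandwidth below threshold"),
    ("node12", "2024-05-01T10:04:33", "NCCL watchdog collective timeout"),
]

first = min(events, key=lambda e: datetime.fromisoformat(e[1]))
print(f"first degradation on {first[0]}: {first[2]}")
```

Here the timeouts on node07 and node12 are symptoms; sorting by time points remediation at node03, whose NVLink counters degraded three minutes earlier.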