Errors & Failures

NCCL Errors

Collective communication failures in NVIDIA NCCL that stall distributed training.

What it is

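NCCL errors are failures in NVIDIA's Collective Communications Library (NCCL), which implements AllReduce, AllGather, Broadcast, and other multi-GPU collective operations used by most major distributed training frameworks on NVIDIA GPUs. The most common symptom is the watchdog timeout, which fires when any rank fails to complete its portion of a collective within the configured limit. In PyTorch the watchdog belongs to the framework's NCCL process group rather than to NCCL itself: the limit is the timeout argument of torch.distributed.init_process_group (10 minutes by default for the NCCL backend in recent releases), not an NCCL_TIMEOUT environment variable. Other errors include unhandled system error (ncclSystemError), typically from failed InfiniBand verbs calls; remote error (ncclRemoteError), when a peer becomes unreachable, for example after a network topology change; and invalid usage (ncclInvalidUsage), from mismatched collective calls across ranks.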
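As a rough illustration of the error classes above, the following sketch buckets log lines by keyword. The patterns and the sample lines are assumptions made for this example; actual message wording varies with the NCCL and framework version, so treat the regular expressions as a starting point rather than a reference.

```python
import re
from collections import Counter

# Keyword patterns are assumptions based on the error names in the text;
# real log wording differs between NCCL and framework versions.
PATTERNS = [
    ("timeout", re.compile(r"watchdog.*timeout|timed out", re.IGNORECASE)),
    ("system_error", re.compile(r"unhandled system error", re.IGNORECASE)),
    ("remote_error", re.compile(r"remote (peer|process|error)", re.IGNORECASE)),
    ("invalid_usage", re.compile(r"invalid usage", re.IGNORECASE)),
]


def classify_line(line):
    """Return the first matching error class for a log line, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count error classes over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    "rank 3: watchdog caught collective operation timeout",
    "rank 7: NCCL error: unhandled system error",
    "rank 7: NCCL error: unhandled system error",
    "rank 1: NCCL error: invalid usage",
    "rank 0: step 1200 loss 2.31",
]
print(summarize(sample_log))
```

A summary like this only says which class of error dominates. It does not say which rank caused it, since the rank that logs a timeout is often not the rank at fault.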

Why it matters

NCCL errors are notoriously difficult to diagnose because the timeout often surfaces on a different rank than the root cause. Collectives are synchronous, so a single straggler rank stalls the entire communicator: one degraded NVLink running at 75% bandwidth on one GPU can slow, and eventually time out, all 256 GPUs in a training job. Misdiagnosis leads to full-job restarts that waste many GPU-hours when only one node required remediation.
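The straggler effect can be made concrete with a small back-of-the-envelope model. This sketch assumes that a synchronous collective completes only when its slowest rank does, and that a rank's collective time scales inversely with its link bandwidth. The one-second healthy time and the choice of rank 17 are arbitrary values for illustration. It models the slowdown only, not the timeout itself.

```python
def step_time(per_rank_seconds):
    """A synchronous collective finishes only when the slowest rank does."""
    return max(per_rank_seconds)


def job_throughput_fraction(healthy_seconds, per_rank_seconds):
    """Throughput relative to an all-healthy job (1.0 means no loss)."""
    return healthy_seconds / step_time(per_rank_seconds)


WORLD_SIZE = 256
HEALTHY_SECONDS = 1.0      # hypothetical time per collective on a healthy rank
DEGRADED_BANDWIDTH = 0.75  # the 75% figure from the text

# Assumption: collective time scales inversely with link bandwidth.
times = [HEALTHY_SECONDS] * WORLD_SIZE
times[17] = HEALTHY_SECONDS / DEGRADED_BANDWIDTH  # rank 17 is arbitrary

fraction = job_throughput_fraction(HEALTHY_SECONDS, times)
print(f"step time: {step_time(times):.3f}s, job throughput: {fraction:.0%}")
```

Under these assumptions, one degraded rank out of 256 pulls the whole job down to the degraded rank's pace, which is why remediating that single node matters more than restarting everything.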

How to monitor

Correlate NCCL_DEBUG=INFO log output with NVLink bandwidth counters (for example DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, which aggregates across a GPU's links), InfiniBand port error counters, and Xid events across all nodes in the job. Line up the timelines and look for the earliest degradation before the timeout to identify the first straggler. Factryze automates this cross-node correlation by ingesting NCCL debug logs alongside DCGM telemetry and Xid events, pinpointing the root-cause node and triggering targeted remediation rather than a blind full-job restart.
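The timeline matching described above can be sketched as a search for the earliest hardware or fabric event in a window before the timeout. The Event record, the source labels, the ten-minute lookback, and the sample data are all hypothetical and chosen for this example. The sketch also assumes node clocks are synchronized closely enough for timestamps to be comparable. It is not a description of how Factryze or any other tool implements the correlation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since job start (assumes synchronized clocks)
    node: str
    source: str       # hypothetical labels: "dcgm", "ib", "xid", "nccl"
    detail: str


def first_straggler(events, timeout_at, lookback=600.0):
    """Return the earliest non-NCCL event in the window before the timeout."""
    window = [
        e for e in events
        if e.source != "nccl"
        and timeout_at - lookback <= e.timestamp <= timeout_at
    ]
    return min(window, key=lambda e: e.timestamp) if window else None


# Made-up events, for illustration only.
events = [
    Event(1000.0, "node-07", "dcgm", "NVLink bandwidth drop"),
    Event(1012.5, "node-07", "xid", "Xid event"),
    Event(1020.0, "node-12", "ib", "port error counter increase"),
    Event(1190.0, "node-03", "nccl", "watchdog timeout"),
]
timeout_at = 1190.0
suspect = first_straggler(events, timeout_at)
print(suspect.node if suspect else "no candidate")
```

In this sample the timeout is logged on node-03, but the earliest degradation in the window is on node-07, so node-07 would be the first candidate for remediation.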

Figure: NCCL Errors - Communication Failure in Distributed Training

Monitor this automatically

Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
