Errors & Failures
GPU errors range from correctable single-bit memory flips to catastrophic hardware failures that take an entire node offline. Understanding the difference between an ECC single-bit error (common, correctable, but a leading indicator of hardware degradation) and a double-bit error (rare, fatal, requiring immediate page retirement) is essential for any GPU operations team. This section covers the error types you will encounter when running GPU clusters at scale, from NVIDIA Xid driver errors and NCCL communication failures in distributed training to silent degradation modes like row remapping exhaustion and PCIe link width reduction. Each term includes the specific DCGM fields to monitor, threshold values that should trigger alerts, and the remediation steps that Factryze agents execute automatically.
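The single-bit versus double-bit distinction can be sketched as a small severity check. This is only an illustration: the function name `classify_ecc`, the threshold of 100, and the severity labels are assumptions made for this example, not values defined by DCGM or used by Factryze. The two inputs stand for readings of DCGM_FI_DEV_ECC_SBE_VOL_TOTAL and DCGM_FI_DEV_ECC_DBE_VOL_TOTAL that you have already collected by whatever means your monitoring stack provides.

```python
# Hypothetical alert threshold for single-bit errors; not a DCGM or vendor value.
SBE_WARN_THRESHOLD = 100


def classify_ecc(sbe_total: int, dbe_total: int,
                 sbe_warn: int = SBE_WARN_THRESHOLD) -> str:
    """Map volatile ECC counters to an illustrative severity label."""
    if sbe_total < 0 or dbe_total < 0:
        raise ValueError("ECC counters must be non-negative")
    # Any double-bit error is uncorrectable, so it outranks every SBE count.
    if dbe_total > 0:
        return "critical"
    # Single-bit errors are corrected, but a high count hints at degradation.
    if sbe_total >= sbe_warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify_ecc(sbe_total=3, dbe_total=0))
    print(classify_ecc(sbe_total=250, dbe_total=0))
    print(classify_ecc(sbe_total=3, dbe_total=1))
```

Checking the double-bit counter first is deliberate: a single uncorrectable error matters more than any number of corrected ones, so it should never be masked by the single-bit branch.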
CUDA Errors
CUDA runtime and driver API error codes indicating GPU compute failures.
Driver Crash
GPU kernel driver panic or hang requiring intervention to recover.
ECC Errors (Error-Correcting Code)
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
GPU Fallen Off Bus
Xid 79 error: GPU completely disconnects from the PCIe bus.
NCCL Errors
Collective communication failures in NVIDIA NCCL stalling distributed training.
NVLink Errors
CRC errors and replay events on NVLink GPU-to-GPU connections.
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
Page Retirement
NVIDIA driver permanently retiring faulty memory pages after ECC errors.
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
Row Remapping
Dynamic HBM repair mechanism replacing faulty memory rows on the fly.
DCGM_FI_DEV_ROW_REMAP_FAILURE / DCGM_FI_DEV_ROW_REMAP_PENDING
Uncorrectable Errors (DBE)
Double-bit ECC errors that corrupt data and halt computation.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
Xid Errors
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.
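Because Xid codes arrive as kernel log lines, a first step in handling them is usually extracting the code and the PCI address. The sketch below assumes the commonly seen "NVRM: Xid (<pci>): <code>, ..." layout; the exact text varies between driver versions, and the helper name `parse_xid` and the sample line are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed log layout: "NVRM: Xid (<pci address>): <code>, <details>".
# Driver versions differ in the details, so treat this pattern as a starting point.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")


def parse_xid(line: str) -> Optional[Tuple[str, int]]:
    """Return (pci_address, xid_code) for an Xid log line, or None if no match."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return match.group("pci"), int(match.group("code"))


if __name__ == "__main__":
    # Illustrative sample line, not captured from a real system.
    sample = "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
    print(parse_xid(sample))
    print(parse_xid("usb 1-1: new high-speed USB device"))
```

Returning None for non-matching lines lets the caller stream an entire kernel log through the function and keep only the Xid events, for example to flag code 79 (GPU fallen off the bus) separately from other codes.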