10 terms

Errors & Failures

GPU errors range from correctable single-bit memory flips to catastrophic hardware failures that take an entire node offline. Any GPU operations team needs to understand the difference between an ECC single-bit error (common and correctable, but a leading indicator of hardware degradation) and a double-bit error (rare, fatal, and requiring immediate page retirement). This section covers the error types you will encounter when running GPU clusters at scale: NVIDIA Xid driver errors, NCCL communication failures in distributed training, and silent degradation modes such as row remapping exhaustion and PCIe link width reduction. Each term includes the specific DCGM fields to monitor, the threshold values that should trigger alerts, and the remediation steps that Factryze agents execute automatically.
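As a rough illustration of the single-bit versus double-bit distinction, the sketch below maps ECC counters to an alert severity. It assumes the counters have already been collected (for example from the DCGM fields named in the comments); the `EccSample` type, the `classify` function, and the warning threshold of 100 are hypothetical and chosen only for illustration, not values taken from this glossary or from NVIDIA guidance.

```python
from dataclasses import dataclass

# Hypothetical threshold for illustration only; tune it for your own fleet.
SBE_WARN_THRESHOLD = 100


@dataclass
class EccSample:
    """One snapshot of ECC-related counters for a single GPU."""
    sbe_volatile: int        # e.g. from DCGM_FI_DEV_ECC_SBE_VOL_TOTAL
    dbe_volatile: int        # e.g. from DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
    row_remap_failure: bool  # e.g. DCGM_FI_DEV_ROW_REMAP_FAILURE is non-zero


def classify(sample: EccSample) -> str:
    """Return 'critical', 'warning', or 'ok' for one GPU snapshot."""
    # Any double-bit error or failed row remap is treated as critical.
    if sample.dbe_volatile > 0 or sample.row_remap_failure:
        return "critical"
    # Single-bit errors are corrected, but a high count suggests degradation.
    if sample.sbe_volatile >= SBE_WARN_THRESHOLD:
        return "warning"
    return "ok"
```

For instance, `classify(EccSample(sbe_volatile=3, dbe_volatile=1, row_remap_failure=False))` falls into the critical branch because of the single double-bit error, regardless of the single-bit count.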

CUDA Errors

Driver Crash

ECC Errors (Error-Correcting Code)

GPU Fallen Off Bus

NCCL Errors

NVLink Errors

Page Retirement

Row Remapping

Uncorrectable Errors (DBE)

Xid Errors