ECC Errors (Error-Correcting Code)
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
What it is
ECC errors are bit-flip faults in GPU HBM or SRAM detected by the hardware error-correcting code circuitry. Single-bit errors (SBE) are silently corrected in hardware, but an elevated SBE rate (typically a sustained rate above 1 error per hour) signals accelerating memory cell degradation and upcoming page retirements. Double-bit errors (DBE) are uncorrectable ECC errors that corrupt in-flight data, trigger Xid 48 in dmesg, and force immediate termination of the affected CUDA kernel.
Why it matters
ECC error rates are the single most predictive signal for imminent GPU failure. A GPU showing 50+ volatile SBEs within 24 hours is a strong candidate for proactive drain even before a DBE occurs, because SBE bursts precede uncorrectable failures in over 80% of observed cases. Any DBE immediately corrupts in-flight computation and halts the affected kernel.
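The drain rule described above can be sketched as a small check over periodic counter samples. This is a minimal illustration, not Factryze's implementation: the `Sample` type, the function names, and the reset handling (treating a drop in the volatile counter as a driver reload) are assumptions. Only the 50-SBE-in-24-hours threshold and the any-DBE rule come from the text.

```python
from dataclasses import dataclass

SBE_DRAIN_THRESHOLD = 50      # from the text: 50+ volatile SBEs ...
WINDOW_SECONDS = 24 * 3600    # ... within 24 hours


@dataclass
class Sample:
    """One reading of the volatile ECC counters (hypothetical type)."""
    ts: float      # Unix timestamp in seconds
    sbe_vol: int   # volatile single-bit error counter
    dbe_vol: int   # volatile double-bit error counter


def counter_increase(values):
    """Sum the increases between consecutive readings.

    A drop is treated as a counter reset (for example a driver reload),
    so the new value is counted as errors accumulated since the reset.
    """
    total = 0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def drain_decision(samples, now):
    """Return a drain recommendation for one GPU based on recent samples."""
    recent = sorted(
        (s for s in samples if 0 <= now - s.ts <= WINDOW_SECONDS),
        key=lambda s: s.ts,
    )
    # Any double-bit error is uncorrectable, so it always triggers a drain.
    if any(s.dbe_vol > 0 for s in recent):
        return "drain: DBE observed"
    # Otherwise compare the SBE increase inside the window to the threshold.
    if counter_increase([s.sbe_vol for s in recent]) >= SBE_DRAIN_THRESHOLD:
        return "drain: SBE burst"
    return "ok"
```

Using the counter increase inside the window, rather than the raw counter value, keeps the check meaningful when the volatile counter resets partway through the 24 hours.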
How to monitor
DCGM exposes both volatile counters (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) that reset on driver reload and aggregate lifetime counters (DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, DCGM_FI_DEV_ECC_DBE_AGG_TOTAL) stored in InfoROM. Correlate SBE rate trends with row remapping and page retirement events. Factryze tracks both volatile and aggregate counters in real time and automatically drains GPUs that cross configurable degradation thresholds before they impact training jobs.
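As a rough illustration of collecting these four counters, the sketch below parses Prometheus-style text metrics into a per-GPU dictionary. It assumes an exporter (such as dcgm-exporter) has been configured to emit the ECC fields under their DCGM field names with a `gpu` label; that setup, the sample payload, and the function name `parse_ecc_metrics` are assumptions for illustration.

```python
import re

# The four DCGM ECC fields named in the text.
ECC_FIELDS = {
    "DCGM_FI_DEV_ECC_SBE_VOL_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
    "DCGM_FI_DEV_ECC_SBE_AGG_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_AGG_TOTAL",
}

# Simplified matcher for "name{labels} value" lines; it does not handle
# label values that themselves contain a closing brace.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical payload, made up for this example.
SAMPLE = """\
# TYPE DCGM_FI_DEV_ECC_SBE_VOL_TOTAL counter
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{gpu="0",UUID="GPU-aaa"} 12
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0",UUID="GPU-aaa"} 0
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaa"} 55
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL{gpu="1",UUID="GPU-bbb"} 340
"""


def parse_ecc_metrics(text):
    """Return {gpu_label: {field_name: count}} for the ECC fields only."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") not in ECC_FIELDS:
            continue  # ignore unrelated metrics
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        gpu = labels.get("gpu", "unknown")
        count = int(float(match.group("value")))
        result.setdefault(gpu, {})[match.group("name")] = count
    return result
```

Calling `parse_ecc_metrics(SAMPLE)` groups the counters by the `gpu` label, which makes it straightforward to compare volatile and aggregate values for the same device.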
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
Related terms
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.
Double-bit ECC errors that corrupt data and halt computation.
GPU firmware permanently disabling faulty memory pages after ECC errors.