ECC Errors (Error-Correcting Code)

Bit-flip errors in GPU memory, detected by hardware ECC, that signal memory degradation.

What it is

ECC errors are bit-flip faults in GPU HBM or SRAM detected by the hardware error-correcting code circuitry. Single-bit errors (SBE) are corrected silently in hardware, but an elevated SBE rate (typically more than 1 error per hour, sustained) signals accelerating memory cell degradation and impending page retirements. Double-bit errors (DBE) are uncorrectable: they corrupt data in flight, log Xid 48 to the kernel log (visible in dmesg), and force immediate termination of the affected GPU kernel.
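As a rough illustration of spotting the Xid 48 events mentioned above, the sketch below scans kernel log lines for NVIDIA Xid entries. The regular expression assumes the common "NVRM: Xid (PCI:...): <code>" line shape, and the text after the code varies by driver version. The helper name find_dbe_events and the sample lines are hypothetical, not taken from a real system.

```python
import re

# Assumed shape of an NVIDIA driver Xid line in the kernel log.
# Only the PCI address and the numeric Xid code are extracted.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)")

DBE_XID = 48  # Xid code the text associates with double-bit ECC errors


def find_dbe_events(log_lines):
    """Return the PCI addresses of GPUs that logged Xid 48 in log_lines."""
    hits = []
    for line in log_lines:
        match = XID_RE.search(line)
        if match and int(match.group("xid")) == DBE_XID:
            hits.append(match.group("pci"))
    return hits


# Illustrative sample lines (hypothetical, for demonstration only).
sample = [
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 48, pid=4242, double bit error detected",
    "[1240.1] NVRM: Xid (PCI:0000:5e:00): 13, pid=4300, Graphics Exception",
]

if __name__ == "__main__":
    print(find_dbe_events(sample))
```

In practice the input could be the output of dmesg split into lines. Matching on the numeric code rather than the message text keeps the check independent of driver wording.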

Why it matters

ECC error rates are the single most predictive signal for imminent GPU failure. A GPU showing 50 or more volatile SBEs within 24 hours is a strong candidate for proactive drain even before a DBE occurs, because SBE bursts precede uncorrectable failures in over 80% of observed cases. Any DBE immediately corrupts in-flight computation and halts the affected kernel.
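The drain rule described above (any DBE, or 50 or more volatile SBEs within 24 hours) can be sketched as a small check over periodic counter samples. This is a minimal illustration under stated assumptions, not how any particular tool implements it: the Sample type, the function names, and the reset handling are all hypothetical, and the threshold constant simply mirrors the figure quoted in the text.

```python
from dataclasses import dataclass

SBE_PER_DAY_LIMIT = 50       # threshold quoted in the text; tune per fleet
DAY_SECONDS = 24 * 3600


@dataclass
class Sample:
    timestamp: float   # seconds since epoch
    sbe_volatile: int  # volatile single-bit error counter reading
    dbe_volatile: int  # volatile double-bit error counter reading


def sbe_increase(samples, window_seconds):
    """Count new SBEs in the trailing window, tolerating counter resets."""
    if not samples:
        return 0
    cutoff = samples[-1].timestamp - window_seconds
    recent = [s for s in samples if s.timestamp >= cutoff]
    total = 0
    for prev, cur in zip(recent, recent[1:]):
        delta = cur.sbe_volatile - prev.sbe_volatile
        # A negative delta is assumed to mean the volatile counter was
        # reset (for example by a driver reload), so count from zero.
        total += delta if delta >= 0 else cur.sbe_volatile
    return total


def drain_reason(samples):
    """Return a short reason string if the GPU should be drained, else None."""
    if not samples:
        return None
    if samples[-1].dbe_volatile > 0:
        return "dbe"
    if sbe_increase(samples, DAY_SECONDS) >= SBE_PER_DAY_LIMIT:
        return "sbe_burst"
    return None
```

Samples are assumed to be sorted by timestamp. Summing per-interval increases, rather than subtracting the first reading from the last, keeps the count meaningful when the volatile counter resets partway through the window.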

How to monitor

DCGM exposes two sets of counters. Volatile counters (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) reset on driver reload. Aggregate lifetime counters (DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, DCGM_FI_DEV_ECC_DBE_AGG_TOTAL) are stored in the InfoROM and persist across reloads. Correlate SBE rate trends with row remapping and page retirement events. Factryze tracks both volatile and aggregate counters in real time and automatically drains GPUs that cross configurable degradation thresholds before they impact training jobs.
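One way to collect these four counters is to read them from a Prometheus-style text export. The sketch below assumes the metrics are published under their DCGM field names with a gpu label, and that the ECC fields have been enabled in the exporter's configuration. The parser, its name parse_ecc_counters, and the sample payload are illustrative assumptions rather than a documented interface.

```python
import re
from collections import defaultdict

ECC_FIELDS = {
    "DCGM_FI_DEV_ECC_SBE_VOL_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
    "DCGM_FI_DEV_ECC_SBE_AGG_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_AGG_TOTAL",
}

# metric_name{labels} value [optional timestamp]
LINE_RE = re.compile(
    r"^(?P<name>DCGM_FI_DEV_ECC_\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+\d+)?$"
)
GPU_RE = re.compile(r'(?:^|,)gpu="([^"]*)"')


def parse_ecc_counters(text):
    """Return {gpu_index: {field_name: count}} for the four ECC fields."""
    counters = defaultdict(dict)
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match or match.group("name") not in ECC_FIELDS:
            continue  # skips comments, blank lines, and unrelated metrics
        gpu = GPU_RE.search(match.group("labels"))
        if gpu is None:
            continue
        counters[gpu.group(1)][match.group("name")] = int(float(match.group("value")))
    return dict(counters)


# Hypothetical payload for demonstration; label values are made up.
SAMPLE = """\
# TYPE DCGM_FI_DEV_ECC_SBE_VOL_TOTAL counter
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{gpu="0",UUID="GPU-aaaa"} 3
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0",UUID="GPU-aaaa"} 0
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL{gpu="0",UUID="GPU-aaaa"} 117
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa"} 54
"""

if __name__ == "__main__":
    print(parse_ecc_counters(SAMPLE))
```

Keeping the volatile and aggregate readings side by side per GPU makes it straightforward to compare short-term bursts against lifetime totals when correlating with row remapping or page retirement events.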

DCGM Metric Field

DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DCGM_FI_DEV_ECC_DBE_VOL_TOTAL

Related terms

Xid Errors

NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.

Uncorrectable Errors (DBE)

Double-bit ECC errors that corrupt data and halt computation.

Page Retirement

GPU firmware permanently disabling faulty memory pages after ECC errors.

Monitor this automatically

Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
