ECC Errors (Error-Correcting Code)

Bit-flip errors in GPU memory, detected by hardware ECC, that can signal memory degradation.

What it is

ECC errors are bit-flip faults in GPU HBM or SRAM that are detected by the hardware error-correcting code circuitry. Single-bit errors (SBE) are corrected silently in hardware, but an elevated SBE rate (typically above 1 error per hour, sustained) signals accelerating memory cell degradation and upcoming page retirements. Double-bit errors (DBE) are uncorrectable: they corrupt data in flight, are logged as Xid 48 in dmesg, and force immediate termination of the affected GPU kernel.
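The "above 1 error per hour sustained" rule can be sketched as a rate check over periodic readings of the volatile SBE counter. This is a minimal illustration, not a reference implementation: the `Sample` type, the function names, and the reading of "sustained" as three consecutive intervals over the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp_s: float  # time of the reading, in seconds
    sbe_total: int      # volatile SBE counter value at that time


def sbe_rates_per_hour(samples):
    """Return the SBE rate (errors/hour) for each pair of consecutive samples."""
    rates = []
    for prev, curr in zip(samples, samples[1:]):
        elapsed_h = (curr.timestamp_s - prev.timestamp_s) / 3600.0
        if elapsed_h <= 0:
            continue  # skip duplicate or out-of-order readings
        delta = curr.sbe_total - prev.sbe_total
        if delta < 0:
            # The counter went backwards, so it was reset (for example by a
            # driver reload). Count only the errors seen since the reset.
            delta = curr.sbe_total
        rates.append(delta / elapsed_h)
    return rates


def is_sustained(rates, threshold_per_hour=1.0, min_intervals=3):
    """True if the most recent `min_intervals` rates all exceed the threshold.

    `min_intervals=3` is an assumed definition of "sustained".
    """
    if len(rates) < min_intervals:
        return False
    return all(r > threshold_per_hour for r in rates[-min_intervals:])
```

With hourly readings of 0, 2, 4, and 7 errors, every interval exceeds 1 error per hour, so `is_sustained` returns `True`. A single noisy interval does not trip the check.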

Why it matters

ECC error rates are the single most predictive signal for imminent GPU failure. A GPU that logs 50 or more volatile SBEs within 24 hours is a strong candidate for proactive drain even before a DBE occurs, because SBE bursts precede uncorrectable failures in over 80% of observed cases. Any DBE immediately corrupts in-flight computation and halts the affected kernel.
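The drain heuristic above (50 or more SBEs within 24 hours, or any DBE) could be expressed as a rolling-window check. The sketch below is only an illustration under stated assumptions: `DrainAdvisor`, its method names, and the returned labels are hypothetical, and it expects the caller to pass in newly observed error counts rather than raw counter values.

```python
from collections import deque


class DrainAdvisor:
    """Tracks SBEs in a rolling window and suggests when to drain a GPU."""

    def __init__(self, sbe_limit=50, window_s=24 * 3600):
        self.sbe_limit = sbe_limit   # SBE count that triggers a proactive drain
        self.window_s = window_s     # rolling window length in seconds
        self._events = deque()       # (timestamp_s, new_sbe_count) pairs
        self._sbe_in_window = 0

    def observe(self, timestamp_s, new_sbe=0, new_dbe=0):
        """Record newly seen errors and return a suggested action."""
        if new_sbe:
            self._events.append((timestamp_s, new_sbe))
            self._sbe_in_window += new_sbe
        # Evict events that have aged out of the window.
        cutoff = timestamp_s - self.window_s
        while self._events and self._events[0][0] <= cutoff:
            _, count = self._events.popleft()
            self._sbe_in_window -= count
        if new_dbe > 0:
            return "drain-now"        # any DBE is uncorrectable
        if self._sbe_in_window >= self.sbe_limit:
            return "drain-proactive"  # SBE burst within the window
        return "ok"
```

For example, 30 SBEs followed an hour later by 25 more puts 55 errors inside the window and yields `"drain-proactive"`, while a single DBE yields `"drain-now"` regardless of the SBE count.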

How to monitor

DCGM exposes two sets of counters: volatile counters (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL), which reset on driver reload, and aggregate lifetime counters (DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, DCGM_FI_DEV_ECC_DBE_AGG_TOTAL), which are stored in the InfoROM. Correlate SBE rate trends with row remapping and page retirement events. Factryze tracks both volatile and aggregate counters in real time and automatically drains GPUs that cross configurable degradation thresholds before they impact training jobs.
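As a rough sketch of how these four fields might be collected, the snippet below parses Prometheus-style metric text and groups the ECC counters by GPU. It assumes the metrics are exported under the DCGM field names with a `gpu` label; that label name and the `SAMPLE` text (made-up values) are assumptions for illustration, and `parse_ecc_counters` is a hypothetical helper.

```python
import re

ECC_FIELDS = {
    "DCGM_FI_DEV_ECC_SBE_VOL_TOTAL",  # volatile, resets on driver reload
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
    "DCGM_FI_DEV_ECC_SBE_AGG_TOTAL",  # aggregate lifetime, stored in InfoROM
    "DCGM_FI_DEV_ECC_DBE_AGG_TOTAL",
}

# name{labels} value [optional timestamp]
_LINE = re.compile(
    r"^(?P<name>DCGM_FI_DEV_ECC_[A-Z_]+)\{(?P<labels>[^}]*)\}"
    r"\s+(?P<value>\S+)(?:\s+\d+)?$"
)
_GPU = re.compile(r'gpu="([^"]*)"')  # assumed label name


def parse_ecc_counters(text):
    """Return {gpu_id: {field_name: count}} for the four ECC total fields."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _LINE.match(line)
        if not match or match.group("name") not in ECC_FIELDS:
            continue
        gpu = _GPU.search(match.group("labels"))
        if gpu is None:
            continue
        count = int(float(match.group("value")))
        result.setdefault(gpu.group(1), {})[match.group("name")] = count
    return result


# Illustrative input only; the values are invented for this example.
SAMPLE = """
# TYPE DCGM_FI_DEV_ECC_SBE_VOL_TOTAL counter
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{gpu="0"} 3
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0"} 0
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL{gpu="0"} 120
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL{gpu="0"} 1
"""

counters = parse_ecc_counters(SAMPLE)
```

Keeping the volatile and aggregate values side by side per GPU makes it easier to tell a fresh burst since the last driver load apart from a long lifetime history of errors.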

Figure: ECC - What Happens When a GPU Memory Bit Flips
DCGM Metric Field
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
