Uncorrectable Errors (DBE)
Double-bit ECC errors that corrupt data and halt computation.
What it is
Uncorrectable errors, or double-bit errors (DBEs), are errors in GPU memory that the hardware ECC mechanism detects but cannot correct. Unlike single-bit errors, which are corrected transparently, a DBE corrupts the affected data and triggers three things: an Xid 48 event in dmesg, a sticky CUDA error surfaced as cudaErrorECCUncorrectable, and immediate termination of the faulting kernel.
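As a rough illustration of spotting the Xid 48 event in kernel log output, the sketch below scans log lines for the driver's Xid prefix and keeps only code 48. The regular expression and the sample lines are assumptions made for illustration; the exact wording of Xid messages varies between driver versions, so check it against real dmesg output before relying on it.

```python
import re

# Prefix of an NVIDIA driver Xid line. The optional "PCI:" is an assumption
# meant to tolerate differences between driver versions.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")

DBE_XID = 48  # Xid code associated with a double-bit ECC error


def find_dbe_events(log_lines):
    """Return the PCI bus IDs of GPUs that logged Xid 48, in log order."""
    hits = []
    for line in log_lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) == DBE_XID:
            hits.append(match.group(1))
    return hits


# Illustrative sample lines, not captured from a real machine.
sample = [
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 48, pid=4242, An uncorrectable double bit error (DBE) has been detected",
    "[1240.1] NVRM: Xid (PCI:0000:5e:00): 13, pid=4242, Graphics Exception",
    "[1300.9] usb 1-1: new high-speed USB device",
]
print(find_dbe_events(sample))  # ['0000:3b:00']
```

In practice the lines would come from dmesg or the system journal rather than a hard-coded list.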
Why it matters
Any GPU that accumulates uncorrectable errors should be drained and replaced, because recurring DBEs usually indicate a memory cell defect that tends to worsen over time. The sticky CUDA error poisons the entire CUDA context, so all subsequent CUDA calls in that context fail, even if they do not touch the faulty memory region. A single DBE during a training step can waste all the GPU-hours spent since the last checkpoint.
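To put a number on the checkpoint cost, here is a small back-of-the-envelope sketch. It assumes a synchronous job in which every GPU loses the same amount of progress when one GPU hits a DBE; the function name and the job size are hypothetical.

```python
def wasted_gpu_hours(num_gpus, minutes_since_checkpoint):
    """GPU-hours of work discarded when a job rolls back to its last checkpoint."""
    if num_gpus < 0 or minutes_since_checkpoint < 0:
        raise ValueError("inputs must be non-negative")
    return num_gpus * minutes_since_checkpoint / 60.0


# Hypothetical job: 512 GPUs, DBE hits 45 minutes after the last checkpoint.
print(wasted_gpu_hours(512, 45))  # 384.0
```

The loss scales linearly with both job size and time since the last checkpoint, which is why shorter checkpoint intervals limit the damage from a single DBE.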
How to monitor
Track DCGM_FI_DEV_ECC_DBE_VOL_TOTAL for volatile counts (which reset on driver reload) and DCGM_FI_DEV_ECC_DBE_AGG_TOTAL for lifetime aggregate counts stored in the InfoROM. Cross-reference with Xid 48 events in dmesg to confirm a hardware origin. Factryze alerts on any nonzero DBE count and automatically initiates drain and page retirement workflows.
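A minimal sketch of the "alert on any nonzero DBE count" rule follows. It assumes the two counters arrive as Prometheus-style text lines, such as those a DCGM metrics exporter might emit, and that each sample carries a gpu label; both the input format and the label name are assumptions, and this is not a description of how Factryze implements its alerting.

```python
import re

# Field names taken from the text above.
DBE_FIELDS = ("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", "DCGM_FI_DEV_ECC_DBE_AGG_TOTAL")

# One sample per line: NAME{label="value",...} NUMBER
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)\s*$')
GPU_LABEL_RE = re.compile(r'gpu="([^"]*)"')  # assumed label name


def gpus_with_dbe(metrics_text):
    """Map GPU label -> {field: count} for every nonzero DBE counter."""
    flagged = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or match.group(1) not in DBE_FIELDS:
            continue  # comments, blank lines, and unrelated metrics
        value = float(match.group(3))
        gpu = GPU_LABEL_RE.search(match.group(2))
        if value > 0 and gpu:
            flagged.setdefault(gpu.group(1), {})[match.group(1)] = int(value)
    return flagged


# Illustrative input, not real exporter output.
sample_metrics = """
# TYPE DCGM_FI_DEV_ECC_DBE_VOL_TOTAL counter
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0"} 0
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="1"} 2
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL{gpu="1"} 5
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 61
"""
print(gpus_with_dbe(sample_metrics))
```

Any GPU that appears in the returned mapping is a candidate for draining; a nonzero aggregate count with a zero volatile count suggests the error occurred before the most recent driver reload.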
Related terms
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
GPU firmware permanently disabling faulty memory pages after ECC errors.
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.