Errors & Failures

Uncorrectable Errors (DBE)

Double-bit ECC errors that corrupt data and halt computation.

What it is

Uncorrectable errors are double-bit errors (DBEs) in GPU memory that the hardware ECC mechanism can detect but cannot correct. Unlike single-bit errors, which ECC corrects transparently, a DBE corrupts the affected data. It triggers an Xid 48 event in dmesg, raises a sticky CUDA error surfaced as cudaErrorECCUncorrectable, and immediately terminates the faulting kernel.
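To illustrate why one flipped bit can be repaired while two cannot, here is a toy sketch of a SECDED (single-error-correct, double-error-detect) scheme using an extended Hamming(8,4) code over 4 data bits. It is an illustration only: this page does not describe the GPU's actual ECC layout, real hardware is assumed to use a wider code over larger words, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = reduce(xor, code[1:])        # overall parity enables DBE detection
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error is uncorrectable."""
    code = list(code)
    # XOR of the positions of all set bits; zero for a valid codeword.
    syndrome = reduce(xor, (pos for pos in range(1, 8) if code[pos]), 0)
    overall = reduce(xor, code)
    if overall == 1:
        # Odd number of flips, treated as a single-bit error: the syndrome
        # names the flipped position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even number of flips with a nonzero syndrome: a double-bit error.
        # The code knows something is wrong but not which bits to repair.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code: List[int], *positions: int) -> List[int]:
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out
```

Calling `decode(flip(encode(9), 5))` recovers the original data with a "corrected" status, while flipping any two positions yields "uncorrectable" with no data, which is the toy analogue of a DBE.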

Why it matters

Any GPU accumulating uncorrectable errors should be drained and replaced, because DBEs indicate a memory cell defect that will worsen over time. The sticky CUDA error poisons the entire CUDA context: every subsequent CUDA call in that context fails, even if it never touches the faulty memory region. A single DBE during a training step can waste all the GPU-hours spent since the last checkpoint.
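The checkpoint cost can be made concrete with a small back-of-the-envelope sketch. It assumes a synchronous training job in which every GPU must roll back to the last checkpoint when one of them fails; the function name and the figures in the usage note are hypothetical.

```python
def wasted_gpu_hours(num_gpus: int, minutes_since_checkpoint: float) -> float:
    """GPU-hours of work discarded when a job rolls back to its last checkpoint.

    Assumes all GPUs in the job lose the same amount of progress.
    """
    if num_gpus < 0 or minutes_since_checkpoint < 0:
        raise ValueError("inputs must be non-negative")
    return num_gpus * minutes_since_checkpoint / 60.0
```

For example, a hypothetical 512-GPU job hit by a DBE 45 minutes after its last checkpoint would discard `wasted_gpu_hours(512, 45)` = 384 GPU-hours, not counting restart time.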

How to monitor

Track DCGM_FI_DEV_ECC_DBE_VOL_TOTAL for volatile counts, which reset on driver reload, and DCGM_FI_DEV_ECC_DBE_AGG_TOTAL for lifetime aggregate counts stored in the InfoROM. Cross-reference nonzero counts with Xid 48 events in dmesg to confirm a hardware origin. Factryze alerts on any nonzero DBE count and initiates drain and page retirement workflows automatically.
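As a rough sketch of the two checks described above, the following assumes the DCGM fields are exported in Prometheus text format with a `gpu` label (as dcgm-exporter does) and that kernel log lines follow the usual `NVRM: Xid (<device>): <code>` shape. The sample data and function names are illustrative, not taken from a real system, and this is not a description of how Factryze implements its alerts.

```python
import re

DBE_FIELDS = ("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", "DCGM_FI_DEV_ECC_DBE_AGG_TOTAL")

# One Prometheus text-format sample: NAME{labels} VALUE
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_GPU_LABEL = re.compile(r'(?:^|,)\s*gpu="([^"]*)"')
_XID = re.compile(r'NVRM: Xid \(([^)]+)\): (\d+)')


def gpus_with_dbe(metrics_text: str) -> dict:
    """Return {gpu_label: {field: count}} for GPUs with a nonzero DBE counter."""
    flagged = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match or match["name"] not in DBE_FIELDS:
            continue  # skips comments, blank lines, and unrelated metrics
        count = float(match["value"])
        if count > 0:
            label = _GPU_LABEL.search(match["labels"])
            gpu = label.group(1) if label else "unknown"
            flagged.setdefault(gpu, {})[match["name"]] = int(count)
    return flagged


def xid48_devices(dmesg_text: str) -> list:
    """Return the sorted device identifiers that logged an Xid 48 event."""
    return sorted({dev for dev, xid in _XID.findall(dmesg_text) if xid == "48"})


# Illustrative inputs only (hypothetical UUIDs, PCI addresses, and counts).
SAMPLE_METRICS = """\
# TYPE DCGM_FI_DEV_ECC_DBE_VOL_TOTAL counter
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0",UUID="GPU-aaaa"} 0
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="1",UUID="GPU-bbbb"} 2
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL{gpu="1",UUID="GPU-bbbb"} 5
"""
SAMPLE_DMESG = """\
[ 1201.11] NVRM: Xid (PCI:0000:3b:00): 48, pid=4242, illustrative DBE message
[ 1305.72] NVRM: Xid (PCI:0000:5e:00): 79, pid=4243, illustrative other event
"""

if __name__ == "__main__":
    print(gpus_with_dbe(SAMPLE_METRICS))
    print(xid48_devices(SAMPLE_DMESG))
```

Mapping a PCI address from dmesg back to a DCGM GPU index is left out here; in practice that join would need the device inventory for the node.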

Figure: Uncorrectable Errors - Double-Bit Error Cascade
DCGM Metric Field: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
