Row Remapping
Dynamic HBM repair mechanism replacing faulty memory rows on the fly.
What it is
Row remapping is a hardware-level memory repair mechanism, introduced with the NVIDIA Ampere architecture and carried forward in later ones, that dynamically substitutes faulty HBM rows with spare rows instead of retiring entire memory pages. Each GPU has a limited bank of spare rows; once that bank is exhausted, the GPU falls back to traditional page retirement.
Why it matters
A rising row remapping rate is an early indicator of memory degradation that precedes page retirement and, eventually, double-bit error (DBE) failures. When the spare row bank is exhausted, the GPU can no longer absorb new faults and page retirement accelerates. Catching spare-row exhaustion early allows proactive GPU replacement before uncorrectable errors corrupt training data.
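One way to make the "remapping rate" concrete is to take the slope of the cumulative remapped-row count over time. The following is a minimal sketch, assuming timestamped counts are already being collected somewhere; the function name `remap_rate_per_day`, the sample format, and the example data are hypothetical illustrations, not part of DCGM or Factryze.

```python
from datetime import datetime, timedelta


def remap_rate_per_day(samples):
    """Least-squares slope of cumulative remapped rows over time, in rows/day.

    `samples` is a list of (datetime, cumulative_remapped_rows) pairs
    (a hypothetical format). Returns 0.0 when no slope can be fitted.
    """
    if len(samples) < 2:
        return 0.0
    t0 = samples[0][0]
    # Convert timestamps to days elapsed since the first sample.
    xs = [(t - t0).total_seconds() / 86400.0 for t, _ in samples]
    ys = [float(count) for _, count in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # All samples share one timestamp, so the slope is undefined.
        return 0.0
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov_xy / var_x


# Illustrative data only: one additional remapped row per day.
start = datetime(2024, 1, 1)
history = [(start + timedelta(days=d), rows)
           for d, rows in [(0, 0), (1, 1), (2, 2), (3, 3)]]
rate = remap_rate_per_day(history)
```

A fitted slope is less sensitive to a single noisy sample than a simple first-to-last difference, though any alerting threshold on it would need to be tuned per fleet.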
How to monitor
Track DCGM_FI_DEV_ROW_REMAP_FAILURE for exhaustion events and DCGM_FI_DEV_ROW_REMAP_PENDING for remappings that have been recorded but are not yet active. Correlate both with the growth rate of DCGM_FI_DEV_RETIRED_SBE to assess the overall HBM degradation trajectory. Factryze monitors remapping velocity alongside ECC trends to flag GPUs on a degradation curve before they reach page retirement thresholds.
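The two row-remap fields can be folded into a simple triage rule. This is a minimal sketch, assuming the field values have already been read (for example from a metrics exporter) into a dict keyed by field name and that each behaves as a 0/1 flag. The function name, the dict format, and the action strings are hypothetical, not a DCGM or Factryze API.

```python
def triage_row_remap(fields):
    """Map row-remap field values to a coarse action string.

    `fields` is a dict keyed by DCGM field name (hypothetical format).
    Missing fields are treated as 0, which is an assumption of this sketch.
    """
    failure = int(fields.get("DCGM_FI_DEV_ROW_REMAP_FAILURE", 0))
    pending = int(fields.get("DCGM_FI_DEV_ROW_REMAP_PENDING", 0))
    if failure:
        # A remap could not be completed: treat the GPU as a replacement candidate.
        return "replace"
    if pending:
        # A remap is recorded but not yet active: schedule a GPU reset.
        return "reset"
    return "ok"


# Illustrative readings only.
healthy = triage_row_remap({})
needs_reset = triage_row_remap({"DCGM_FI_DEV_ROW_REMAP_PENDING": 1})
needs_replace = triage_row_remap({
    "DCGM_FI_DEV_ROW_REMAP_FAILURE": 1,
    "DCGM_FI_DEV_ROW_REMAP_PENDING": 1,
})
```

The failure flag is checked first so that a GPU with both flags set is not merely reset and returned to service.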
DCGM_FI_DEV_ROW_REMAP_FAILURE / DCGM_FI_DEV_ROW_REMAP_PENDING
Related terms
GPU firmware permanently disabling faulty memory pages after ECC errors.
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
Double-bit ECC errors that corrupt data and halt computation.