Errors & Failures

Row Remapping

Dynamic HBM repair mechanism replacing faulty memory rows on the fly.

What it is

Row remapping is a hardware-level memory repair mechanism, introduced with the NVIDIA Ampere architecture, that dynamically substitutes faulty HBM rows with spare rows instead of retiring entire memory pages. Each memory bank has a limited pool of spare rows; once a bank's spares are exhausted, further faults in that bank cannot be remapped and the GPU reports a row remap failure.
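To make the spare-row idea concrete, here is a toy model in Python. It is only an illustration of the bookkeeping, not NVIDIA's implementation: the class name MemoryBank, the method report_fault, and the pool size of 2 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Toy model of one memory bank with a fixed pool of spare rows."""
    spare_rows: int  # illustrative pool size, not a real hardware figure
    remap_table: dict = field(default_factory=dict)  # faulty row -> spare index

    def report_fault(self, row: int) -> bool:
        """Try to remap a faulty row; return False when no spare is left."""
        if row in self.remap_table:
            return True  # already remapped, nothing more to do
        if len(self.remap_table) >= self.spare_rows:
            return False  # pool exhausted: this models a remap failure
        self.remap_table[row] = len(self.remap_table)
        return True


bank = MemoryBank(spare_rows=2)
results = [bank.report_fault(r) for r in (17, 42, 17, 99)]
print(results)  # [True, True, True, False]
```

The fourth fault fails because both spares are already in use, which is the exhaustion condition the monitoring fields described later are meant to surface.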

Why it matters

A rising row remapping rate is an early indicator of memory degradation that tends to precede uncorrectable double-bit ECC errors (DBEs). Once the spare rows are exhausted, the GPU can no longer absorb new faults, and a reported remap failure generally means the GPU should be scheduled for replacement. Catching the approach to exhaustion early allows proactive GPU replacement before uncorrectable errors corrupt training data.
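One simple way to quantify "increasing rate" is to compare the recent remap rate with the long-run rate. The sketch below assumes you already collect cumulative remapped-row counts per GPU; the function name, the sample history, and the 1.5x acceleration factor are illustrative choices, not recommended thresholds.

```python
DAY = 86400  # seconds per day


def remap_rate_per_day(samples):
    """Average remaps per day between the first and last sample.

    samples: list of (unix_seconds, cumulative_remapped_rows), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (c1 - c0) / ((t1 - t0) / DAY)


# Made-up history: 1 remap at day 0, 2 by day 7, 8 by day 14.
history = [(0, 1), (7 * DAY, 2), (14 * DAY, 8)]

overall = remap_rate_per_day(history)      # 7 remaps over 14 days
recent = remap_rate_per_day(history[-2:])  # 6 remaps over the last 7 days

# Flag the GPU when the recent rate clearly outpaces the long-run rate.
accelerating = recent > 1.5 * overall
print(overall, recent, accelerating)
```

A two-point slope is deliberately crude; with noisier data a rolling window or a fitted trend would be a more robust choice.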

How to monitor

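Track DCGM_FI_DEV_ROW_REMAP_FAILURE for exhaustion events and DCGM_FI_DEV_ROW_REMAP_PENDING for remappings that have been recorded but not yet activated (a pending remap takes effect after the next GPU reset). Correlate these with the growth of DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS and DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, or with DCGM_FI_DEV_RETIRED_SBE on pre-Ampere GPUs that still use page retirement, to assess the overall HBM degradation trajectory. Factryze monitors remapping velocity alongside ECC trends to flag GPUs on a degradation curve before they reach page retirement thresholds.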
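As a rough sketch of what a check on these two fields could look like, the Python below scans metrics in Prometheus text format and reports any GPU with a nonzero failure or pending value. It assumes the fields are exported in that format with a gpu label (for example by dcgm-exporter, if these fields are enabled in its counter list). The sample text, the function name remap_alerts, and the alert wording are made up for illustration.

```python
import re

# Matches lines such as: DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="0",...} 1
LINE = re.compile(
    r'^(DCGM_FI_DEV_ROW_REMAP_(?:FAILURE|PENDING))\{([^}]*)\}\s+(\S+)$'
)
GPU_LABEL = re.compile(r'gpu="([^"]*)"')


def remap_alerts(metrics_text):
    """Return (gpu, message) pairs for nonzero remap failure/pending values."""
    alerts = []
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments and unrelated metrics are ignored
        name, labels, value = match.groups()
        if float(value) <= 0:
            continue
        found = GPU_LABEL.search(labels)
        gpu = found.group(1) if found else "unknown"
        if name.endswith("FAILURE"):
            alerts.append((gpu, "remap failure: spare rows exhausted"))
        else:
            alerts.append((gpu, "remap pending: activates after GPU reset"))
    return alerts


# Illustrative sample only; real exporter output carries more labels.
SAMPLE = """\
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="0",UUID="GPU-aaaa"} 0
DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="1",UUID="GPU-bbbb"} 1
DCGM_FI_DEV_ROW_REMAP_PENDING{gpu="0",UUID="GPU-aaaa"} 1
"""

alerts = remap_alerts(SAMPLE)
for gpu, message in alerts:
    print(f"GPU {gpu}: {message}")
```

In practice the same logic is usually expressed as alerting rules in the monitoring system itself; the script form is just easier to read and test in isolation.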

Figure: Row Remapping - GPU Memory Self-Healing by Remapping Bad Rows
DCGM Metric Field
DCGM_FI_DEV_ROW_REMAP_FAILURE / DCGM_FI_DEV_ROW_REMAP_PENDING
