Retired Pages
Cumulative count of permanently disabled GPU memory pages, recorded in InfoROM.
What it is
Retired pages is the cumulative count of GPU memory pages permanently removed from the allocatable pool, either because of a double-bit ECC error (DBE, tracked via DCGM_FI_DEV_RETIRED_DBE) or because of excessive single-bit ECC error (SBE) accumulation (tracked via DCGM_FI_DEV_RETIRED_SBE). The count is persisted in InfoROM, survives reboots and driver reloads, and only ever increases: pages cannot be un-retired. NVIDIA's replacement guidance recommends RMA when the total exceeds approximately 60 pages.
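As a rough illustration of the threshold check described above, the sketch below sums the two counters and compares the total against the replacement threshold. The class and function names are hypothetical, the counter values are assumed to have been read from DCGM already, and the value 60 is simply the approximate figure quoted above rather than an exact vendor constant.

```python
from dataclasses import dataclass

# Assumption: the approximate replacement threshold quoted in the text.
RMA_THRESHOLD_PAGES = 60


@dataclass(frozen=True)
class RetiredPageCounts:
    """Hypothetical container for one GPU's retired-page counters."""
    sbe: int  # value read from DCGM_FI_DEV_RETIRED_SBE
    dbe: int  # value read from DCGM_FI_DEV_RETIRED_DBE

    @property
    def total(self) -> int:
        # Total retired pages is the sum of both retirement causes.
        return self.sbe + self.dbe


def exceeds_rma_threshold(counts: RetiredPageCounts,
                          threshold: int = RMA_THRESHOLD_PAGES) -> bool:
    """True when the total retired-page count is above the threshold."""
    return counts.total > threshold
```

A GPU with 38 SBE-retired and 2 DBE-retired pages totals 40 and stays below the threshold in this sketch; whether it is healthy still depends on how quickly those pages accumulated.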
Why it matters
Retirement velocity matters more than absolute count: a GPU retiring 5+ pages in a single week is on a steep degradation curve and will likely reach the replacement threshold within weeks, while one that accumulated 40 pages over 18 months may remain stable. An H100 showing DCGM_FI_DEV_RETIRED_DBE jumping from 2 to 8 within 48 hours has likely experienced a cluster of uncorrectable failures concentrated in one HBM stack and is likely to keep retiring pages at an accelerating rate. Pages pending retirement (DCGM_FI_DEV_RETIRED_PENDING) remain in use, exposing workloads to known-faulty memory until a GPU reset makes the retirement take effect.
How to monitor
Track DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE, and DCGM_FI_DEV_RETIRED_PENDING continuously. Compute retirement rate (pages per day) and alert on acceleration. Factryze tracks counts and velocity fleet-wide, drains GPUs with accelerating retirement rates, and schedules GPU resets at job boundaries to activate pending retirements rather than leaving workloads on known-faulty pages.
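The rate and acceleration calculation can be sketched as follows. This is a minimal illustration, not Factryze's implementation: it assumes the cumulative count (SBE plus DBE) has already been sampled into a time-sorted list, and the function names and the 7-day comparison window are hypothetical choices.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (sample time, cumulative retired pages = SBE + DBE); assumed sorted by time.
Sample = Tuple[datetime, int]


def count_at(samples: List[Sample], when: datetime) -> int:
    """Latest cumulative count observed at or before `when`.

    If history starts after `when`, the earliest sample is used as the
    baseline so that pre-existing pages are not counted as new.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    count = samples[0][1]
    for ts, value in samples:
        if ts > when:
            break
        count = value
    return count


def retirement_rate(samples: List[Sample], end: datetime,
                    window: timedelta) -> float:
    """Pages retired per day over the window ending at `end`."""
    retired = count_at(samples, end) - count_at(samples, end - window)
    return retired / (window.total_seconds() / 86400.0)


def is_accelerating(samples: List[Sample], now: datetime,
                    window: timedelta = timedelta(days=7)) -> bool:
    """True when the latest window retired faster than the one before it."""
    recent = retirement_rate(samples, now, window)
    previous = retirement_rate(samples, now - window, window)
    return recent > previous
```

For the case described earlier, a count going from 2 to 8 over 48 hours gives a rate of 3.0 pages per day over a two-day window. Comparing two adjacent windows is only one simple way to detect acceleration; a production alert would probably also require a minimum rate so that a single retirement after a long quiet period does not trigger a drain.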
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
Related terms
GPU firmware permanently disabling faulty memory pages after ECC errors.
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
Dynamic HBM repair mechanism replacing faulty memory rows on the fly.