Monitoring Metrics

Retired Pages

Cumulative count of permanently disabled GPU memory pages, as recorded in InfoROM.

What it is

Retired pages is the cumulative count of GPU memory pages permanently removed from the allocatable pool after a double-bit ECC error (DBE, tracked via DCGM_FI_DEV_RETIRED_DBE) or repeated single-bit ECC errors (SBE, tracked via DCGM_FI_DEV_RETIRED_SBE). The count is persisted in InfoROM, survives reboots and driver reloads, and only ever increases: pages cannot be un-retired. NVIDIA's replacement guidance recommends RMA when the total exceeds approximately 60 pages.
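As a rough illustration of the threshold check described above, the sketch below sums the two counters and compares the total against the approximate 60-page figure. The names `RetiredPages`, `RMA_PAGE_THRESHOLD`, and `needs_replacement_review` are hypothetical, and treating "reaches or exceeds" as the trigger is an assumption; confirm the exact criteria against NVIDIA's current guidance.

```python
from dataclasses import dataclass

# Assumption: the approximate threshold quoted in the text, not an official constant.
RMA_PAGE_THRESHOLD = 60


@dataclass(frozen=True)
class RetiredPages:
    """Snapshot of the two cumulative counters for one GPU (hypothetical container)."""

    sbe: int  # value read from DCGM_FI_DEV_RETIRED_SBE
    dbe: int  # value read from DCGM_FI_DEV_RETIRED_DBE

    @property
    def total(self) -> int:
        # Total retired pages is the sum of both retirement causes.
        return self.sbe + self.dbe


def needs_replacement_review(pages: RetiredPages,
                             threshold: int = RMA_PAGE_THRESHOLD) -> bool:
    """Return True once the total reaches the (assumed) replacement threshold."""
    return pages.total >= threshold


if __name__ == "__main__":
    snapshot = RetiredPages(sbe=12, dbe=3)
    print(snapshot.total, needs_replacement_review(snapshot))
```

Because the counters never decrease, a check like this only ever flips from False to True for a given GPU.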

Why it matters

Retirement velocity matters more than absolute count: a GPU retiring 5+ pages in a single week is on a steep degradation curve and will likely reach the replacement threshold within weeks, while one that accumulated 40 pages over 18 months may remain stable. A GPU whose DCGM_FI_DEV_RETIRED_DBE jumps from 2 to 8 within 48 hours has experienced a cluster of uncorrectable errors, possibly concentrated in one HBM stack, and may well continue retiring pages at an accelerating rate. Note that A100, H100, and later data center GPUs use row remapping in place of page retirement, so these counters apply to earlier architectures such as Volta. Pages pending retirement (DCGM_FI_DEV_RETIRED_PENDING) remain allocatable, so workloads stay exposed to known-faulty memory until a GPU reset makes the retirement take effect.
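The velocity argument can be made concrete with a simple projection. The sketch below assumes a constant retirement rate (a linear extrapolation, which is optimistic if the rate is accelerating) and the approximate 60-page threshold; `weeks_to_threshold` is a hypothetical helper, not part of DCGM.

```python
import math


def weeks_to_threshold(current_pages: int,
                       pages_per_week: float,
                       threshold: int = 60) -> float:
    """Estimate weeks until the total reaches `threshold` at a constant rate.

    Assumption: the rate stays constant. An accelerating GPU will get
    there sooner than this estimate suggests.
    """
    if current_pages >= threshold:
        return 0.0
    if pages_per_week <= 0:
        return math.inf  # no retirements observed, so no projected date
    return (threshold - current_pages) / pages_per_week


if __name__ == "__main__":
    # Fast case from the text: 8 pages so far, retiring 5 per week.
    print(weeks_to_threshold(8, 5.0))
    # Slow case from the text: 40 pages over roughly 78 weeks (18 months).
    print(weeks_to_threshold(40, 40 / 78))
```

Under these assumptions the fast case reaches the threshold in about ten weeks, while the slow case, despite its higher absolute count, has roughly 39 weeks of headroom.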

How to monitor

Track DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE, and DCGM_FI_DEV_RETIRED_PENDING continuously. Compute retirement rate (pages per day) and alert on acceleration. Factryze tracks counts and velocity fleet-wide, drains GPUs with accelerating retirement rates, and schedules GPU resets at job boundaries to activate pending retirements rather than leaving workloads on known-faulty pages.
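A minimal sketch of the rate and acceleration checks described above, assuming the counter values have already been collected as timestamped samples (for example by a DCGM-based exporter). The function names, the half-window comparison, and the sample format are illustrative assumptions rather than Factryze's or DCGM's actual method.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, total retired pages = SBE + DBE counters)
Sample = Tuple[float, int]

SECONDS_PER_DAY = 86400.0


def rate_per_day(samples: Sequence[Sample]) -> float:
    """Average retirement rate (pages/day) between the first and last sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) / ((t1 - t0) / SECONDS_PER_DAY)


def is_accelerating(samples: Sequence[Sample], min_increase: float = 0.0) -> bool:
    """Compare the rate over the recent half of the window with the older half.

    Assumption: samples are sorted by time with strictly increasing timestamps.
    """
    if len(samples) < 3:
        return False  # not enough points to compare two intervals
    mid = len(samples) // 2
    older = rate_per_day(samples[: mid + 1])
    recent = rate_per_day(samples[mid:])
    return recent > older + min_increase


def should_schedule_reset(pending: int) -> bool:
    """A nonzero DCGM_FI_DEV_RETIRED_PENDING value means a reset is still needed."""
    return pending > 0


if __name__ == "__main__":
    day = SECONDS_PER_DAY
    history = [(0 * day, 2), (1 * day, 2), (2 * day, 2), (3 * day, 5), (4 * day, 8)]
    print(rate_per_day(history), is_accelerating(history), should_schedule_reset(1))
```

Splitting the window in half is only one way to detect acceleration; a production system would likely smooth over longer windows and tune `min_increase` to avoid alerting on a single isolated retirement.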

Figure: Retired Pages - GPU Memory Page Retirement Monitoring
DCGM Metric Field
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
