
Page Retirement

GPU firmware permanently disabling faulty memory pages after ECC errors.

What it is

Page retirement is a GPU firmware mechanism that permanently removes faulty memory pages from the allocatable pool so that recurring ECC errors cannot corrupt workload data. A page is retired for single-bit errors (SBE) when it accumulates multiple correctable errors over its lifetime, or immediately upon any double-bit error (DBE) on that page. A new retirement is recorded as pending and does not take effect until a GPU reset or node reboot; workloads running during this window remain exposed to the faulty page.
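
The lifecycle described above can be pictured as a small state model. The sketch below is illustrative only and is not how the firmware is implemented: the class name `RetirementModel`, its methods, and the threshold of two correctable errors per page are assumptions chosen for the example.

```python
from dataclasses import dataclass, field

# Assumption: "multiple" correctable errors is modeled as two per page.
SBE_THRESHOLD = 2


@dataclass
class RetirementModel:
    """Toy model of page retirement: errors -> pending -> retired after reset."""
    sbe_counts: dict = field(default_factory=dict)  # page -> correctable errors seen
    pending: dict = field(default_factory=dict)     # page -> cause, awaiting a reset
    retired: dict = field(default_factory=dict)     # page -> cause, retirement active

    def record_error(self, page: int, kind: str) -> None:
        """Record one ECC error ("SBE" or "DBE") on a page."""
        if page in self.pending or page in self.retired:
            return  # already scheduled for retirement or retired
        if kind == "DBE":
            # Any uncorrectable error schedules the page for retirement at once.
            self.pending[page] = "DBE"
        elif kind == "SBE":
            self.sbe_counts[page] = self.sbe_counts.get(page, 0) + 1
            if self.sbe_counts[page] >= SBE_THRESHOLD:
                self.pending[page] = "SBE"
        else:
            raise ValueError(f"unknown error kind: {kind!r}")

    def reset(self) -> None:
        """A GPU reset or node reboot activates every pending retirement."""
        self.retired.update(self.pending)
        self.pending.clear()

    def is_allocatable(self, page: int) -> bool:
        """A pending page is still allocatable, which is the exposure window."""
        return page not in self.retired
```

In this model a page with a pending retirement still reports as allocatable until `reset()` is called, which mirrors the exposure window between an error and the next reset.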

Why it matters

The rate of page retirement matters more than the absolute count: a GPU that retires 10 pages in a single day is far more concerning than one that has retired 40 pages over two years. NVIDIA's general guidance recommends replacing a GPU once its total retired page count exceeds 60, but a GPU that retires 5 pages within an hour of a new DBE is showing signs of a rapidly failing HBM stack and should be drained immediately. Pending retirements that go unnoticed leave workloads exposed to known-faulty memory until the next reset.
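
One way to express the rate-versus-count reasoning is a small classifier over retirement timestamps. This is a minimal sketch: the 60-page threshold comes from the guidance quoted above, while the burst rule (5 pages within one hour), the function name `classify`, and the returned labels are hypothetical values picked to match the examples in the text.

```python
from datetime import datetime, timedelta
from typing import Sequence

REPLACEMENT_THRESHOLD = 60          # total retired pages, per the guidance above
BURST_PAGES = 5                     # assumption: pages that constitute a burst
BURST_WINDOW = timedelta(hours=1)   # assumption: window used to detect a burst


def classify(events: Sequence[datetime], now: datetime) -> str:
    """Return "drain", "replace", or "ok" for one GPU.

    `events` holds one timestamp per retired page. Velocity is checked
    first because a burst is more urgent than a slowly reached total.
    """
    recent = [t for t in events if now - BURST_WINDOW <= t <= now]
    if len(recent) >= BURST_PAGES:
        return "drain"
    if len(events) > REPLACEMENT_THRESHOLD:
        return "replace"
    return "ok"


now = datetime(2024, 1, 1, 12, 0)
# 40 pages spread over roughly two years: slow and below the threshold.
slow = [now - timedelta(days=18 * i) for i in range(1, 41)]
# 5 pages inside the last hour: a burst.
burst = [now - timedelta(minutes=10 * i) for i in range(5)]

print(classify(slow, now))   # ok
print(classify(burst, now))  # drain
```

Checking the burst before the total means a GPU with few lifetime retirements is still flagged if they all arrive together.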

How to monitor

Track DCGM_FI_DEV_RETIRED_SBE and DCGM_FI_DEV_RETIRED_DBE, whose counts persist in InfoROM across reboots, alongside DCGM_FI_DEV_RETIRED_PENDING, which reveals retirements that have not yet been activated. Monitor retirement velocity, not just absolute count. Factryze tracks both counts and velocity, automatically flags GPUs with accelerating rates, schedules resets at job boundaries to activate pending retirements, and initiates proactive drain and RMA workflows.
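
As a rough illustration of consuming these three fields, the sketch below parses Prometheus-style text such as an exporter might emit and lists GPUs with pending retirements. It assumes the fields are exported under their DCGM names with a `gpu` label; the sample text, UUIDs, and the helper names `parse_retirement_metrics` and `gpus_needing_reset` are invented for the example.

```python
import re
from collections import defaultdict

FIELDS = (
    "DCGM_FI_DEV_RETIRED_SBE",
    "DCGM_FI_DEV_RETIRED_DBE",
    "DCGM_FI_DEV_RETIRED_PENDING",
)

# Matches lines of the form: metric_name{label="v",...} value
LINE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
# Assumption: each sample carries a gpu="<index>" label.
GPU_RE = re.compile(r'(?:^|,)gpu="([^"]*)"')


def parse_retirement_metrics(text: str) -> dict:
    """Return {gpu_index: {field_name: int_value}} for the retirement fields."""
    result = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = LINE_RE.match(line)
        if not m or m.group("name") not in FIELDS:
            continue
        gpu = GPU_RE.search(m.group("labels"))
        if gpu is None:
            continue
        result[gpu.group(1)][m.group("name")] = int(float(m.group("value")))
    return dict(result)


def gpus_needing_reset(metrics: dict) -> list:
    """GPUs whose pending field is non-zero, i.e. retirements not yet active."""
    return sorted(
        g for g, v in metrics.items()
        if v.get("DCGM_FI_DEV_RETIRED_PENDING", 0) > 0
    )


# Made-up sample data for illustration only.
SAMPLE = '''
# TYPE DCGM_FI_DEV_RETIRED_SBE counter
DCGM_FI_DEV_RETIRED_SBE{gpu="0",UUID="GPU-aaaa"} 3
DCGM_FI_DEV_RETIRED_DBE{gpu="0",UUID="GPU-aaaa"} 1
DCGM_FI_DEV_RETIRED_PENDING{gpu="0",UUID="GPU-aaaa"} 1
DCGM_FI_DEV_RETIRED_SBE{gpu="1",UUID="GPU-bbbb"} 0
DCGM_FI_DEV_RETIRED_DBE{gpu="1",UUID="GPU-bbbb"} 0
DCGM_FI_DEV_RETIRED_PENDING{gpu="1",UUID="GPU-bbbb"} 0
'''

metrics = parse_retirement_metrics(SAMPLE)
print(gpus_needing_reset(metrics))  # ['0']
```

Sampling these values on a schedule and differencing successive SBE and DBE counts would give the retirement velocity discussed above.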

Figure: Page Retirement - GPU Memory Pages Disabled Over Time
DCGM Metric Field
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
