DCGM (Data Center GPU Manager)
NVIDIA's GPU management toolkit exposing health metrics via field IDs.
What it is
DCGM (Data Center GPU Manager) is NVIDIA's suite for managing and monitoring GPUs in cluster environments. It provides health monitoring, active diagnostics, policy-based governance, and system validation through C and Python APIs, the dcgmi CLI, and the dcgm-exporter Prometheus endpoint. It exposes over 200 GPU metric field IDs covering utilization (DCGM_FI_DEV_GPU_UTIL), temperature (DCGM_FI_DEV_GPU_TEMP), power (DCGM_FI_DEV_POWER_USAGE), ECC errors, NVLink bandwidth, PCIe throughput, and clock frequencies. DCGM diagnostic run levels start at Level 1 (roughly 30 seconds, driver state validation) and extend to Level 3 (12+ minutes, exhaustive memory and compute testing); recent DCGM releases also offer a longer Level 4 run. Actual run times depend on the GPU model and the DCGM version.
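To make the field-ID model concrete, here is a minimal Python sketch that groups dcgm-exporter samples by GPU UUID. It assumes the exporter's Prometheus text output carries a UUID label on each sample; the sample text, the UUID values, and the function name parse_dcgm_metrics are hypothetical placeholders, not part of DCGM itself.

```python
import re
from collections import defaultdict

# One Prometheus sample line: metric name, optional {labels}, then a value.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# A single label pair such as UUID="GPU-aaaa".
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical exporter output, shortened for illustration.
SAMPLE_TEXT = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa",device="nvidia0"} 45
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa",device="nvidia0"} 212.5
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb",device="nvidia1"} 97
"""


def parse_dcgm_metrics(text):
    """Return {gpu_uuid: {field_name: value}} for DCGM_FI_* samples."""
    per_gpu = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match or not match.group("name").startswith("DCGM_FI_"):
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        uuid = labels.get("UUID")
        if uuid is None:
            continue  # sample is not tied to a specific GPU
        per_gpu[uuid][match.group("name")] = float(match.group("value"))
    return dict(per_gpu)


metrics = parse_dcgm_metrics(SAMPLE_TEXT)
```

Keying on the UUID rather than the GPU index is a deliberate choice in this sketch, since indices can change across reboots while the UUID stays tied to the physical device.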
Why it matters
DCGM field collection failures or stale timestamps are a critical signal: when nv-hostengine crashes or a GPU stops responding to management queries, telemetry silently stops even though the GPU may still appear to be operating. When DCGM_FI_DEV_GPU_TEMP stops updating on one GPU while its peers continue reporting, an Xid 79 (GPU has fallen off the bus) event often follows within minutes. Without DCGM, most GPU failure signals stay invisible until they cause application-level crashes.
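The stale-timestamp signal can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DCGM functionality: the function name find_stale_gpus, the 60-second threshold, and the "gpu" versus "host" classification are all hypothetical choices. The classification is a heuristic only: if some peers are still fresh, the problem is probably confined to individual GPUs; if every GPU on the host is stale, the collector (for example nv-hostengine) is the more likely culprit.

```python
def find_stale_gpus(last_update, now, max_age_s=60.0):
    """Flag GPUs whose most recent sample is older than max_age_s.

    last_update: {gpu_uuid: unix timestamp of the latest sample}
    now:         current unix timestamp
    Returns (stale_uuids, scope), where scope is "none", "gpu", or "host".
    """
    stale = sorted(
        uuid for uuid, ts in last_update.items() if now - ts > max_age_s
    )
    fresh_count = len(last_update) - len(stale)
    if not stale:
        scope = "none"
    elif fresh_count > 0:
        scope = "gpu"   # peers still report, so suspect the stale GPUs
    else:
        scope = "host"  # nothing reports, so suspect the collector
    return stale, scope
```

In practice the timestamps would come from whatever stores the telemetry, and the threshold should be a small multiple of the collection interval so that a single missed scrape does not trigger an alert.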
How to monitor
Deploy dcgm-exporter to expose DCGM field values to Prometheus, and alert whenever collection timestamps go stale for any GPU UUID. Run Level 1 diagnostics (dcgmi diag -r 1) in Slurm prolog scripts so that obviously broken GPUs are caught before the next job starts. Factryze consumes DCGM telemetry as its primary data source, enriching raw field values with fleet-wide statistical baselines and temporal trend analysis for autonomous anomaly detection.
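A prolog-style check around dcgmi diag -r 1 might look like the Python sketch below. It assumes that dcgmi is on the PATH and that a non-zero exit status indicates a failed diagnostic, which is worth verifying against the DCGM version in use. The function name run_level1_diag, the injectable run parameter, and the 120-second timeout are hypothetical.

```python
import subprocess


def run_level1_diag(run=subprocess.run, timeout_s=120):
    """Run `dcgmi diag -r 1` and return (healthy, detail).

    `run` is injectable so the check can be exercised without a GPU.
    """
    try:
        result = run(
            ["dcgmi", "diag", "-r", "1"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except FileNotFoundError:
        return False, "dcgmi not found on PATH"
    except subprocess.TimeoutExpired:
        return False, f"dcgmi diag did not finish within {timeout_s}s"
    # Assumption: a non-zero exit status means the diagnostic failed.
    if result.returncode != 0:
        return False, (result.stdout or "") + (result.stderr or "")
    return True, result.stdout or ""
```

A prolog wrapper could call run_level1_diag() and exit with a non-zero status when it returns False. Slurm treats a failing prolog as a node problem, draining the node and requeueing the job, so the broken GPU is kept away from the next workload.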
Related terms
Continuous tracking of GPU health, thermals, errors, and performance metrics.
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
Percentage of time GPU streaming multiprocessors are actively executing kernels.