Monitoring Metrics
DCGM (Data Center GPU Manager) exposes hundreds of telemetry fields, but effective GPU monitoring comes down to tracking the right metrics with the right thresholds. GPU utilization, memory bandwidth, temperature, power draw, and clock frequencies form the core health signals that every operations team should monitor continuously. Anomaly patterns in these metrics — such as a sudden clock frequency drop indicating thermal throttling, or GPU utilization falling to zero while memory remains allocated signaling a hung kernel — are often the earliest indicators of developing hardware or software issues. This section covers each essential monitoring metric with its DCGM field ID, normal operating ranges, alerting thresholds, and the correlation patterns that Factryze uses to distinguish between transient fluctuations and genuine degradation requiring intervention.
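As a rough illustration of the two anomaly patterns mentioned above, the sketch below checks a short window of samples for zero utilization with memory still allocated and for a sudden SM clock drop under load. The `Sample` container, the function name, and every threshold are hypothetical values chosen for the example. They are not DCGM defaults and not Factryze's actual correlation logic, and a match is only a hint: an idle process that simply holds memory would trigger the first check too.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One polling interval for a single GPU (hypothetical container)."""
    gpu_util_pct: float   # e.g. read from DCGM_FI_DEV_GPU_UTIL
    fb_used_mib: float    # e.g. read from DCGM_FI_DEV_FB_USED
    sm_clock_mhz: float   # e.g. read from DCGM_FI_DEV_SM_CLOCK


def find_anomalies(
    samples: List[Sample],
    idle_window: int = 3,            # assumed: consecutive samples required
    min_fb_used_mib: float = 1024.0, # assumed: "memory remains allocated"
    clock_drop_ratio: float = 0.8,   # assumed: 20% below recent baseline
    busy_util_pct: float = 50.0,     # assumed: GPU counts as "under load"
) -> List[str]:
    findings: List[str] = []

    # Pattern 1: utilization stuck at zero while framebuffer memory is held.
    if len(samples) >= idle_window:
        tail = samples[-idle_window:]
        if all(s.gpu_util_pct == 0 and s.fb_used_mib >= min_fb_used_mib for s in tail):
            findings.append("possible_hung_kernel")

    # Pattern 2: latest SM clock far below the average of earlier samples
    # while the GPU is still busy (an idle GPU lowers its clocks normally).
    if len(samples) >= 2:
        history, latest = samples[:-1], samples[-1]
        baseline = sum(s.sm_clock_mhz for s in history) / len(history)
        if (
            baseline > 0
            and latest.gpu_util_pct >= busy_util_pct
            and latest.sm_clock_mhz < clock_drop_ratio * baseline
        ):
            findings.append("sudden_clock_drop")

    return findings
```

Requiring several consecutive samples for the first pattern is one simple way to separate a transient dip from a sustained condition; the window length would need tuning to the polling interval in use.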
DCGM (Data Center GPU Manager)
NVIDIA's GPU management toolkit exposing health metrics via field IDs.
Fan Speed
GPU or chassis cooling fan speed as a percentage of maximum RPM.
DCGM_FI_DEV_FAN_SPEED
GPU Monitoring
Continuous tracking of GPU health, thermals, errors, and performance metrics.
GPU Utilization
Percentage of time over the sample period during which at least one kernel was executing on the GPU.
DCGM_FI_DEV_GPU_UTIL
Memory Clock
GPU HBM/GDDR memory frequency in MHz that determines memory bandwidth.
DCGM_FI_DEV_MEM_CLOCK
Memory Utilization
Percentage of GPU framebuffer memory allocated by active workloads, derived from the used and free framebuffer sizes.
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE
PCIe Bandwidth
Measured GPU-to-host data transfer rate over PCI Express in GB/s.
DCGM_FI_DEV_PCIE_TX_THROUGHPUT / DCGM_FI_DEV_PCIE_RX_THROUGHPUT
Power Capping
Limiting GPU power draw below TDP to control thermals and rack density.
DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_ENFORCED_POWER_LIMIT
Retired Pages
Cumulative count of GPU memory pages permanently retired after single-bit or double-bit ECC errors, as recorded in the InfoROM.
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
SM Clock (Streaming Multiprocessor Clock)
GPU core compute clock frequency in MHz, scaling between base and boost.
DCGM_FI_DEV_SM_CLOCK
TDP (Thermal Design Power)
Maximum sustained GPU power dissipation rating, measured in watts.
DCGM_FI_DEV_ENFORCED_POWER_LIMIT
Thermal Throttling
Automatic GPU clock reduction when die temperature exceeds its safe limit, typically 83-90 °C.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS
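
The throttle-reasons field is reported as a bitmask, so a raw value usually needs decoding before it can drive an alert. The sketch below is one way to do that. The bit values mirror the NVML clocks-throttle-reason constants as commonly documented and should be checked against the headers shipped with the installed DCGM version; the short reason names and both function names are made up for this example.

```python
from typing import List

# Bit values assumed to match the NVML clocks-throttle-reason constants;
# verify against the installed DCGM/NVML headers. Names are illustrative.
THROTTLE_REASON_BITS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",            # may be thermal or power related
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# Bits that point specifically at temperature as the cause.
THERMAL_BITS = 0x020 | 0x040


def decode_throttle_reasons(mask: int) -> List[str]:
    """Return the names of all reason bits set in the mask, lowest bit first."""
    return [name for bit, name in THROTTLE_REASON_BITS.items() if mask & bit]


def is_thermal_throttling(mask: int) -> bool:
    """True when a software or hardware thermal slowdown bit is set."""
    return bool(mask & THERMAL_BITS)


if __name__ == "__main__":
    sample_mask = 0x044  # hypothetical reading: power cap + HW thermal slowdown
    print(decode_throttle_reasons(sample_mask))
    print(is_thermal_throttling(sample_mask))
```

Separating thermal bits from power-cap bits matters for alerting: a software power cap is often an intentional configuration, while a thermal slowdown usually points at cooling.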