12 terms

Monitoring Metrics

DCGM (NVIDIA Data Center GPU Manager) exposes hundreds of telemetry fields, but effective GPU monitoring comes down to tracking the right metrics with the right thresholds. GPU utilization, memory bandwidth, temperature, power draw, and clock frequencies form the core health signals that every operations team should monitor continuously. Anomaly patterns in these metrics are often the earliest indicators of developing hardware or software issues: a sudden clock frequency drop can indicate thermal throttling, and GPU utilization falling to zero while memory remains allocated can signal a hung kernel. This section covers each essential monitoring metric with its DCGM field ID, normal operating ranges, alerting thresholds, and the correlation patterns that Factryze uses to distinguish transient fluctuations from genuine degradation that requires intervention.
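To make the two anomaly patterns above concrete, here is a minimal Python sketch of how they might be checked against a window of samples. It is an illustration only, not Factryze's actual correlation logic. The `GpuSample` type, the `detect_anomalies` function, and every threshold value are hypothetical and would need tuning per GPU model and workload. The samples are assumed to be populated from DCGM fields such as `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_SM_CLOCK`, and `DCGM_FI_DEV_FB_USED` by whatever collector is in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    """One polling interval for a single GPU (hypothetical container type)."""
    gpu_util_pct: float   # assumed source: DCGM_FI_DEV_GPU_UTIL
    sm_clock_mhz: float   # assumed source: DCGM_FI_DEV_SM_CLOCK
    fb_used_mib: float    # assumed source: DCGM_FI_DEV_FB_USED


# Hypothetical thresholds chosen for illustration only.
CLOCK_DROP_FRACTION = 0.20   # flag clocks more than 20% below the baseline
IDLE_UTIL_PCT = 1.0          # utilization at or below this counts as idle
ALLOCATED_MIB = 1024.0       # framebuffer use at or above this counts as allocated
MIN_CONSECUTIVE = 3          # samples that must agree before flagging


def detect_anomalies(samples: List[GpuSample]) -> List[str]:
    """Return anomaly labels for the most recent samples in the window.

    Requiring MIN_CONSECUTIVE agreeing samples is a simple way to ignore
    one-off fluctuations and flag only sustained conditions.
    """
    findings: List[str] = []
    if len(samples) < MIN_CONSECUTIVE + 1:
        return findings  # not enough history to form a baseline

    baseline = samples[:-MIN_CONSECUTIVE]
    recent = samples[-MIN_CONSECUTIVE:]

    # Pattern 1: sustained SM clock drop relative to the earlier baseline,
    # which may point to thermal (or power) throttling.
    baseline_clock = sum(s.sm_clock_mhz for s in baseline) / len(baseline)
    clock_floor = baseline_clock * (1.0 - CLOCK_DROP_FRACTION)
    if all(s.sm_clock_mhz < clock_floor for s in recent):
        findings.append("clock_drop")

    # Pattern 2: utilization near zero while memory stays allocated,
    # which may point to a hung kernel or stalled process.
    if all(
        s.gpu_util_pct <= IDLE_UTIL_PCT and s.fb_used_mib >= ALLOCATED_MIB
        for s in recent
    ):
        findings.append("idle_with_memory_allocated")

    return findings
```

A window in which only the latest sample dips produces no findings, while three consecutive low-clock samples after a stable baseline produce `clock_drop`. A production check would likely also correlate temperature and power draw before attributing a clock drop to thermal throttling specifically.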