12 terms

Monitoring Metrics

DCGM (Data Center GPU Manager) exposes hundreds of telemetry fields, but effective GPU monitoring comes down to tracking the right metrics with the right thresholds. GPU utilization, memory bandwidth, temperature, power draw, and clock frequencies form the core health signals that every operations team should monitor continuously. Anomaly patterns in these metrics are often the earliest indicators of developing hardware or software issues: a sudden clock frequency drop can indicate thermal throttling, and GPU utilization falling to zero while memory remains allocated can signal a hung kernel or stalled process. This section covers each essential monitoring metric with its DCGM field ID, normal operating range, alerting thresholds, and the correlation patterns that Factryze uses to distinguish transient fluctuations from genuine degradation that requires intervention.
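The two anomaly patterns mentioned above can be illustrated with a small Python sketch. This is not Factryze's correlation logic or a DCGM API: the `GpuSample` type, the `detect_anomalies` function, and every threshold (the 0.8 clock ratio, 83 °C, three consecutive samples) are hypothetical values chosen for illustration. The sketch assumes samples have already been collected by some means, for example from DCGM, and only shows the idea of requiring a condition to persist before treating it as more than a transient fluctuation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    """One polling interval for a single GPU (hypothetical structure)."""
    gpu_util_pct: float   # GPU utilization, percent
    sm_clock_mhz: float   # SM clock, MHz
    temp_c: float         # GPU temperature, degrees Celsius
    fb_used_mib: float    # framebuffer memory in use, MiB


def _sustained(flags: List[bool], min_consecutive: int) -> bool:
    """Return True if `flags` contains a run of at least `min_consecutive` True values."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= min_consecutive:
            return True
    return False


def detect_anomalies(
    samples: List[GpuSample],
    baseline_sm_clock_mhz: float,
    clock_drop_ratio: float = 0.8,   # assumed threshold, tune per GPU model
    hot_temp_c: float = 83.0,        # assumed threshold, tune per GPU model
    min_consecutive: int = 3,        # samples a condition must persist
) -> List[str]:
    """Flag patterns that persist long enough to be more than a transient blip."""
    # Clock well below baseline while the GPU is hot: possible thermal throttling.
    throttled = [
        s.sm_clock_mhz < clock_drop_ratio * baseline_sm_clock_mhz
        and s.temp_c >= hot_temp_c
        for s in samples
    ]
    # Zero utilization while memory is still allocated: possible hung or stalled job.
    idle_with_memory = [
        s.gpu_util_pct == 0 and s.fb_used_mib > 0 for s in samples
    ]

    findings = []
    if _sustained(throttled, min_consecutive):
        findings.append("possible_thermal_throttling")
    if _sustained(idle_with_memory, min_consecutive):
        findings.append("possible_hung_job")
    return findings
```

Requiring several consecutive matching samples is one simple way to separate a brief dip from sustained degradation; a production system would likely also correlate power draw, throttle reasons, and error counters before alerting.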

DCGM (Data Center GPU Manager)

Fan Speed

GPU Monitoring

GPU Utilization

Memory Clock

Memory Utilization

PCIe Bandwidth

Power Capping

Retired Pages

SM Clock (Streaming Multiprocessor Clock)

TDP (Thermal Design Power)

Thermal Throttling