GPU Monitoring
Continuous tracking of GPU health, thermals, errors, and performance metrics.
What it is
GPU monitoring is the continuous collection, aggregation, and analysis of GPU health telemetry to maintain cluster reliability and maximize uptime. Key signals include temperature (DCGM_FI_DEV_GPU_TEMP), utilization (DCGM_FI_DEV_GPU_UTIL), ECC error counts, power draw (DCGM_FI_DEV_POWER_USAGE), and PCIe/NVLink throughput. Traditional monitoring stacks built on Prometheus and Grafana rely on static thresholds; autonomous monitoring adds AI-driven anomaly detection and automated remediation across hundreds of DCGM field IDs simultaneously.
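As a minimal illustration of what these signals look like in practice, the sketch below scrapes a dcgm-exporter metrics endpoint and prints the raw samples for a few of the field IDs named above. The endpoint URL (the common default of port 9400) and the set of enabled fields are assumptions that depend on your deployment.

    import urllib.request

    # Assumed dcgm-exporter endpoint; adjust host/port for your deployment.
    METRICS_URL = "http://localhost:9400/metrics"
    FIELDS = (
        "DCGM_FI_DEV_GPU_TEMP",      # die temperature (C)
        "DCGM_FI_DEV_GPU_UTIL",      # GPU utilization (%)
        "DCGM_FI_DEV_POWER_USAGE",   # power draw (W)
    )

    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        text = resp.read().decode()

    # Prometheus text format: one "<name>{labels} <value>" sample per line;
    # filtering on the metric name skips the # HELP / # TYPE comment lines.
    for line in text.splitlines():
        if line.startswith(FIELDS):
            print(line)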
Why it matters
A GPU reporting 95% utilization with SM clocks stuck at 1200 MHz instead of the expected 1980 MHz boost is delivering roughly 60% of its expected throughput, with no obvious alert. A single GPU in an 8-GPU node running 5°C hotter than its peers is often a precursor to thermal throttling or hardware failure within days; static thresholds miss these patterns entirely. Undetected GPU degradation silently caps training throughput for every job running on the affected device.
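To make the peer-comparison idea concrete, here is a minimal sketch that flags a GPU running hot relative to its node peers. The sample temperatures and the 5°C threshold are illustrative assumptions, not Factryze's detection logic.

    from statistics import median

    # Illustrative per-GPU die temperatures (C) for one 8-GPU node.
    node_temps = {0: 62, 1: 61, 2: 63, 3: 68, 4: 62, 5: 61, 6: 62, 7: 63}

    PEER_DELTA_C = 5  # divergence threshold; tune per fleet

    baseline = median(node_temps.values())
    for gpu, temp in node_temps.items():
        if temp - baseline >= PEER_DELTA_C:
            print(f"GPU {gpu}: {temp}C is {temp - baseline}C above the node "
                  f"median ({baseline}C); possible early thermal degradation")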
How to monitor
Deploy DCGM alongside dcgm-exporter to expose DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_POWER_USAGE, and ECC counters to Prometheus. Add peer-comparison alerting to catch intra-node divergence. Factryze's NOC Agent correlates DCGM metrics, Xid kernel events, and fabric manager logs into a unified health model that detects anomalies before they impact training jobs.
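For teams on a plain Prometheus stack, a peer-comparison check can be sketched against the Prometheus HTTP query API as below. The Prometheus address, the Hostname and gpu label names, and the 5-degree threshold are assumptions that depend on how dcgm-exporter is configured.

    import requests

    PROM_URL = "http://prometheus.example.internal:9090"  # assumed Prometheus address

    # Flag GPUs running more than 5C above the average of their node peers.
    # Label names (Hostname, gpu) follow common dcgm-exporter defaults; adjust to your setup.
    QUERY = (
        "DCGM_FI_DEV_GPU_TEMP "
        "- on(Hostname) group_left() "
        "avg by(Hostname)(DCGM_FI_DEV_GPU_TEMP) > 5"
    )

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()

    # Each result is one diverging GPU with its temperature delta from the node average.
    for series in resp.json()["data"]["result"]:
        labels = series["metric"]
        delta = float(series["value"][1])
        print(f"{labels.get('Hostname')} GPU {labels.get('gpu')}: "
              f"{delta:.1f}C above node average")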
Related terms
DCGM: NVIDIA's GPU management toolkit exposing health metrics via field IDs.
GPU Utilization: Percentage of time GPU streaming multiprocessors are actively executing kernels.
Thermal Throttling: Automatic GPU clock reduction when die temperature exceeds 83-90°C safe limits.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.