Thermal Throttling
Automatic GPU clock reduction when die temperature exceeds safe limits, typically in the 83-90C range.
What it is
Thermal throttling is the automatic reduction of GPU SM and memory clock speeds by GPU firmware when die temperature exceeds safe limits -- typically 83C for the HW slowdown threshold and 90-92C for the shutdown threshold on A100 and H100. The throttle is progressive: a GPU at 85C may reduce SM clocks by 100-200 MHz (5-10% performance loss), while one approaching 90C can drop clocks by 500+ MHz (30-40% throughput reduction). DCGM exposes the throttle reason bitmask via DCGM_FI_DEV_CLOCK_THROTTLE_REASONS.
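As a minimal sketch of how that bitmask can be read, the Python below decodes a raw value into named reasons. The bit values are assumed to follow NVML's nvmlClocksThrottleReason* constants, which DCGM is understood to pass through unchanged. The ThrottleReason enum, decode_reasons, and is_thermal are illustrative names rather than part of DCGM, so check the values against your driver's headers before relying on them.

```python
from enum import IntFlag


class ThrottleReason(IntFlag):
    """Assumed bit layout, mirroring NVML's nvmlClocksThrottleReason* values."""
    GPU_IDLE = 0x1
    APP_CLOCKS_SETTING = 0x2
    SW_POWER_CAP = 0x4
    HW_SLOWDOWN = 0x8
    SYNC_BOOST = 0x10
    SW_THERMAL = 0x20
    HW_THERMAL = 0x40
    HW_POWER_BRAKE = 0x80
    DISPLAY_CLOCKS = 0x100


# Bits treated as heat-related in this sketch. HW_SLOWDOWN can also be
# raised for non-thermal reasons, so treat it as a hint, not proof.
THERMAL_MASK = (
    ThrottleReason.HW_SLOWDOWN
    | ThrottleReason.SW_THERMAL
    | ThrottleReason.HW_THERMAL
)


def decode_reasons(mask: int) -> list:
    """Return the names of all reason bits set in a raw bitmask."""
    return [reason.name for reason in ThrottleReason if mask & reason.value]


def is_thermal(mask: int) -> bool:
    """True when at least one heat-related bit is set."""
    return bool(mask & int(THERMAL_MASK))
```

Under these assumptions, a raw value of 0x48 decodes to HW_SLOWDOWN plus HW_THERMAL, while 0x4 (software power cap only) is not classed as thermal.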
Why it matters
Thermal throttling silently reduces throughput without raising obvious errors -- a throttled GPU still shows high utilization while delivering 30-40% less compute than peers. Intermittent throttling correlated with time-of-day or ambient temperature changes indicates marginal cooling capacity where CRAC units cannot maintain setpoint during peak load. An 8-GPU node where GPUs 4-7 consistently run 3-5C hotter than GPUs 0-3 indicates airflow asymmetry that no software tuning can resolve.
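The airflow-asymmetry pattern can be checked with very little code. The sketch below assumes per-GPU temperatures arrive as a list ordered by GPU index, and that the node splits into a lower-index half and an upper-index half, as in the GPUs 0-3 versus 4-7 example. The function names, the 3C threshold, and the 90% persistence fraction are illustrative choices, not values taken from any tool.

```python
from statistics import mean
from typing import Sequence


def half_node_gap(temps_c: Sequence[float]) -> float:
    """Mean temperature of the upper-index half minus the lower-index half."""
    half = len(temps_c) // 2
    return mean(temps_c[half:]) - mean(temps_c[:half])


def persistent_asymmetry(
    samples: Sequence[Sequence[float]],
    threshold_c: float = 3.0,    # assumed gap that counts as "hotter"
    min_fraction: float = 0.9,   # assumed share of samples that must agree
) -> bool:
    """True when the half-node gap meets the threshold in most samples.

    Each entry in `samples` is one reading of all GPUs in the node.
    """
    if not samples:
        return False
    hits = sum(1 for s in samples if abs(half_node_gap(s)) >= threshold_c)
    return hits / len(samples) >= min_fraction
```

Requiring the gap to persist across many samples is a deliberate choice here: a single hot reading can come from uneven workload placement, whereas a gap that holds over time points at the chassis.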
How to monitor
Track DCGM_FI_DEV_CLOCK_THROTTLE_REASONS for thermal-related bits (0x8 HW Slowdown, 0x20 SW Thermal Slowdown, 0x40 HW Thermal Slowdown) rather than any nonzero value, since the bitmask also reports benign states such as GPU idle. Correlate those bits with DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_FAN_SPEED. Compare per-GPU temperatures within the same node to detect chassis-level airflow asymmetry. Factryze correlates temperature, fan speed, and power draw to distinguish individual GPU thermal faults from systemic cooling issues, triggering automated power capping while alerting facilities.
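One way to separate an individual GPU fault from a systemic cooling issue is to count how many GPUs in a node are thermally throttled at the same moment. The sketch below is a hypothetical heuristic, not Factryze's implementation. GpuSample, classify_node, and the 50% cutoff are assumptions made for illustration, and the bit values are assumed to match NVML's throttle-reason constants.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed NVML-style throttle-reason bits related to heat.
HW_SLOWDOWN = 0x8
SW_THERMAL = 0x20
HW_THERMAL = 0x40
THERMAL_BITS = HW_SLOWDOWN | SW_THERMAL | HW_THERMAL


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical record shape)."""
    gpu_id: int
    temp_c: float
    throttle_mask: int


def classify_node(
    samples: Sequence[GpuSample],
    systemic_fraction: float = 0.5,  # assumed cutoff, tune per fleet
) -> str:
    """Return "ok", "individual", or "systemic" for one node snapshot."""
    throttled = [s for s in samples if s.throttle_mask & THERMAL_BITS]
    if not throttled:
        return "ok"
    if len(throttled) / len(samples) >= systemic_fraction:
        # Many GPUs hot at once suggests cooling or airflow, not one card.
        return "systemic"
    return "individual"
```

A node where one of eight GPUs reports HW Thermal Slowdown would be labelled "individual" under this cutoff, which suggests inspecting that card's heatsink or seating. A node where most GPUs throttle together would be labelled "systemic" and is more likely a facilities issue.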
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS
Related terms
Maximum sustained GPU power dissipation rating, measured in watts.
Limiting GPU power draw below TDP to control thermals and rack density.
Continuous tracking of GPU health, thermals, errors, and performance metrics.