Thermal Throttling

Automatic GPU clock reduction when die temperature exceeds safe limits, typically in the 83-90 °C range.

What it is

Thermal throttling is the automatic reduction of GPU SM and memory clock speeds by GPU firmware when die temperature exceeds safe limits: typically around 83 °C for the HW slowdown threshold and 90-92 °C for the shutdown threshold on A100 and H100. Exact thresholds vary by SKU and can be read per device with nvidia-smi -q -d TEMPERATURE. The throttle is progressive: a GPU at 85 °C may reduce SM clocks by 100-200 MHz (roughly 5-10% performance loss), while one approaching 90 °C can drop clocks by 500+ MHz (roughly 30-40% throughput reduction). DCGM exposes the throttle reason bitmask via DCGM_FI_DEV_CLOCK_THROTTLE_REASONS.
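As a rough worked example of the figures above, the sketch below converts a clock drop into a percentage loss. It assumes throughput scales linearly with SM clock, which is only a first approximation and applies to compute-bound kernels. The baseline clocks (1410 MHz for A100, 1980 MHz for H100 SXM) are assumed maximum SM clocks for illustration, and clock_loss_percent is a hypothetical helper name.

```python
def clock_loss_percent(max_sm_clock_mhz: float, drop_mhz: float) -> float:
    """Percentage of SM clock lost, used as a proxy for compute-bound throughput loss."""
    if max_sm_clock_mhz <= 0:
        raise ValueError("max_sm_clock_mhz must be positive")
    if not 0 <= drop_mhz <= max_sm_clock_mhz:
        raise ValueError("drop_mhz must be between 0 and max_sm_clock_mhz")
    return 100.0 * drop_mhz / max_sm_clock_mhz


# Assumption: maximum SM clocks used as baselines for this example only.
BASELINE_MHZ = {"A100": 1410, "H100 SXM": 1980}

for gpu, max_clock in BASELINE_MHZ.items():
    for drop in (100, 200, 500):
        loss = clock_loss_percent(max_clock, drop)
        print(f"{gpu}: {drop} MHz drop -> {loss:.1f}% of max SM clock")
```

The same absolute drop costs a larger share on a GPU with a lower baseline clock, so the percentage ranges quoted above are best read as approximate.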

Why it matters

Thermal throttling silently reduces throughput without raising obvious errors. A throttled GPU still shows high utilization, because utilization reports the fraction of time kernels are running rather than how fast they run, while delivering 30-40% less compute than its peers. Intermittent throttling that correlates with time of day or ambient temperature changes indicates marginal cooling capacity, where CRAC units cannot maintain their setpoint during peak load. An 8-GPU node where GPUs 4-7 consistently run 3-5 °C hotter than GPUs 0-3 indicates airflow asymmetry that no software tuning can resolve.
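The half-node comparison described above can be sketched as a simple check over per-GPU temperatures. This is a minimal illustration, not a definitive detector: it assumes GPU indices 0-3 and 4-7 map to the two physical halves of the chassis (verify against the server's topology), the 3 °C threshold is taken from the low end of the range above, and the function names and sample readings are hypothetical.

```python
from statistics import mean

ASYMMETRY_THRESHOLD_C = 3.0  # assumption: low end of the 3-5 C gap described above


def half_node_delta_c(temps_c):
    """Mean temperature of the upper-half GPUs minus the lower-half GPUs, in degrees C.

    temps_c is indexed by GPU ID, e.g. eight DCGM_FI_DEV_GPU_TEMP readings.
    """
    if len(temps_c) < 2 or len(temps_c) % 2:
        raise ValueError("expected an even number of GPUs (at least 2)")
    half = len(temps_c) // 2
    return mean(temps_c[half:]) - mean(temps_c[:half])


def has_airflow_asymmetry(temps_c, threshold_c=ASYMMETRY_THRESHOLD_C):
    """True when one half of the node runs hotter than the other by threshold_c or more."""
    return abs(half_node_delta_c(temps_c)) >= threshold_c


# Illustrative readings only, not measured data.
sample = [68, 69, 67, 70, 72, 73, 72, 74]
print(half_node_delta_c(sample), has_airflow_asymmetry(sample))
```

In practice the check is more meaningful on temperatures averaged over a sustained window, since the claim above is about GPUs that run hotter consistently rather than in a single sample.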

How to monitor

Track DCGM_FI_DEV_CLOCK_THROTTLE_REASONS for thermal bits in the bitmask (0x8 HW Slowdown, 0x20 SW Thermal Slowdown, 0x40 HW Thermal Slowdown, following the NVML throttle-reason constants) and correlate with DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_FAN_SPEED. A nonzero value alone is not sufficient evidence, because non-thermal reasons such as GPU Idle (0x1) and SW Power Cap (0x4) share the same bitmask. Compare per-GPU temperatures within the same node to detect chassis-level airflow asymmetry. Factryze correlates temperature, fan speed, and power draw to distinguish individual GPU thermal faults from systemic cooling issues, triggering automated power capping while alerting facilities.
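To make the bitmask concrete, here is a minimal decoding sketch. The bit values mirror the nvmlClocksThrottleReason* constants in NVML's nvml.h and should be verified against the headers shipped with your driver and DCGM version. The helper names and the choice of which bits count as thermal are assumptions made for illustration.

```python
# Bit values mirror the nvmlClocksThrottleReason* constants in NVML's nvml.h.
THROTTLE_REASONS = {
    0x001: "GPU Idle",
    0x002: "Applications Clocks Setting",
    0x004: "SW Power Cap",
    0x008: "HW Slowdown",
    0x010: "Sync Boost",
    0x020: "SW Thermal Slowdown",
    0x040: "HW Thermal Slowdown",
    0x080: "HW Power Brake Slowdown",
    0x100: "Display Clock Setting",
}

# Assumption: HW Slowdown is treated as thermal-related here, although it can
# also be raised by power brake assertion or excessive power draw.
THERMAL_MASK = 0x008 | 0x020 | 0x040


def decode_throttle_reasons(mask: int) -> list:
    """Return the names of all known reason bits set in mask, lowest bit first."""
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]


def is_thermal_throttle(mask: int) -> bool:
    """True if any thermal-related bit is set in the throttle reason bitmask."""
    return bool(mask & THERMAL_MASK)


# Example: a sample value with HW Slowdown and HW Thermal Slowdown both set.
value = 0x48
print(decode_throttle_reasons(value), is_thermal_throttle(value))
```

Alerting on is_thermal_throttle rather than on any nonzero value avoids false positives from idle GPUs and from power capping.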

Figure: Thermal Throttling - Temperature Impact on GPU Performance

DCGM Metric Field: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS
