Skip to main content
GPU Glossary/Monitoring Metrics
Monitoring Metrics

TDP (Thermal Design Power)

Maximum sustained GPU power dissipation rating, measured in watts.

What it is

TDP (Thermal Design Power) is the maximum sustained power a GPU is designed to dissipate under worst-case workload conditions, serving as the primary spec for cooling system design and power infrastructure provisioning. Reference values include 400W for A100 SXM, 700W for H100 SXM, 1000W for B200, and 350W for L40S. The current enforced power limit is exposed via DCGM_FI_DEV_ENFORCED_POWER_LIMIT and can be queried or modified through nvidia-smi.

Why it matters

When actual power draw approaches TDP for sustained periods, GPU firmware enforces power throttling by reducing SM and memory clock frequencies -- manifesting as DCGM_FI_DEV_CLOCK_THROTTLE_REASONS bit 5 (power brake slowdown). Dense 8x H100 nodes draw up to 10.2 kW per node; a GPU consistently drawing within 5% of its TDP limit means any ambient temperature rise can push it into throttling. Intermittent power throttling appears as random 3-5% throughput fluctuations with no obvious errors in training logs.

How to monitor

Track the ratio of DCGM_FI_DEV_POWER_USAGE to DCGM_FI_DEV_ENFORCED_POWER_LIMIT and alert when a GPU sustains above 95% of its limit. Correlate with DCGM_FI_DEV_CLOCK_THROTTLE_REASONS to confirm power-driven throttling. Factryze monitors this ratio fleet-wide, identifies GPUs operating in the power throttling zone, and can apply automated power caps to stabilize clock frequencies.

TDP - Thermal Design Power BudgetTDP - Thermal Design Power Budget
Pinch to zoom, drag to pan, double-tap to toggle
TDP - Thermal Design Power BudgetTDP - Thermal Design Power Budget
DCGM Metric Field
DCGM_FI_DEV_ENFORCED_POWER_LIMIT

Monitor this automatically

Factryze correlates GPU signals in real time: errors, clocks, and fabric health.

Get Started Free