TDP (Thermal Design Power)
Maximum sustained GPU power dissipation rating, measured in watts.
What it is
TDP (Thermal Design Power) is the maximum sustained power a GPU is designed to dissipate under worst-case workload conditions, serving as the primary spec for cooling system design and power infrastructure provisioning. Reference values include 400W for A100 SXM, 700W for H100 SXM, 1000W for B200, and 350W for L40S. The current enforced power limit is exposed via DCGM_FI_DEV_ENFORCED_POWER_LIMIT and can be queried or modified through nvidia-smi.
Why it matters
When actual power draw approaches TDP for sustained periods, GPU firmware enforces power throttling by reducing SM and memory clock frequencies -- manifesting as DCGM_FI_DEV_CLOCK_THROTTLE_REASONS bit 5 (power brake slowdown). Dense 8x H100 nodes draw up to 10.2 kW per node; a GPU consistently drawing within 5% of its TDP limit means any ambient temperature rise can push it into throttling. Intermittent power throttling appears as random 3-5% throughput fluctuations with no obvious errors in training logs.
How to monitor
Track the ratio of DCGM_FI_DEV_POWER_USAGE to DCGM_FI_DEV_ENFORCED_POWER_LIMIT and alert when a GPU sustains above 95% of its limit. Correlate with DCGM_FI_DEV_CLOCK_THROTTLE_REASONS to confirm power-driven throttling. Factryze monitors this ratio fleet-wide, identifies GPUs operating in the power throttling zone, and can apply automated power caps to stabilize clock frequencies.
DCGM_FI_DEV_ENFORCED_POWER_LIMITRelated terms
Limiting GPU power draw below TDP to control thermals and rack density.
Automatic GPU clock reduction when die temperature exceeds 83-90C safe limits.
GPU core compute clock frequency in MHz, scaling between base and boost.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
Get Started Free