
Power Capping

Software-enforced wattage limits per GPU (`nvidia-smi -pl`), used to fit a fleet inside a tight power budget at the cost of peak performance.

| Tool | H100 default | Effect |
| --- | --- | --- |
| `nvidia-smi -pl` | 700 W | clock throttle |

Power capping is the software lever that fits a fleet inside a power budget the hardware would otherwise exceed. The mechanism is simple: the GPU enforces a watt ceiling by reducing clock when the projected draw would cross the cap. The cost is throughput. The benefit is that the breaker stays closed.

What the cap actually does

[Figure: power-over-time trace for one training step, uncapped (peaks ~720 W) vs capped at 600 W; the cap clips peak watts and the step takes roughly 12 % longer. The work still happens; it just takes longer.]

`nvidia-smi -pl <watts>` sets a power-budget envelope for a GPU. Inside the firmware, the power management unit samples board draw at high frequency and pulls clocks down whenever the rolling estimate of next-interval draw would cross the cap. The cap is enforced as a clock throttle, not a voltage cut. From the workload's point of view, kernels simply run slower whenever they would otherwise have drawn more power.
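
One way to watch the mechanism directly is to set a cap and poll the software power-cap throttle flag while a heavy kernel runs; when it reads Active, the cap is currently pulling clocks down. A minimal sketch (these are standard `nvidia-smi` query fields, though availability can vary by driver version):

# Set a 600 W cap on GPU 0 (requires root)
sudo nvidia-smi -i 0 -pl 600

# While a heavy workload runs, watch draw, SM clock, and the SW power-cap
# throttle flag; "Active" means the cap is currently pulling clocks down
nvidia-smi -i 0 \
  --query-gpu=power.draw,power.limit,clocks.sm,clocks_throttle_reasons.sw_power_cap \
  --format=csv -l 1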

For an H100 SXM5, the default cap is the rated 700 W. Setting `nvidia-smi -pl 600` clips the upper end of the power curve to 600 W. The matmul-heavy regions of a training step that previously peaked above 700 W now flatten at the cap. The work still happens; the step simply takes longer. On transformer pretraining workloads, capping an H100 at 600 W typically costs 10 to 15 % of step throughput. FP8 workloads tend to lose more (15 to 20 %) because they push the GPU harder per second of wall clock.
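
Before committing to a number, it is worth checking what range the board will accept; `nvidia-smi -q -d POWER` reports the default, minimum, and maximum limits (exact output fields vary across driver generations):

# Show default, min, and max power limits the board will accept
nvidia-smi -q -d POWER

# Caps outside the reported [min, max] range are rejected by the driver
sudo nvidia-smi -pl 600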

When capping is the right tool

Three legitimate uses:

  1. Facility constraint. The rack has 9.9 kW of usable continuous capacity but the workload draws 11 kW uncapped. Cap to bring the draw into spec (a back-of-the-envelope sketch follows this list). The fleet runs slower; it does not trip.
  2. Mixed workload colocation. Inference and training share a rack. Inference traffic is bursty, training is constant. Cap training so its peaks do not collide with inference bursts.
  3. Brownout response. The facility instructs the rack to draw less right now. A fleet-management agent can push a tighter cap to every GPU downstream of the affected PDU within seconds.
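
The facility-constraint case reduces to simple division, subject to the cap range the board supports. A back-of-the-envelope sketch, assuming 16 GPUs share the 9.9 kW budget from item 1 and ignoring non-GPU draw:

# Illustrative arithmetic for item 1: 16 GPUs behind 9.9 kW of usable rack power
RACK_BUDGET_W=9900
GPU_COUNT=16
# Naive per-GPU cap; a real budget would first subtract CPU, fan, and PSU overhead
PER_GPU_CAP=$(( RACK_BUDGET_W / GPU_COUNT ))   # 618 W

# Apply the same cap to every GPU on the node (run on each node in the rack)
sudo nvidia-smi -pl "$PER_GPU_CAP"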

When capping is the wrong tool

The cap interacts badly with synchronous training. If you cap a GPU that is otherwise healthy, every other GPU in the same all-reduce now finishes faster than the capped one. The capped GPU is now a thermal straggler by another name: every step ends when it does, not when the average peer does. The blast radius of a single capped GPU is the entire job.

The right way to think about this: capping is a fleet-wide lever, not a per-GPU lever. If you must cap, cap every GPU in the same job uniformly. Capping individual GPUs to "save power on the slow ones" inverts the optimization and makes every step in that job worse, not better.

# Cap every GPU on a node uniformly
nvidia-smi -pl 600

# Inspect current cap
nvidia-smi --query-gpu=power.limit --format=csv,noheader

# Persisting the cap across reboots requires either:
#  - a system service running nvidia-smi -pl on boot, or
#  - the persistent power limit feature, set via vendor tooling
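
A minimal sketch of the first option, assuming a systemd host; the unit name, the 600 W figure, and the ordering after nvidia-persistenced are placeholders to adapt, and enabling persistence mode first keeps driver state from being dropped between clients:

# Sketch of a boot-time service (systemd assumed; unit name and wattage are placeholders)
sudo tee /etc/systemd/system/gpu-power-cap.service > /dev/null <<'EOF'
[Unit]
Description=Apply GPU power cap at boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 600

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable gpu-power-cap.service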

The interaction with collectives

Synchronous training step time is bounded by the slowest rank. If every rank is capped at the same wattage and every rank is in the same thermal envelope, step time grows uniformly and the throughput hit is the predictable 10 to 15 %. If rank-by-rank thermal differences mean some GPUs hit the cap and some do not, you have introduced step-time variance that compounds with stragglers and blast radius. The variance, not the cap itself, is what hurts.

This is why operators that run capped fleets watch step-time P99 (or P99.9) more carefully than mean step time. The mean tells you the overall hit; the tail tells you whether the cap is biting unevenly.
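
A rough way to watch both numbers from a flat log of per-step wall-clock times, assuming one duration in seconds per line (the file name and format are assumptions):

# step_times.txt: one step duration in seconds per line
sort -n step_times.txt | awk '
  { v[NR] = $1; sum += $1 }
  END {
    idx = int(NR * 0.99); if (idx < 1) idx = 1   # crude nearest-rank P99
    printf "mean %.3fs  p99 %.3fs  p99/mean %.2f\n", sum / NR, v[idx], v[idx] / (sum / NR)
  }'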

Practical guidance

  1. Cap every GPU in a job uniformly. Never cap one rank in a synchronous training run.
  2. Validate the throughput cost on your workload before committing (a sketch of such an A/B check follows this list). The 10 to 15 % H100 number is representative, not universal.
  3. Watch step-time P99 alongside mean. If the tail grows, the cap is biting unevenly across the fleet, which usually means a thermal asymmetry that should be fixed at the cooling layer.
  4. Treat capping as a budget tool, not a long-term performance strategy. If your sustainable budget is consistently below your hardware's spec, you are buying the wrong density and should fix the rack design, not the software.
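
A sketch of the A/B check from item 2; run_training_benchmark stands in for whatever fixed-length job reports your tokens-per-second number, and the wattages are examples:

# Compare throughput at the default cap and the proposed cap
for CAP in 700 600; do
  sudo nvidia-smi -pl "$CAP"
  echo "cap=${CAP}W"
  run_training_benchmark        # placeholder: a fixed-length run that prints tokens/s
done

# Restore the default cap afterwards
sudo nvidia-smi -pl 700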

Capping is correct in some situations and wrong in many more. The rule of thumb is that it is correct when imposed by something outside the GPU's control (a building, a contract, a regulator) and almost always wrong when used to paper over a fleet-design or cooling problem the operator should be fixing upstream.
