Thermal Stragglers
A thermal straggler is a GPU that runs hotter than its peers, hits its junction-temperature throttle around 87 °C, and clocks down to protect itself. In a synchronous training step, every other GPU waits for the slowest peer's all-reduce to complete. One throttling GPU costs every peer the difference, every step, until the thermal asymmetry is resolved.
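To put a number on that, here is a minimal sketch (hypothetical timings, assuming an eight-rank synchronous job) of how one throttled rank sets the pace for every step:

```python
import numpy as np

# Hypothetical 8-rank synchronous step: seven healthy ranks at 400 ms
# and one thermally throttled rank running 25% slower.
step_ms = np.array([400.0] * 7 + [400.0 * 1.25])

# A synchronous all-reduce completes only when the slowest rank arrives,
# so every rank steps at the straggler's pace.
fleet_step_ms = step_ms.max()   # 500 ms, not 400 ms
print(f"fleet step time: {fleet_step_ms:.0f} ms "
      f"({400.0 / fleet_step_ms:.0%} of healthy throughput)")
```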
How it shows up
The signature is two correlated traces. The temperature time-series shows seven GPUs hovering inside a normal envelope (78 to 82 °C in a healthy DLC rack) and one GPU climbing steadily across the run. Near the start of the run nothing looks wrong. As the temperature climbs past 87 °C, the GPU begins thermal throttling: the clock drops, kernels slow, and the step-time bar for that rank lengthens.
The lower pane is what the operator usually notices first. Most fleet observability dashboards show step time per rank as a bar chart. A single bar that is 25 % longer than its peers is the giveaway. The temperature trace is the cause; the step-time bar is the effect.
Why thermal stragglers are different from random stragglers
Random stragglers come from anywhere: a noisy neighbor, a flaky NIC, a one-off kernel-launch hiccup. Thermal stragglers have three structural properties that make them easier to diagnose:
- They are predictable. A GPU that is thermally marginal at step 100 will be thermally marginal at step 1000. Step-time outliers from this GPU are persistent across the run, not random.
- They are location-correlated. Hot aisle, dirty air filter on a neighboring rack, weak fan, low-flow cold plate, partial blockage in the secondary loop. The same physical position keeps producing the same temperature behavior across different jobs that happen to land there.
- They are recurring. A node that was a thermal straggler last week is the most likely node to be a thermal straggler this week. The fix is a physical fix, not a software fix.
Compare that to a silent data corruption event, which is one-shot and unsignalled. Thermal stragglers are noisy by comparison: temperature is a first-class telemetry signal and step time is observable per rank.
The detection signal
Two telemetry sources answer the question of whether a slow rank is a thermal straggler:
- DCGM (NVIDIA Data Center GPU Manager) publishes DCGM_FI_DEV_GPU_TEMP per GPU at high cadence. Pair it with DCGM_FI_DEV_POWER_USAGE to distinguish a hot GPU running at its TDP (probably fine) from a hot GPU running below TDP because it is already throttling (definitely a straggler).
- Per-rank step-time histograms from the training framework. PyTorch with NCCL exposes per-rank wall-clock time per step; you can also infer it from collective-completion timestamps. A minimal measurement sketch follows this list.
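One way to collect that per-rank wall clock: a minimal sketch assuming torch.distributed is already initialized with the NCCL backend, where model_step is a hypothetical callable wrapping forward, backward, and the optimizer step.

```python
import time
import torch
import torch.distributed as dist

def timed_step(model_step):
    """Run one training step and gather every rank's wall-clock time."""
    torch.cuda.synchronize()            # drain queued work before timing
    t0 = time.perf_counter()
    model_step()
    torch.cuda.synchronize()            # wait for this step's kernels
    elapsed = torch.tensor([time.perf_counter() - t0], device="cuda")

    # Gather per-rank step times so any rank can spot a persistent outlier.
    times = [torch.zeros_like(elapsed) for _ in range(dist.get_world_size())]
    dist.all_gather(times, elapsed)
    return [t.item() for t in times]
```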
The correlation is the signal. A persistent step-time outlier on the same rank, paired with a temperature crossing 87 °C on the same GPU, is a thermal straggler with very high confidence. Either signal alone is ambiguous; together they are diagnostic.
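A detector that insists on both halves of the signature can be sketched in a few lines. This assumes per-rank series have already been scraped (temperatures from DCGM, step times from the framework) into ranks-by-steps arrays; the thresholds are illustrative, not tuned values.

```python
import numpy as np

def thermal_straggler_ranks(temps_c, step_s,
                            temp_limit=87.0, slow_ratio=1.15,
                            min_fraction=0.5):
    """Return ranks that are both hot and slow for most of the run.

    temps_c, step_s: arrays of shape (ranks, steps).
    """
    baseline = np.median(step_s, axis=0)    # healthy step time, per step
    slow = step_s > slow_ratio * baseline   # persistent step-time outlier
    hot = temps_c >= temp_limit             # at or past the throttle point
    # Either signal alone is ambiguous; require both on the same rank
    # for at least min_fraction of the observed steps.
    both = (slow & hot).mean(axis=1) >= min_fraction
    return np.flatnonzero(both)
```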
What to do about it
The right response depends on whether the GPU is intermittently or persistently throttling.
Intermittent. Check airflow first. Most thermal stragglers in air-cooled racks come from a single dirty filter, a single failed fan, or a misaligned blank panel that lets recirculation contaminate the cold aisle. In DLC racks, check secondary-loop temperature on the affected rack: a partial blockage or low-flow event in one CDU loop will produce exactly this symptom.
Persistent. If the GPU is consistently 5 °C hotter than its peers under the same load, the fix is physical. Either the cold-plate contact has degraded (TIM aged or contaminated), the package itself is marginal, or there is a flow-balancing problem upstream. None of these are software-fixable. The operational answer is drain and replace for the affected node.
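The drain itself is a one-line scheduler operation. A sketch assuming a Slurm-managed cluster (the node name and reason string are illustrative; Kubernetes cordon/drain is the equivalent elsewhere):

```python
import subprocess

def drain_node(node, reason="thermal straggler: persistent temp outlier"):
    """Mark a node DRAIN so Slurm stops scheduling new jobs onto it."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
         f"Reason={reason}"],
        check=True,
    )

drain_node("gpu-node-042")   # hypothetical node name
```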
Power capping the hot GPU is a tempting shortcut and almost always wrong. Capping one GPU in a synchronous job creates a deterministic straggler instead of an intermittent one. The fleet step time becomes the capped GPU's step time. If you must cap, cap every rank uniformly.
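If you do cap, apply one limit to every GPU on every node in the job. A sketch using nvidia-smi's power-limit flag (-pl), which requires root privileges and a value inside the board's supported range:

```python
import subprocess

def cap_node_gpus(watts, gpu_count=8):
    """Set the same power limit on every local GPU; run this on each node."""
    for i in range(gpu_count):
        subprocess.run(
            ["nvidia-smi", "-i", str(i), "-pl", str(watts)],
            check=True,
        )
```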
Practical guidance
- Watch P99 step time alongside mean; a sketch follows this list. Mean is fine when the fleet is healthy and lies when it isn't.
- Pair temperature with power draw when alerting. Hot at TDP is normal; hot at low draw is a thermal straggler that is already throttling.
- Track straggler frequency by node. Persistent offenders are physical-fix candidates.
- Build the link between observability and operations. A thermal straggler that nobody connects to a runbook is a 25 % productivity tax on every job that lands on that node.
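A sketch for the first item, assuming fleet step times are already collected into an array. With an intermittent straggler most steps look healthy, so the mean barely moves while P99 sits at the throttled pace:

```python
import numpy as np

def step_time_summary(step_s):
    """Summarize a series of fleet step times; P99 exposes what mean hides."""
    return {
        "mean_s": float(np.mean(step_s)),
        "p50_s": float(np.percentile(step_s, 50)),
        "p99_s": float(np.percentile(step_s, 99)),
    }

# Illustrative: 950 healthy 0.4 s steps plus 50 throttled 0.5 s steps.
steps = np.array([0.4] * 950 + [0.5] * 50)
print(step_time_summary(steps))   # mean ~0.405 s, p99 = 0.5 s
```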
The straggler problem has a long tail. Eliminating all thermal stragglers is not realistic at scale. Bounding the worst case with detection, alerts, and a fast drain-and-replace path is realistic, and is the difference between a fleet that runs at 90 % efficiency and one that runs at 70 %.
Updated 2026-05-09