
Stragglers and Blast Radius

When one GPU slows down, every GPU slows down. The blast radius of a single failure determines how quickly a fleet recovers from incidents.
  • Step time: max(GPU_i) across all ranks
  • Sync blast radius: N of N
  • Detection target: seconds

A synchronous training step ends when every rank finishes its forward pass, every rank finishes its backward pass, and the all-reduce that follows is complete. Step time is therefore not the average GPU's time. It is the slowest GPU's time. One degraded chip drags the rest along behind it.
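A minimal sketch of where that synchronization sits, assuming PyTorch with torch.distributed already initialized (for example via torchrun); model, batch, loss_fn, and optimizer are placeholders. The all-reduce returns only after every rank has contributed, which is why the measured step time on every rank converges to the slowest rank's:

import time
import torch
import torch.distributed as dist

def timed_step(model, batch, loss_fn, optimizer):
    # Measure one synchronous step from this rank's point of view.
    torch.cuda.synchronize()
    t0 = time.perf_counter()

    loss = loss_fn(model(batch["x"]), batch["y"])   # forward
    loss.backward()                                  # backward

    # Gradient all-reduce: blocks until every rank has contributed,
    # so one straggler stretches this call for every healthy rank.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return time.perf_counter() - t0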

Stragglers, defined

A straggler is a rank whose per-step duration sits meaningfully above the cohort median. The cause might be transient (thermal throttling, an ECC retry, a slow PCIe rebind) or persistent (a partial NVLink failure, a flaky HBM stack with a high single-bit-error rate, a bad DIMM on the host CPU). Either way, the symptom is the same: every healthy GPU spends the difference idle.

[Figure: per-rank step durations for one 8-GPU step; the step ends when the slowest rank finishes]
GPU 0: 100 ms   GPU 1: 102 ms   GPU 2: 98 ms    GPU 3: 101 ms
GPU 4: 99 ms    GPU 5: 178 ms (straggler)   GPU 6: 100 ms   GPU 7: 103 ms
step time = max(GPU_i) for all i in [0, N)

The math is brutal. If 7 GPUs finish their step in 100 ms and one takes 178 ms, the step takes 178 ms. The 7 healthy GPUs each waste 78 ms per step. At 100,000 steps per training run, that is over two GPU-hours of stall time per healthy GPU. Multiply across the 1023 healthy GPUs in a 1024-GPU job and the loss is measured in GPU-days.
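The same arithmetic, spelled out; the figures are the ones from this paragraph, so adjust them for your own fleet:

healthy_step_ms   = 100
straggler_step_ms = 178
steps             = 100_000
healthy_gpus      = 1023                                   # 1024-GPU job, one straggler

stall_ms_per_gpu  = (straggler_step_ms - healthy_step_ms) * steps
stall_h_per_gpu   = stall_ms_per_gpu / 1000 / 3600         # ~2.2 GPU-hours each
fleet_stall_gpu_h = stall_h_per_gpu * healthy_gpus         # ~2,200 GPU-hours
fleet_stall_days  = fleet_stall_gpu_h / 24                 # ~92 GPU-days

print(f"{stall_h_per_gpu:.1f} GPU-h per healthy GPU, "
      f"{fleet_stall_days:.0f} GPU-days across the job")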

Blast radius

Blast radius is the count of GPUs that are affected by a single failure. In synchronous data-parallel training with FSDP or ZeRO sharding, the blast radius is the entire job: when one rank cannot complete an all-reduce, every rank blocks. The radius is N out of N.

Tensor parallelism within a node has a smaller blast radius, but only because the failure unit is also smaller. If one GPU in a 4-way tensor-parallel group dies, the whole group goes down, but other groups keep running. Pipeline parallelism is similar: one bad stage stalls the pipeline, but only that pipeline.

The architectural tradeoff: smaller failure domains shrink the blast radius but cost coordination overhead. The right answer depends on how often you expect failures and how expensive recovery is.
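One toy way to state the tradeoff in code; the layouts and the split into pipeline replicas of 128 GPUs are illustrative assumptions, not a prescription:

def blast_radius(failure_domain_gpus: int) -> int:
    # GPUs that stop making progress when one GPU inside the domain fails.
    return failure_domain_gpus

N = 1024
layouts = {
    "synchronous data-parallel (FSDP/ZeRO)": N,    # the all-reduce spans every rank
    "4-way tensor-parallel group":           4,    # one group dies, the rest keep running
    "pipeline replica of 128 GPUs":          128,  # one bad stage stalls only its pipeline
}
for name, domain in layouts.items():
    print(f"{name}: blast radius {blast_radius(domain)} of {N}")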

MTBF math at fleet scale

Per-GPU MTBF (mean time between failures) figures from NVIDIA quote tens of thousands of hours. The number that matters at scale is the fleet MTBF, which is roughly the per-GPU MTBF divided by N:

fleet_MTBF ≈ per_GPU_MTBF / N
 
per_GPU_MTBF = 50,000 h   (vendor figure)
N            = 1024
fleet_MTBF   ≈ 49 h       (about two days)

A 1024-GPU job will see, on average, an unrecoverable failure every two days. A 16,000-GPU job sees one every three hours. This is why training runs at frontier scale invest in checkpoint frequency and asynchronous restart, not because individual GPUs are unreliable but because the product of N and time defeats good per-GPU reliability.
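A hedged sketch of the same arithmetic over a run; the vendor MTBF figure and the 30-day run length are assumptions for illustration:

def fleet_mtbf_hours(per_gpu_mtbf_h: float, n_gpus: int) -> float:
    # The fleet fails roughly N times as often as a single GPU.
    return per_gpu_mtbf_h / n_gpus

per_gpu_mtbf = 50_000.0          # vendor figure, hours
run_hours    = 30 * 24           # assumed 30-day training run

for n in (1024, 16_000):
    mtbf = fleet_mtbf_hours(per_gpu_mtbf, n)
    failures = run_hours / mtbf
    print(f"N={n}: fleet MTBF ~{mtbf:.1f} h, ~{failures:.0f} expected failures over 30 days")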

Detection: the seconds that matter

Once a GPU enters a failure mode, every second before remediation is a second of stalled fleet. Three signals sit ahead of the all-reduce timeout:

  1. DCGM ECC counters. A burst of single-bit errors precedes most uncorrectable failures by minutes to hours. The volatile counter (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL) is the early-warning channel.
  2. Xid events in dmesg. The NVIDIA kernel driver emits Xid codes seconds before the failure becomes visible to userspace. Xid 48 (uncorrectable ECC), Xid 79 (GPU off the bus), and Xid 94 (contained ECC on Hopper) are the high-severity codes.
  3. Per-rank step-time variance. Track the rolling P99 of per-rank step duration. A GPU whose step time has drifted 30% above the cohort median is already a straggler, even before any hardware fault is logged; a minimal detector sketch follows this list.
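A minimal sketch of signal 3, assuming each rank reports its step duration to a central monitor; the 50-step window and the 30% threshold are placeholders to tune per workload, and the per-rank median stands in for the rolling P99 described above:

from collections import deque
from statistics import median

WINDOW    = 50      # recent steps kept per rank (assumption)
THRESHOLD = 1.30    # 30% above cohort median => straggler

class StragglerDetector:
    def __init__(self, n_ranks: int):
        self.history = [deque(maxlen=WINDOW) for _ in range(n_ranks)]

    def record(self, rank: int, step_seconds: float) -> None:
        self.history[rank].append(step_seconds)

    def stragglers(self) -> list[int]:
        # Recent median per rank vs. the median across ranks; a rolling P99
        # works the same way with a higher quantile.
        recents = [median(h) for h in self.history if h]
        if not recents:
            return []
        cohort = median(recents)
        return [rank for rank, h in enumerate(self.history)
                if h and median(h) > THRESHOLD * cohort]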

Operational play

  • Checkpoint at a cadence shorter than fleet MTBF. A 1024-GPU job saving every hour loses on average 30 minutes of work per failure; saving every 15 minutes loses 7.5 minutes (a rough expected-loss model follows this list).
  • Scope alerts to SLA exposure, not raw event volume. A single Xid 48 on a node hosting one rank of a 1024-GPU job is a fleet-wide incident. The same Xid 48 on an idle node is a maintenance ticket.
  • Drain proactively, not reactively. Once a GPU's straggler signal is consistent for more than a few minutes, the cheapest path is to checkpoint the job, drain the rank, swap in a hot spare, and restart. Waiting for the GPU to fail outright costs the time-to-detect plus the time-to-restart from a less recent checkpoint.
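A rough expected-loss model for the checkpoint-cadence bullet, assuming failures land uniformly within an interval (so the mean loss per failure is half the interval) and the fleet MTBF from the section above:

def expected_lost_hours(ckpt_interval_h: float, fleet_mtbf_h: float,
                        run_hours: float) -> float:
    failures = run_hours / fleet_mtbf_h
    return failures * (ckpt_interval_h / 2)

fleet_mtbf = 50_000 / 1024                     # ~49 h for a 1024-GPU job
for interval_h in (1.0, 0.25):
    lost = expected_lost_hours(interval_h, fleet_mtbf, run_hours=30 * 24)
    print(f"checkpoint every {interval_h:g} h: ~{lost:.1f} h of recomputed work "
          f"over a 30-day run")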

The instinct to keep a flaky GPU running ("it's still working") is wrong at fleet scale. Every minute it stays in the job is a minute the entire job runs degraded.

Updated 2026-05-09