MTBF Math

Fleet failure rate scales linearly with N. A 1024-GPU job built from parts with a 50,000-hour per-GPU MTBF sees an unrecoverable failure roughly every two days.
  • Per-GPU MTBF: ~50,000 h
  • 1024-GPU MTBF: ~49 h
  • 8192-GPU MTBF: ~6 h

Per-GPU reliability looks reassuring. Vendor MTBF (mean time between failures) for an H100 in a well-cooled rack lands in the tens of thousands of hours. The number that runs your operations playbook is not per-GPU MTBF. It is fleet MTBF, which falls linearly as you add GPUs to the job.

The math

For independent failures, the time to the fleet's first failure is the minimum of the N individual failure times, and with exponential lifetimes the rate of that minimum is the sum of the individual rates. If each GPU's failure rate is lambda = 1 / per_GPU_MTBF, the fleet rate is N * lambda, and the fleet MTBF is:

fleet_MTBF ≈ per_GPU_MTBF / N

The approximation hides correlated failures (a power circuit trip takes out a rack at once) and burst-mode events (a bad batch of HBM stacks shipped in the same node). Both push fleet MTBF below the simple division. Good operations assumes the linear estimate is the best case.
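
As a quick sanity check, here is a minimal Python sketch of that approximation. The helper names are mine, and the 50,000-hour figure appears only as the example input; treat it as an illustration of the division above, not a reliability model.

    def fleet_mtbf_hours(per_gpu_mtbf_hours: float, n_gpus: int) -> float:
        """Best-case fleet MTBF assuming independent, exponential failures."""
        return per_gpu_mtbf_hours / n_gpus

    def expected_failures_per_day(per_gpu_mtbf_hours: float, n_gpus: int) -> float:
        """Fleet failure rate N * lambda, expressed per 24 hours."""
        lam = 1.0 / per_gpu_mtbf_hours    # per-GPU failure rate, per hour
        return n_gpus * lam * 24.0

    print(fleet_mtbf_hours(50_000, 1024))           # ~48.8 h
    print(expected_failures_per_day(50_000, 1024))  # ~0.49 failures/day, one every ~2 days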

What it looks like at scale

GPUs     Fleet MTBF    Cadence
8        ~260 days     annual
64       ~32 days      monthly
512      ~4 days       weekly
1024     ~49 hr        daily
8192     ~6 hr         hourly

Assumes per-GPU MTBF = 50,000 h.

The transition that matters is not the absolute number; it is the cadence. At 8 GPUs you handle failures in your annual hardware cycle. At 64 you handle them when convenient. At 512 you handle them at planned weekly maintenance windows. At 1024 you handle them every day, often more than once a day. At 8192 you handle them constantly, and the only viable response is automation.
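
The table falls straight out of the same division. A short sketch that reproduces it, with cadence thresholds that are illustrative buckets rather than anything standardized:

    PER_GPU_MTBF_H = 50_000

    def cadence(fleet_mtbf_h: float) -> str:
        """Map a fleet MTBF to a rough handling cadence (assumed cutoffs)."""
        if fleet_mtbf_h >= 24 * 180:
            return "annual"
        if fleet_mtbf_h >= 24 * 21:
            return "monthly"
        if fleet_mtbf_h >= 24 * 3:
            return "weekly"
        if fleet_mtbf_h >= 12:
            return "daily"
        return "hourly"

    for n in (8, 64, 512, 1024, 8192):
        mtbf = PER_GPU_MTBF_H / n
        print(f"{n:>5} GPUs  fleet MTBF ~{mtbf:8.0f} h  -> {cadence(mtbf)}")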

Above 1000 GPUs, organizations that rely on humans to detect and triage every event run the fleet permanently degraded. There are not enough on-call hours in the week.

Per-GPU MTBF: where the 50,000-hour figure comes from

Vendor data sheets do not publish a single MTBF number. The 50,000-hour figure is a working approximation drawn from accelerated-life testing at the silicon level, in-field data published by hyperscalers, and incident postmortems from frontier-scale training runs. Real per-GPU MTBF varies meaningfully:

  • Steady-state operation in a well-cooled rack: 50,000 to 100,000 hours.
  • First few hundred hours after deployment ("infant mortality"): 5,000 to 20,000 hours.
  • Sustained high-temperature operation: 10,000 to 30,000 hours.
  • Tail of the bathtub curve, end of useful life: drops back below 20,000 hours.

The bathtub shape means a fresh fleet fails more often than a mature one, passes through a long, flat middle, then degrades as the silicon ages. Hyperscalers typically retire GPUs from frontier training before the right-side rise, which is why the operating MTBF you should plan around is the steady-state figure.
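
A rough piecewise sketch of that bathtub shape, using values inside the illustrative ranges above; the age boundaries are assumptions, not vendor data.

    def per_gpu_mtbf_hours(age_hours: float) -> float:
        """Illustrative per-GPU MTBF as a function of fleet age (bathtub shape)."""
        if age_hours < 500:        # infant mortality: first few hundred hours
            return 12_000          # within the 5,000-20,000 h range above
        if age_hours < 35_000:     # long, flat steady-state middle
            return 75_000          # within the 50,000-100,000 h range
        return 15_000              # wear-out tail: back below 20,000 h

    # Plan around the steady-state figure; retire before the wear-out rise.
    print(per_gpu_mtbf_hours(10_000))   # 75000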

Failure modes that drive the math

The "failure" in MTBF is not always a hard fault. The full mix includes:

  • Uncorrectable ECC events. Roughly 30 to 50 percent of incidents at scale.
  • NVLink errors. A flaky link drops the GPU from the all-reduce group; counted as a failure even when the silicon is fine.
  • GPU-fell-off-the-bus (Xid 79). PCIe link loss; requires cold reset, sometimes node reboot.
  • Thermal throttle stuck-on. Sustained throttling presents as a permanent straggler; counts as a failure under operational definitions.
  • Driver and firmware crashes. Less common, harder to triage; counted depending on whether a reboot resolves them.

A pure hardware MTBF excludes the last two; an operational MTBF includes them. Fleet operations runs on the operational definition, so use the operational MTBF when sizing checkpoint frequency and spare capacity.
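
A hedged sketch of the two definitions. The event categories and the hardware/operational split below are illustrative labels, not a standard taxonomy.

    HARDWARE = {"ecc_uncorrectable", "nvlink_error", "xid_79_fell_off_bus"}
    OPERATIONAL_ONLY = {"thermal_throttle_stuck", "driver_firmware_crash"}

    def mtbf_hours(incidents, gpu_hours, include_operational):
        """GPU-hours observed divided by the count of qualifying incidents."""
        counted = [e for e in incidents
                   if e in HARDWARE or (include_operational and e in OPERATIONAL_ONLY)]
        return gpu_hours / max(len(counted), 1)

    events = ["ecc_uncorrectable", "nvlink_error", "thermal_throttle_stuck",
              "xid_79_fell_off_bus", "driver_firmware_crash"]
    month_of_gpu_hours = 1024 * 720   # one month of a 1024-GPU fleet

    print(mtbf_hours(events, month_of_gpu_hours, include_operational=False))  # hardware MTBF
    print(mtbf_hours(events, month_of_gpu_hours, include_operational=True))   # operational MTBF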

What to do with the number

Three planning consequences follow directly from fleet MTBF:

  1. Checkpoint cadence. Checkpoint at intervals much shorter than the fleet MTBF. A 1024-GPU job with a fleet MTBF of 49 hours and a checkpoint every hour loses, on average, 30 minutes of work per failure event. Going to 15-minute checkpoints cuts that to 7.5 minutes per event, at the cost of more I/O traffic to the parallel filesystem (worked through in the sketch after this list).
  2. Hot spare ratio. Hold enough spares to cover expected failures over a drain-and-replace cycle. At 1024 GPUs with an automated 95-second cycle, a 5 percent hot-spare pool (about 51 GPUs) absorbs 50+ failures before exhausting, which is ample.
  3. Run length. Above the size where fleet MTBF approaches the all-reduce timeout, synchronous training stops being viable without intra-job fault tolerance. The dividing line in 2026 is somewhere between 8000 and 16000 GPUs depending on stack maturity.
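
A sketch of the sizing arithmetic in items 1 and 2. The numbers passed in are the worked examples above; the function names are placeholders.

    def lost_minutes_per_failure(checkpoint_interval_min: float) -> float:
        """A failure lands mid-interval on average, losing half of it."""
        return checkpoint_interval_min / 2.0

    def lost_minutes_per_day(n_gpus: int, per_gpu_mtbf_h: float,
                             checkpoint_interval_min: float) -> float:
        failures_per_day = n_gpus * 24.0 / per_gpu_mtbf_h
        return failures_per_day * lost_minutes_per_failure(checkpoint_interval_min)

    def spare_pool_size(n_gpus: int, spare_fraction: float = 0.05) -> int:
        return int(n_gpus * spare_fraction)

    print(lost_minutes_per_failure(60))            # 30.0 min lost per failure
    print(lost_minutes_per_failure(15))            # 7.5 min
    print(lost_minutes_per_day(1024, 50_000, 15))  # ~3.7 min of lost work per day
    print(spare_pool_size(1024))                   # 51 spares in a 5 percent pool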

The math says daily failures at 1024 GPUs. Either the operation runs the math, or the math runs the fleet idle.


Updated 2026-05-09