Scale AtlasChapter 8 of 86 termsUpdated 2026-05-09
Failure at Scale
What breaks at scale and how it propagates. Stragglers slow the slowest, blast radius decides recovery, MTBF math sets the operational ceiling.
Drain and Replace
The standard remediation runbook for a degraded GPU: checkpoint, drain in-flight work, cordon the node, activate a hot spare, restart. The faster this runs, the less fleet idles.
Fault Domains
Architectural boundaries within which a single failure cannot propagate. Rack, power phase, cooling loop, switch, and tenant. The smallest enclosing domain is your blast radius.
Gang Failure
Gang-scheduled jobs need every rank alive at every step. One rank dies, the whole gang stops. The gang is the failure unit, not the rank.
MTBF Math
Fleet failure rate scales linearly with N. A 1024-GPU job using 50,000 hour per-GPU MTBF parts sees one unrecoverable failure roughly every two days.
Silent Data Corruption
Faults that produce wrong results without raising any alarm. No Xid, no ECC counter, no error log. The output is just incorrect, and you only find out if you go looking.
Stragglers and Blast Radius
When one GPU slows down, every GPU slows down. The blast radius of a single failure determines how quickly a fleet recovers from incidents.