
Fault Domains

Architectural boundaries that contain the effects of a single failure: rack, power phase, cooling loop, switch, tenant. The smallest enclosing domain is your blast radius.
  • Smallest: Single GPU
  • Common: Rack (~64 GPUs)
  • Largest: Region

A fault domain is a physical or logical boundary that contains the consequences of a single failure. Inside the domain, a fault propagates freely. Outside it, the fault is invisible. Good cluster design stacks fault domains so that the smallest possible boundary catches each class of failure.

The standard hierarchy

[Diagram: a region (blast radius 10,000+ GPUs) contains power phases A and B (~512 GPUs each), which in turn contain racks 1-4; a single GPU failure stays inside its rack. Well-bounded domains shrink the blast radius from regional to local.]

In a typical AI training data center, fault domains nest from smallest to largest:

  1. Single GPU. ECC events, NVLink errors on a single link, thermal throttle on one chip. Contained to one device.
  2. Server / chassis. PCIe bus failure, motherboard event, CPU memory fault. Contained to one host (8 GPUs in an HGX, fewer in PCIe-only nodes).
  3. Rack. A failed top-of-rack switch, a tripped rack PDU, a cooling-loop fault. Roughly 32 to 72 GPUs depending on rack design.
  4. Power phase. A circuit breaker trip on one of three phases takes out one third of the affected branch. Anywhere from a couple of racks to a row.
  5. Cooling loop. A direct-liquid-cooling header pressure event drains every rack on the loop. Architecturally similar in scope to a power phase.
  6. Switch fabric leaf or spine. A leaf switch failure isolates one rack from the fabric. A spine failure halves cross-section bandwidth and may stall every job spanning the full fabric.
  7. Region. Whole-DC events: utility power, fiber cut, HVAC system failure. Rare, catastrophic.

The goal in placement is to size your jobs so they fit inside one domain or are sharded across domains in a pattern that survives losing any single one.
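To make the nesting concrete, here is a minimal sketch of the hierarchy as a parent-pointer structure in Python. The Domain class and every name in it are illustrative, not part of any real inventory system:

    # Minimal, illustrative model of nested fault domains (not a real inventory API).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Domain:
        kind: str                      # "gpu", "server", "rack", "power_phase", "region"
        name: str
        parent: "Domain | None" = None

        def lineage(self) -> list["Domain"]:
            """This domain plus every enclosing domain, smallest first."""
            chain, d = [], self
            while d is not None:
                chain.append(d)
                d = d.parent
            return chain

    # One GPU nested inside server -> rack -> power phase -> region.
    region  = Domain("region", "dc-east")
    phase_a = Domain("power_phase", "phase-A", region)
    rack_7  = Domain("rack", "rack-07", phase_a)
    node_3  = Domain("server", "node-03", rack_7)
    gpu_5   = Domain("gpu", "gpu-5", node_3)

    print([d.name for d in gpu_5.lineage()])
    # ['gpu-5', 'node-03', 'rack-07', 'phase-A', 'dc-east']

The lineage of a failing component is exactly the set of candidate blast radii; the smallest entry that actually contains the fault is the one that matters.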

Domain awareness in scheduling

Schedulers that "know" the topology can place ranks so that a single domain failure does not take out the whole job. Two patterns:

  • Domain-local placement. Pack a small job into a single rack so the rack's failure is the worst case. The blast radius is bounded by the rack size, not by the job size.
  • Domain-spread placement. For redundant or replicated workloads, spread replicas across racks so no single rack outage breaks the job. The same pattern that powers cloud zone-redundancy applies inside a single GPU cluster.

For synchronous training, domain-local is usually right: the job is monolithic, every rank is critical, and the smaller the blast radius the cheaper the recovery. For data preprocessing pipelines, replica services, or anything tolerant of restart, domain-spread is right: the redundancy buys real availability.
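A sketch of the two patterns, assuming a flat inventory mapping free GPUs to their rack. The function names and data shapes are assumptions for illustration, not a real scheduler API:

    from collections import defaultdict

    def domain_local(free_gpus: dict[str, str], need: int) -> list[str] | None:
        """Pack the job into one rack, so that rack is the worst-case blast radius."""
        by_rack = defaultdict(list)
        for gpu, rack in free_gpus.items():
            by_rack[rack].append(gpu)
        for gpus in by_rack.values():
            if len(gpus) >= need:
                return gpus[:need]
        return None                      # no single rack can hold the job

    def domain_spread(free_gpus: dict[str, str], replicas: int) -> list[str] | None:
        """Place one replica per rack, so no single rack outage breaks the job."""
        by_rack = defaultdict(list)
        for gpu, rack in free_gpus.items():
            by_rack[rack].append(gpu)
        if len(by_rack) < replicas:
            return None                  # not enough distinct racks for the redundancy
        return [gpus[0] for gpus in list(by_rack.values())[:replicas]]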

Why "the smallest enclosing domain" matters

When you write your runbook, the answer to "which GPUs need to be drained?" is not "the failed one." It is "every GPU inside the smallest fault domain that contains the failed one, plus any GPUs participating in the same job as those."

A bad NVLink between two GPUs in a rack does not require draining the whole rack. A failed top-of-rack switch does. A drained DLC header does. The drain decision follows the domain, not the visible symptom.

Misclassifying the domain is expensive in both directions. Drain too small a domain and the failure recurs because the upstream cause is still active. Drain too large a domain and you idle GPUs that were never affected.
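The drain rule can be written down directly. A sketch, assuming the smallest enclosing domain has already been identified and that you track each GPU's domain membership and job assignment (the data shapes here are invented for illustration):

    def drain_set(failed_domain: str,
                  gpu_domains: dict[str, set[str]],   # GPU -> names of every domain it sits inside
                  job_of: dict[str, str],             # GPU -> job id, for GPUs currently running a job
                  ) -> set[str]:
        """GPUs to drain: everything inside the smallest enclosing fault domain,
        plus every GPU sharing a job with one of those."""
        in_domain = {g for g, doms in gpu_domains.items() if failed_domain in doms}
        affected_jobs = {job_of[g] for g in in_domain if g in job_of}
        job_mates = {g for g, j in job_of.items() if j in affected_jobs}
        return in_domain | job_mates

Picking the wrong failed_domain reproduces exactly the two failure modes above: too small and the cause stays live, too large and healthy GPUs sit idle.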

Where fault domains break the abstraction

The neat hierarchy above hides three real-world wrinkles:

  • NVLink and NVSwitch domains do not align with rack boundaries on every chassis. NVL72-style racks put 72 GPUs on a single NVLink switch fabric, which means the rack is also the NVLink failure boundary. Older 8-GPU HGX nodes confine the NVLink domain to the chassis, a much smaller group.
  • Power and network often do not share a phase boundary with cooling. A facility designed with N+1 redundancy on power may have N+0 cooling, in which case the cooling loop is the dominant single-fault domain even though it is invisible from the network topology.
  • Multi-tenant clusters add a logical fault domain on top of the physical one. A noisy-neighbor incident or per-tenant scheduling bug propagates inside one tenant but not across tenants. Get this isolation wrong and a customer outage cascades into every customer on the same fabric.

The architectural rule of thumb: enumerate every shared resource (power, cooling, switch, rack PDU, BMC, control-plane node) and treat each as its own fault domain. The number of distinct domains in a real cluster is larger than most diagrams suggest. The number of correlated failure modes is even larger.
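One way to apply that rule of thumb is to invert a per-GPU inventory of shared resources, so each distinct resource becomes its own fault domain with an explicit member list. A sketch with invented field names and values:

    from collections import defaultdict

    # Per-GPU inventory of shared resources (values are illustrative).
    inventory = {
        "gpu-0": {"pdu": "pdu-3", "cooling": "loop-1", "leaf": "leaf-12", "bmc": "bmc-7"},
        "gpu-1": {"pdu": "pdu-3", "cooling": "loop-1", "leaf": "leaf-12", "bmc": "bmc-7"},
        "gpu-2": {"pdu": "pdu-4", "cooling": "loop-1", "leaf": "leaf-13", "bmc": "bmc-8"},
    }

    # Invert it: every distinct shared resource is a fault domain with its member GPUs.
    domains: dict[tuple[str, str], set[str]] = defaultdict(set)
    for gpu, resources in inventory.items():
        for kind, name in resources.items():
            domains[(kind, name)].add(gpu)

    for (kind, name), members in sorted(domains.items()):
        print(f"{kind}:{name} -> {sorted(members)}")

Every row in the inverted map is a set of GPUs that fail together when that resource fails; the row count is the real number of fault domains, and it is usually larger than the diagram suggests.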


Updated 2026-05-09