
Gang Failure

Gang-scheduled jobs need every rank alive at every step. One rank dies, the whole gang stops. The gang is the failure unit, not the rank.
Failure unit: Whole job
Recovery: Full restart
Counter-pattern: Replicated

A gang job is one where every rank must be alive simultaneously for any rank to make progress. Synchronous training is the canonical example: forward and backward passes are joined by collective operations, every collective requires every rank, and the loss of any one rank takes the whole job offline. The gang is the failure unit. Lose one rank, restart the whole thing.

Two failure modes, two outcomes

[Figure: One rank fails, what happens. Replicated service: failure is local, 7 of 8 ranks keep serving. Synchronous training: failure is global, 0 of 8 make progress, restart all. Red = failed, grey = stalled waiting, mint = healthy.]

The contrast is stark. A replicated service handles a rank failure as a degradation: traffic drops to the surviving replicas, capacity falls 1/N, the job continues. A gang job handles the same rank failure as an outage: every other rank stops, the all-reduce times out, the job terminates and rolls back to its last checkpoint.

The architecture is the choice. Synchronous data-parallel training, tensor parallelism, and pipeline parallelism are gang patterns by construction. Inference servers, data preprocessing pipelines, and replicated batch jobs are non-gang patterns. The same hardware can run either; the failure semantics follow the workload, not the cluster.

Why training is a gang job

In the synchronous data-parallel case, a training step looks like this (sketched with torch.distributed):

import torch.distributed as dist

for step in range(num_steps):
    out = model(batch)                       # every rank, in parallel
    loss = loss_fn(out, labels)              # every rank
    loss.backward()                          # every rank, gradients stay local
    for p in model.parameters():
        dist.all_reduce(p.grad)              # every rank participates, blocks all
        p.grad /= dist.get_world_size()      # turn the sum into the average
    optimizer.step()                         # every rank, applies the averaged gradient

The all_reduce line is the gang gate. NCCL or its equivalent will not return on any rank until every rank has contributed and every rank has received the result. If rank 3 dies during its backward pass, ranks 0 through 7 sit in all_reduce, the NCCL timeout eventually fires (default tens of minutes), the process group raises, and the job exits. Every step of work since the last checkpoint is gone.
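The timeout itself is configurable, and shortening it is the difference between a half-hour hang and a fast, explicit exit. A minimal sketch, assuming PyTorch's torch.distributed with the NCCL backend and a torchrun-style launch; the ten-minute value is illustrative, not a recommendation:

from datetime import timedelta
import torch.distributed as dist

# Shrink the collective timeout so a rank stuck waiting in all_reduce errors out
# in minutes rather than the default tens of minutes. Rank and world size are
# taken from the environment set up by the launcher.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=10),   # illustrative value
)

Depending on the PyTorch version, enforcing the timeout for NCCL collectives may also require async error handling to be enabled (TORCH_NCCL_ASYNC_ERROR_HANDLING). Either way, a shorter timeout does not prevent the gang failure; it only surfaces it sooner.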

Pipeline parallelism is technically less synchronous (different micro-batches occupy different stages at the same instant), but the gang property still holds: a stage failure starves all downstream stages, the rolling pipeline drains in seconds, and the job halts.

What gang failure costs

Three numbers determine the wall-clock cost of a gang failure:

  1. Time since last checkpoint. This is the work that vaporizes. With hourly checkpoints, average loss per failure is 30 minutes.
  2. Restart latency. Time to allocate replacement GPUs, drain the bad node, reload checkpoints, re-form the all-reduce group. With automation, low single-digit minutes; without, 30 minutes to several hours.
  3. Throughput tax of checkpointing. Frequent checkpoints reduce the loss per failure but cost training throughput while writing. The right cadence depends on fleet MTBF and the time a single checkpoint write takes; a first-order rule is sketched after this list.
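One way to make the third trade-off concrete is the Young/Daly first-order approximation, which balances checkpoint write overhead against expected lost work. A minimal sketch; the MTBF and two-minute blocking write time are assumed numbers for illustration, not figures from this page:

import math

def young_daly_interval_minutes(mtbf_hours, checkpoint_write_minutes):
    # First-order optimal checkpoint interval: sqrt(2 * write_cost * MTBF).
    return math.sqrt(2 * checkpoint_write_minutes * mtbf_hours * 60)

# Assumed: one failure every 48 hours, 2-minute blocking checkpoint writes.
print(young_daly_interval_minutes(48, 2))   # ~107 minutes

The formula makes the dependence explicit: frequent checkpoints only pay off when the blocking write cost is small, and asynchronous checkpointing that shrinks the stall to seconds pushes the optimal interval down with it.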

A 1024-GPU job with hourly checkpoints, 5-minute restart latency, and one failure every two days loses roughly 35 minutes per failure event. Over a 30-day run with 15 failures, that is ~9 hours of fleet time. Cutting checkpoints to every 15 minutes reduces the loss to 12.5 minutes per event and ~3 hours over 30 days; the difference of 6 hours of fleet time at 1024 H100s is many thousands of dollars and the cost of the extra I/O traffic is comparatively small.
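The arithmetic is easy to reproduce and to rerun with your own fleet numbers. A minimal sketch using the figures from this paragraph (1024 GPUs, 5-minute restarts, 15 failures over 30 days):

GPUS = 1024
RESTART_MIN = 5
FAILURES_PER_RUN = 15           # one failure every two days over a 30-day run

def lost_minutes_per_failure(checkpoint_interval_min):
    # A failure lands mid-interval on average: half an interval of work is lost,
    # plus the restart latency.
    return checkpoint_interval_min / 2 + RESTART_MIN

for interval in (60, 15):
    per_failure = lost_minutes_per_failure(interval)
    fleet_hours = per_failure * FAILURES_PER_RUN / 60
    print(f"{interval:>2}-min checkpoints: {per_failure:.1f} min per failure, "
          f"{fleet_hours:.1f} h of fleet time, {fleet_hours * GPUS:,.0f} GPU-hours")

The two cadences come out to roughly 8.8 and 3.1 hours of fleet time; the gap of about 6 hours is close to 5,800 GPU-hours at this scale.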

Mitigations

Three categories of mitigation change the gang failure equation:

  • Smaller gangs. Tensor-parallel groups of 4 or 8 GPUs each form their own gang. A failure inside one group requires restarting that group, not the whole job. Pipeline-parallel stages do the same. The smaller the per-gang count, the lower the per-failure cost, at the price of more inter-gang coordination (a process-group sketch follows this list).
  • Hot spares with state replication. Keep a small replicated rank that mirrors a primary's state. On failure, promote the spare and continue. This is expensive (each spare halves throughput in its slot) and only practical for workloads where the cost of full restart exceeds the cost of replication.
  • Async or partial-sync algorithms. Recent research (Async DiLoCo, local-SGD variants) tolerates rank loss without immediate restart by allowing periodic divergence between groups. Operationally cleaner; still less common than fully-synchronous SGD/Adam at frontier scale.
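For the first mitigation above, the mechanical change is scoping collectives to a subgroup instead of the default world group, so each collective only blocks that subgroup's members. A minimal sketch, assuming PyTorch's torch.distributed, a world size that is a multiple of the gang size, and an 8-rank gang chosen purely for illustration:

import torch
import torch.distributed as dist

# Assumes init_process_group() has already run (e.g. via torchrun) and the
# local CUDA device has been selected.
world = dist.get_world_size()
rank = dist.get_rank()
GANG_SIZE = 8                                   # illustrative tensor-parallel group size

# Partition ranks into gangs: [0..7], [8..15], ...
# new_group() is collective: every rank must create every group, in the same order.
gangs = [dist.new_group(ranks=list(range(start, start + GANG_SIZE)))
         for start in range(0, world, GANG_SIZE)]
my_gang = gangs[rank // GANG_SIZE]

# This all-reduce blocks only the 8 ranks of my_gang; ranks outside the gang
# are not participants and cannot stall it.
shard = torch.ones(1, device="cuda")
dist.all_reduce(shard, group=my_gang)

Scoping the collective is only half the work; the scheduler still has to be able to restart one gang's processes without tearing down the rest, but the blocking behaviour of each all-reduce is now contained to its own gang.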

Operational rule

The simplest rule is the most useful: design checkpoint cadence around expected failure cadence, not around what feels comfortable. If fleet MTBF at your scale is 49 hours, failures are routine, not hypothetical; when one lands mid-interval, hourly checkpoints throw away 30 minutes of work on average, and 15-minute checkpoints throw away 7.5. Pick the trade explicitly; do not let the default settle by accident.

The cheapest gang failure is the one where the fleet was checkpointed five minutes before. The most expensive is the one where it was checkpointed two hours before.


Updated 2026-05-09