Drain and Replace

The standard remediation runbook for a degraded GPU: checkpoint, drain in-flight work, cordon the node, activate a hot spare, restart. The faster this runs, the less the fleet idles.
Manual response: ~65 min · Automated runbook: ~95 sec · Savings: ~40x wall time

When a GPU starts misbehaving, every minute it stays in the running job is a minute the entire job runs degraded. Drain and replace is the runbook that gets it out. The procedure is mechanical. The cost of getting it wrong is measured in stalled GPU-hours.

The five phases

A clean drain and replace is five steps, sketched in code after the list:

  1. Detect. A signal from the stragglers and blast radius channel: DCGM ECC counter spike, an Xid event in dmesg, or step-time variance crossing a P99 threshold.
  2. Checkpoint. The job saves its current state. With sharded checkpointing this writes in parallel across all ranks; without it, this is the longest single phase.
  3. Drain. The scheduler stops dispatching new work to the failing rank, ranks complete their current step, and the bad node is cordoned.
  4. Replace. A hot spare is promoted into the job's allocation, or the bad GPU is physically swapped (cold restart only). Hot spare promotion is fast; physical swap is the dominant cost in manual workflows.
  5. Restart. The job reloads the checkpoint, every rank rejoins the collective, and training resumes.
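
The phases map cleanly onto code. Below is a minimal orchestration sketch in Python; every helper is an illustrative stub standing in for whatever monitoring, scheduler, and launcher APIs a given fleet actually exposes, so the structure is the point rather than the names.

def detect(metrics, step_time_threshold=1.05):
    """Phase 1: hosts whose ECC errors or step-time ratio are out of band."""
    return [host for host, m in metrics.items()
            if m["ecc_errors"] > 0 or m["step_time_ratio"] > step_time_threshold]

def checkpoint(job_id):
    """Phase 2: trigger a checkpoint; returns its shared-storage path (stub)."""
    return f"/shared/checkpoints/{job_id}/latest"

def drain(host):
    """Phase 3: cordon the host so the scheduler stops placing work on it (stub)."""
    print(f"draining {host}")

def promote(spares):
    """Phase 4: take a warm spare out of the pool -- a metadata change only."""
    return spares.pop()

def restart(job_id, ckpt, spare):
    """Phase 5: restart from the checkpoint with the spare in the rank map (stub)."""
    print(f"restarting {job_id} from {ckpt} on {spare}")

def drain_and_replace(job_id, metrics, spares):
    for bad_host in detect(metrics):
        ckpt = checkpoint(job_id)
        drain(bad_host)
        restart(job_id, ckpt, promote(spares))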

Manual vs automated wall time

The procedure is the same; the wall-clock cost is not.

[Figure: Time to recovery. The manual-response lane (human in the loop: page on-call, triage, checkpoint, drain, swap GPU, restart) spans roughly 65 minutes of wall time; the automated-runbook lane (agent in the loop) spans roughly 95 seconds. Band widths are proportional to total wall time within each lane.]

The manual lane is dominated by human latency: paging the on-call, triage, deciding whether the signal is real, and then getting hands on the hardware for a physical swap. Best case in a well-run organization is roughly an hour. Worst case spans shifts, especially if the fault occurs overnight.

The automated lane runs the same five phases without humans in the loop. Detection is continuous. Checkpoint and drain are scripted. Promotion to a hot spare is an API call. Restart is the scheduler's normal restart-from-checkpoint path. End to end is under two minutes when the spare is warm.

The 40x ratio is what makes automation worth the engineering investment. The gap comes not from any single step being slower when done by hand, but from each handoff between human steps adding minutes that the entire fleet pays for.

Hot spare strategy

A hot spare is a GPU that is already powered, already in the cluster's allocation pool, and already running a verified driver and firmware build. Promoting it into a job is a metadata change: the scheduler updates the rank-to-host map, the new rank loads the checkpoint, the all-reduce group re-forms.
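
In code, that promotion amounts to rewriting one mapping. A minimal sketch, assuming the scheduler's rank-to-host map is available as a plain dictionary (the function and host names are illustrative, not any particular scheduler's API):

def remap_ranks(rank_to_host, bad_host, spare_host):
    """Return a new rank-to-host map with the failing host swapped for the spare."""
    return {rank: (spare_host if host == bad_host else host)
            for rank, host in rank_to_host.items()}

ranks = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-d"}
print(remap_ranks(ranks, bad_host="node-d", spare_host="spare-01"))
# {0: 'node-a', 1: 'node-b', 2: 'node-c', 3: 'spare-01'}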

The alternative, a cold spare, requires bring-up: power-on, driver load, NVLink topology check, NCCL warmup. Cold start adds 5 to 15 minutes per GPU. For at-scale fleets the math is straightforward:

hot_spare_overhead  = continuous power and rack space, no time cost on swap
cold_spare_overhead = no continuous cost, +5 to 15 min per swap event
break_even          = cold cost per swap × swaps per month
                      vs hot cost per month per spare GPU

Most fleets running 1024+ GPUs keep at least 5% of capacity as hot spares. The recurring power cost is small relative to the time cost of cold restart on every degradation event.
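
A worked version of that break-even, using the $3 per H100-hour rate, the 5-to-15-minute cold start, and the 5% spare fraction from above; the monthly carrying cost per spare is an illustrative assumption, not a measurement:

FLEET_GPUS        = 1024
SPARE_FRACTION    = 0.05       # ~5% of capacity held as hot spares
GPU_HOUR_USD      = 3.0        # approximate H100 cloud rate
COLD_START_HOURS  = 10 / 60    # midpoint of the 5-15 minute bring-up window
SPARE_MONTHLY_USD = 75.0       # assumed power + rack cost per hot spare per month

# A cold swap stalls the entire job for the bring-up window.
cold_cost_per_swap = COLD_START_HOURS * FLEET_GPUS * GPU_HOUR_USD
hot_cost_per_month = FLEET_GPUS * SPARE_FRACTION * SPARE_MONTHLY_USD

print(f"cold swap:          ${cold_cost_per_swap:,.0f} per event")
print(f"hot spare carrying: ${hot_cost_per_month:,.0f} per month")
print(f"break-even at:      {hot_cost_per_month / cold_cost_per_swap:.1f} swaps per month")

Whether a given fleet clears that threshold depends on how often its GPUs actually degrade; the numbers above are only there to show the shape of the comparison.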

Coordinating with the scheduler

Drain and replace requires the scheduler's cooperation. On Slurm, the runbook calls scontrol update NodeName=<host> State=DRAIN Reason="atlas:drain", then requeues the job once the spare is allocated. On Kubernetes, it cordons the node with kubectl cordon, evicts the pod, and lets the controller schedule a fresh pod onto the spare. Both paths require the job's checkpoint to be reachable from any node in the allocation, which means a parallel filesystem under the run.
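
A thin wrapper around that drain step, assuming the runbook shells out to the standard CLIs (the function names are illustrative; the scontrol and kubectl invocations are the ones quoted above):

import subprocess

def drain_slurm(host, reason="atlas:drain"):
    """Mark the node DRAIN so Slurm stops dispatching new steps to it."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={host}", "State=DRAIN", f"Reason={reason}"],
        check=True,
    )

def cordon_k8s(node):
    """Cordon the Kubernetes node; pod eviction and rescheduling follow separately."""
    subprocess.run(["kubectl", "cordon", node], check=True)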

A common failure mode: the runbook drains the node but the checkpoint lives only on local NVMe of the original host. The replacement rank cannot find the state. This is why checkpoints belong on shared storage, not on per-node disks.
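
A cheap guard against that failure mode is to refuse to drain until the checkpoint path is confirmed to live on shared storage. A sketch, assuming the parallel filesystem is mounted under a known prefix (the prefixes here are illustrative):

from pathlib import Path

SHARED_PREFIXES = ("/shared/", "/lustre/", "/gpfs/")   # illustrative mount points

def checkpoint_is_reachable(ckpt_path):
    """True only if the checkpoint sits on shared storage and actually exists."""
    path = Path(ckpt_path).resolve()
    return str(path).startswith(SHARED_PREFIXES) and path.exists()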

What "running flaky GPUs" actually costs

Skipping drain and replace, leaving a degraded GPU in the job because it "still works", is a real and common operational choice. It is also wrong at scale. A 1024-GPU job slowed 5% by a single straggler wastes 51.2 GPU-hours per hour of training. At cloud rates of roughly $3 per H100-hour, that is roughly $150 per hour, paid in idle fleet time, to keep one bad GPU in service. Drain and replace, even at the manual one-hour wall time, pays back inside an hour of degraded operation.
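
The arithmetic behind those figures, parameterized so the same calculation works for any fleet size and slowdown (the inputs are the ones from the paragraph above):

FLEET_GPUS   = 1024
SLOWDOWN     = 0.05     # one straggler dragging a synchronous job 5% slower
GPU_HOUR_USD = 3.0      # approximate H100 cloud rate

wasted_gpu_hours = FLEET_GPUS * SLOWDOWN              # 51.2 GPU-hours per hour
wasted_dollars   = wasted_gpu_hours * GPU_HOUR_USD    # ~$154 per hour

print(f"{wasted_gpu_hours:.1f} GPU-hours lost per hour of training")
print(f"~${wasted_dollars:.0f} per hour to keep one bad GPU in service")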

The decision to drain is almost never about whether to drain. It is about how fast.

Updated 2026-05-09