Node Draining
Gracefully removing a node from scheduling via kubectl drain or Slurm DRAIN state.
What it is
Node draining is the process of gracefully removing a compute node from the active scheduling pool for maintenance, hardware replacement, or firmware updates. In Slurm, scontrol update NodeName=gpu-node-042 State=DRAIN Reason='ECC degradation' prevents new job allocations while running jobs finish; the node transitions to DRAINED once all of its jobs complete. In Kubernetes, kubectl drain cordons the node and evicts its pods while honoring PodDisruptionBudgets and each pod's terminationGracePeriodSeconds.
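A minimal end-to-end drain sequence for both schedulers is sketched below; the node name is a placeholder, and flags such as --delete-emptydir-data should be checked against your Slurm and kubectl versions.

```bash
# Slurm: stop new allocations; jobs already running on the node finish normally
scontrol update NodeName=gpu-node-042 State=DRAIN Reason="ECC degradation"

# Confirm the drain reason and watch for the DRAINED state once jobs complete
sinfo -R --nodes=gpu-node-042

# Kubernetes: mark the node unschedulable, then evict its pods
# (honors PodDisruptionBudgets and terminationGracePeriodSeconds)
kubectl cordon gpu-node-042
kubectl drain gpu-node-042 --ignore-daemonsets --delete-emptydir-data --timeout=30m

# After maintenance, return the node to the scheduling pool
scontrol update NodeName=gpu-node-042 State=RESUME
kubectl uncordon gpu-node-042
```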
Why it matters
Proactive draining before a GPU fails avoids mid-step crashes that waste all uncheckpointed progress across every GPU in the affected distributed training job. When a node with an accelerating single-bit ECC error (SBE) rate is drained at the job's next checkpoint window (typically every 30-60 minutes for LLM training), the cost is a single checkpoint's overhead; waiting for a double-bit error (DBE) to crash the job can waste up to an hour of compute across 256 GPUs. The drain reason field is operationally critical for tracking why each node was removed and for auditing RMA escalations.
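A back-of-envelope comparison makes the trade-off concrete; the 5-minute checkpoint overhead below is a hypothetical figure, so substitute your own job's numbers:

```bash
# Cost of a mid-step crash vs. a drain timed to the next checkpoint
GPUS=256
MINUTES_LOST=60          # uncheckpointed progress wiped out by a DBE crash
CHECKPOINT_OVERHEAD=5    # hypothetical extra minutes for one checkpoint + reschedule

echo "Crash cost: $(( GPUS * MINUTES_LOST / 60 )) GPU-hours"         # 256 GPU-hours
echo "Drain cost: $(( GPUS * CHECKPOINT_OVERHEAD / 60 )) GPU-hours"  # ~21 GPU-hours
```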
How to monitor
Monitor DCGM_FI_DEV_ECC_SBE_VOL_TOTAL rate trends, NVLink error counters (DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL), and Xid event frequency per node to identify drain candidates. Track Slurm DRAINED node duration and drain reasons to measure remediation throughput. Factryze automates drain decisions by correlating ECC trends, NVLink error rates, thermal anomalies, and Xid patterns, coordinating drains with job checkpoint schedules to minimize wasted GPU-hours.
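From a shell, a quick way to inspect drain candidates and audit drained nodes might look like the following, assuming DCGM's dcgmi tool is installed; the numeric field IDs are for recent DCGM releases, so confirm them with dcgmi dmon -l:

```bash
# Sample volatile ECC SBE counts and NVLink CRC FLIT errors every 10 s
# 610 = DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, 409 = DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
dcgmi dmon -e 610,409 -d 10000

# Check the kernel log for recent Xid events on this node
dmesg --ctime | grep -i xid

# Audit drained nodes: nodelist, state, drain reason, and when the reason was set
sinfo -R -o "%20N %10T %30E %H"
```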
Related terms
Slurm: Open-source HPC workload manager scheduling GPU cluster jobs via srun, sbatch, and squeue.
Job preemption: Forcibly stopping lower-priority GPU jobs with checkpoint/restart to free resources.
Rolling reboot: Sequential node restarts for kernel updates and driver upgrades while maintaining cluster capacity.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.