Rolling Restart
Sequential node restarts for kernel updates and driver upgrades while maintaining cluster capacity.
What it is
A rolling restart reboots cluster nodes sequentially or in controlled batches rather than all at once, keeping a target percentage of capacity available to production workloads throughout. Use cases include Linux kernel updates, NVIDIA driver upgrades (e.g., 535.xxx to 550.xxx), CUDA toolkit updates, firmware flashes that take effect only after a cold reboot, and periodic clearing of accumulated GPU state. Each node follows a strict sequence: cordon, drain workloads, apply update, reboot, validate via DCGM Level 2 diagnostics, confirm versions, then uncordon.
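The per-node sequence can be sketched as an ordered list of steps that stops at the first failure. This is a minimal illustration, not any orchestrator's actual implementation: the step names and the `actions` callables are hypothetical hooks that a real deployment would back with its own scheduler, reboot, and diagnostic tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Step order taken from the sequence described above. The names are labels
# chosen for this sketch, not identifiers from any real tool.
STEPS = [
    "cordon",
    "drain",
    "apply_update",
    "reboot",
    "dcgm_level2",
    "confirm_versions",
    "uncordon",
]


@dataclass
class NodeResult:
    node: str
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.failed_step is None


def restart_node(node: str, actions: Dict[str, Callable[[str], bool]]) -> NodeResult:
    """Run each step in order; each action returns True on success.

    The loop stops at the first failing step. Because "uncordon" is the
    last step, a node that fails anywhere earlier stays cordoned.
    """
    result = NodeResult(node)
    for step in STEPS:
        if not actions[step](node):
            result.failed_step = step
            break
        result.completed.append(step)
    return result
```

Because uncordon comes last, a node that fails DCGM validation or the version check never returns to the scheduler, and the caller can read `failed_step` to decide whether to quarantine it.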
Why it matters
A rolling restart without automated validation can propagate a bad driver version or broken firmware across the entire fleet, so pausing the rollout on a post-reboot failure is critical. A node that fails post-reboot DCGM validation must be quarantined immediately so that no jobs are scheduled onto it. Without a rolling procedure, a full-cluster driver upgrade requires 3-5 days of manual work; with automation restarting 10% of nodes at a time, a 1,024-node cluster completes in 8-12 hours with 7,372+ GPUs continuously available for training (90% of 8,192 GPUs, which implies 8 GPUs per node).
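The capacity figure can be checked with a few lines of arithmetic. The sketch below assumes 8 GPUs per node (the value that makes 90% of the fleet come out to roughly 7,372 GPUs) and rounds the batch down to whole nodes; the function name and both of those choices are assumptions of this example.

```python
import math


def rollout_plan(total_nodes: int, batch_fraction: float, gpus_per_node: int):
    """Return (batch_size, batch_count, min_available_gpus).

    batch_size is rounded down to whole nodes (at least 1), so the GPUs
    left available during any batch never drop below the target.
    """
    batch_size = max(1, math.floor(total_nodes * batch_fraction))
    batch_count = math.ceil(total_nodes / batch_size)
    min_available_gpus = (total_nodes - batch_size) * gpus_per_node
    return batch_size, batch_count, min_available_gpus


print(rollout_plan(1024, 0.10, 8))  # -> (102, 11, 7376)
```

With whole-node batches of 102, 922 nodes (7,376 GPUs) stay in service, slightly above the 7,372 figure obtained by taking 90% of 8,192 directly.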
How to monitor
Track DCGM Level 2 diagnostic pass/fail rates per batch during rolling restarts, and auto-pause the rollout on failures. Before uncordoning each node, verify the driver version (nvidia-smi --query-gpu=driver_version --format=csv,noheader) and the VBIOS firmware version (nvidia-smi --query-gpu=vbios_version --format=csv,noheader). Factryze orchestrates rolling restarts with intelligent node ordering, integrates with checkpoint schedules to minimize wasted GPU-hours, and automatically rolls back nodes that fail post-update health validation.
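A minimal sketch of the pause-on-failure logic and the version check follows. It is not Factryze's implementation: `run_rollout`, `versions_match`, the `validate` hook, and the `max_failures_per_batch` threshold are hypothetical names for this example. The parser assumes the comma-separated text produced by `nvidia-smi --query-gpu=driver_version,vbios_version --format=csv,noheader`, with one line per GPU.

```python
from typing import Callable, Dict, List


def versions_match(csv_output: str, expected_driver: str, expected_vbios: str) -> bool:
    """True only if every GPU row reports the expected driver and VBIOS.

    Empty output counts as a failure, since it means no GPU was reported.
    """
    rows = [line.split(",") for line in csv_output.splitlines() if line.strip()]
    if not rows:
        return False
    return all(
        len(row) == 2
        and row[0].strip() == expected_driver
        and row[1].strip() == expected_vbios
        for row in rows
    )


def run_rollout(
    nodes: List[str],
    batch_size: int,
    validate: Callable[[str], bool],
    max_failures_per_batch: int = 0,
) -> Dict[str, object]:
    """Process nodes in batches; pause when a batch exceeds the failure budget.

    validate(node) is a hypothetical hook standing in for the post-reboot
    checks (DCGM Level 2 plus the version comparison above).
    """
    healthy: List[str] = []
    quarantined: List[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        failed = [n for n in batch if not validate(n)]
        quarantined.extend(failed)                            # keep failed nodes out of scheduling
        healthy.extend(n for n in batch if n not in failed)   # safe to uncordon
        if len(failed) > max_failures_per_batch:
            return {
                "status": "paused",
                "healthy": healthy,
                "quarantined": quarantined,
                "remaining": nodes[start + batch_size:],      # untouched nodes keep the old version
            }
    return {"status": "complete", "healthy": healthy,
            "quarantined": quarantined, "remaining": []}
```

Pausing leaves the remaining nodes on the known-good version, which limits a bad driver or firmware image to a single batch. The default threshold of zero is the most conservative choice; a larger fleet might tolerate isolated hardware failures before halting.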
Related terms
Gracefully removing a node from scheduling via kubectl drain or Slurm DRAIN state.
Reloading nvidia.ko via rmmod/modprobe to clear driver state without a full reboot.
DCGM diagnostic tests (Level 1/2/3) validating GPU hardware integrity between jobs.