Operations

Rolling Restart

Sequential node restarts for kernel updates and driver upgrades while maintaining cluster capacity.

What it is

A rolling restart reboots cluster nodes one at a time or in controlled batches rather than all at once, so that a target percentage of capacity stays available to production workloads throughout. Typical use cases include Linux kernel updates, NVIDIA driver upgrades (e.g., 535.xxx to 550.xxx), CUDA toolkit updates, firmware flashes that take effect only after a cold reboot, and periodic clearing of accumulated GPU state. Each node follows a strict sequence: cordon, drain workloads, apply the update, reboot, validate with DCGM Level 2 diagnostics, confirm the driver and firmware versions, then uncordon.
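The per-node sequence above can be sketched as an ordered list of commands that stops at the first failure. This is a minimal sketch, not a definitive implementation. It assumes a Kubernetes cluster (the source names cordon, drain, and uncordon but not the scheduler), SSH access to each node, and that `dcgmi diag -r 2` exits nonzero when a diagnostic fails. The script paths `/opt/maint/apply-update.sh` and `/opt/maint/reboot-and-wait.sh` are hypothetical placeholders for site-specific tooling.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

Runner = Callable[[Sequence[str]], int]


def shell_runner(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(list(cmd), check=False).returncode


def node_steps(node: str) -> List[Tuple[str, List[str]]]:
    """Ordered steps for one node. Both /opt/maint paths are hypothetical."""
    return [
        ("cordon", ["kubectl", "cordon", node]),
        ("drain", ["kubectl", "drain", node,
                   "--ignore-daemonsets", "--delete-emptydir-data"]),
        # Hypothetical script that installs the new kernel/driver/firmware.
        ("update", ["ssh", node, "sudo", "/opt/maint/apply-update.sh"]),
        # Hypothetical script that reboots the node and waits for it to return.
        ("reboot", ["/opt/maint/reboot-and-wait.sh", node]),
        ("validate", ["ssh", node, "dcgmi", "diag", "-r", "2"]),
        # Prints the versions; comparing them to expected values is not shown.
        ("confirm", ["ssh", node, "nvidia-smi",
                     "--query-gpu=driver_version,vbios_version",
                     "--format=csv,noheader"]),
        ("uncordon", ["kubectl", "uncordon", node]),
    ]


def restart_node(node: str, run: Runner = shell_runner) -> Optional[str]:
    """Run the steps in order; return the name of the failed step, or None.

    Stopping at the first failure means a node that fails validation is
    never uncordoned, so it stays unschedulable (quarantined).
    """
    for name, cmd in node_steps(node):
        if run(cmd) != 0:
            return name
    return None
```

Calling `restart_node("gpu-node-01")` would run the real commands. Passing a different `run` callable makes the ordering easy to dry-run or test without touching a cluster.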

Why it matters

Without automated validation, a rolling restart can propagate a bad driver version or broken firmware across the entire fleet, so the rollout must pause as soon as a node fails its post-reboot checks. A node that fails post-reboot DCGM validation must be quarantined immediately so that no jobs are scheduled onto it. Without a rolling procedure, a full-cluster driver upgrade requires 3-5 days of manual work. With automation that restarts 10% of nodes in parallel, a 1,024-node cluster completes in 8-12 hours while roughly 90% of its GPUs stay continuously available for training (7,372+ of 8,192, assuming 8 GPUs per node).
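The capacity figures can be checked with a few lines of arithmetic. The sketch below assumes 8 GPUs per node (implied by the GPU count but not stated outright) and rounds the batch size down to whole nodes. The function name `rollout_plan` is illustrative.

```python
import math
from typing import NamedTuple


class Plan(NamedTuple):
    batch_size: int          # nodes restarted in parallel
    batches: int             # number of batches needed to cover the fleet
    min_gpus_available: int  # GPUs still schedulable while a full batch is down


def rollout_plan(total_nodes: int, parallelism: float, gpus_per_node: int) -> Plan:
    """Compute batch size, batch count, and worst-case available GPUs."""
    # Round down so the rollout never takes more than the allowed fraction offline.
    batch_size = max(1, math.floor(total_nodes * parallelism))
    batches = math.ceil(total_nodes / batch_size)
    min_gpus_available = (total_nodes - batch_size) * gpus_per_node
    return Plan(batch_size, batches, min_gpus_available)


plan = rollout_plan(total_nodes=1024, parallelism=0.10, gpus_per_node=8)
print(plan)  # Plan(batch_size=102, batches=11, min_gpus_available=7376)
```

Under these assumptions the rollout takes 11 batches of at most 102 nodes, and at least 7,376 GPUs remain schedulable, which is consistent with the 7,372+ figure. An 8-12 hour total would then imply roughly 44 to 65 minutes per batch.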

How to monitor

Track DCGM Level 2 diagnostic pass/fail rates per batch during a rolling restart and auto-pause the rollout on failures. Before uncordoning each node, verify the driver version (nvidia-smi --query-gpu=driver_version --format=csv,noheader) and the VBIOS firmware version (nvidia-smi --query-gpu=vbios_version --format=csv,noheader). Factryze orchestrates rolling restarts with intelligent node ordering, integrates with checkpoint schedules to minimize wasted GPU-hours, and automatically rolls back nodes that fail post-update health validation.
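The version check and auto-pause logic might look like the sketch below. It is illustrative only and does not describe how Factryze implements these features. It assumes the combined query `nvidia-smi --query-gpu=driver_version,vbios_version --format=csv,noheader`, which prints one comma-separated line per GPU, and that every GPU in a node should report the same expected versions. The names `versions_match` and `rollout` and the `max_failed_per_batch` threshold are hypothetical.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence


def versions_match(csv_output: str, expected_driver: str, expected_vbios: str) -> bool:
    """Return True only if every GPU line reports the expected versions."""
    lines = [line for line in csv_output.splitlines() if line.strip()]
    if not lines:
        return False  # no GPUs reported is treated as a failure
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if fields != [expected_driver, expected_vbios]:
            return False
    return True


class RolloutResult(NamedTuple):
    done: List[str]            # nodes restarted and validated
    quarantined: List[str]     # nodes that failed and stay cordoned
    paused_at: Optional[int]   # index of the batch that triggered a pause


def rollout(batches: Sequence[Sequence[str]],
            restart_node: Callable[[str], bool],
            max_failed_per_batch: int = 0) -> RolloutResult:
    """Process batches in order; pause when a batch exceeds the failure limit."""
    done: List[str] = []
    quarantined: List[str] = []
    for index, batch in enumerate(batches):
        failed = [node for node in batch if not restart_node(node)]
        done.extend(node for node in batch if node not in failed)
        quarantined.extend(failed)
        if len(failed) > max_failed_per_batch:
            # Stop before the next batch so a bad update cannot spread further.
            return RolloutResult(done, quarantined, index)
    return RolloutResult(done, quarantined, None)
```

With the default threshold of zero, a single failed node halts the rollout after its batch finishes, and later batches are never touched until an operator resumes.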

Figure: Rolling Restart - Zero-Downtime Node Updates
