Operations

Driver Reload

Reloading nvidia.ko via rmmod/modprobe to clear driver state without a full reboot.

What it is

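A driver reload unloads and reloads the NVIDIA GPU kernel driver (nvidia.ko) and its dependent modules without rebooting the host, which reinitializes every GPU on the node at once. The procedure requires a strict sequence: terminate all GPU processes, disable persistence mode (nvidia-smi -pm 0) and stop the nvidia-persistenced daemon if it is running, unload modules in reverse dependency order (rmmod nvidia-uvm, rmmod nvidia-drm, rmmod nvidia-modeset, rmmod nvidia), then reload via modprobe nvidia. rmmod refuses to remove a module that is still in use, so every remaining GPU client must be gone before the unload can succeed. modprobe nvidia typically loads only the core module; nvidia-uvm, which CUDA requires, may need to be loaded separately (modprobe nvidia-uvm) unless the system loads it on demand. Re-enable persistence mode (nvidia-smi -pm 1) after the reload to avoid cold-start latency on the first CUDA call.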
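The sequence above can be sketched as a small plan builder. This is a minimal illustration, not a vetted tool: the function names (loaded_modules, reload_plan, run_plan) are hypothetical, the systemd unit name nvidia-persistenced is an assumption about how the daemon is managed on the node, and the sketch assumes GPU processes have already been terminated. It builds the command list from /proc/modules-style text and executes nothing unless asked to.

```python
import subprocess

# Unload order from the text: dependent modules first, core module last.
# /proc/modules reports module names with underscores.
UNLOAD_ORDER = ["nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nvidia"]


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {ln.split()[0] for ln in proc_modules_text.splitlines() if ln.strip()}


def reload_plan(loaded, stop_daemon=False):
    """Build the ordered command list for a driver reload (nothing is run)."""
    plan = [["nvidia-smi", "-pm", "0"]]
    if stop_daemon:
        # Assumption: the persistence daemon is a systemd unit with this name.
        plan.append(["systemctl", "stop", "nvidia-persistenced"])
    # Only unload modules that are actually present on this node.
    plan += [["rmmod", m] for m in UNLOAD_ORDER if m in loaded]
    plan.append(["modprobe", "nvidia"])
    plan.append(["modprobe", "nvidia_uvm"])
    if stop_daemon:
        plan.append(["systemctl", "start", "nvidia-persistenced"])
    plan.append(["nvidia-smi", "-pm", "1"])
    return plan


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure
```

On a node, the plan could be built with reload_plan(loaded_modules(open("/proc/modules").read())) and reviewed with run_plan(plan) before running it for real with dry_run=False. Keeping plan construction separate from execution makes the ordering easy to inspect and test without touching the driver.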

Why it matters

Driver reload resolves failure modes that a single-GPU reset cannot address: driver-level state corruption spanning multiple GPUs, a hung nvidia-smi (where nvidia-smi -r is impossible because the tool itself cannot communicate with the driver), and Xid 13 patterns that implicate the driver rather than a single GPU. When nvidia-smi hangs indefinitely, running rmmod/modprobe from the host shell bypasses the deadlocked ioctl interface and restores management capability in 15-30 seconds. The reload clears all driver state and activates pending page retirements on every GPU on the node.
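One way to detect the hung-nvidia-smi case is to run the tool under a timeout. The sketch below is illustrative only: driver_responsive is a hypothetical helper, and the 10-second default is an arbitrary assumption, not a recommended threshold. It deliberately does not wait for the child after killing it, because a process blocked inside a driver call may not exit promptly, and waiting on it would hang the checker too.

```python
import subprocess


def driver_responsive(cmd=("nvidia-smi", "-L"), timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False  # binary missing or not executable
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait: a process stuck in the driver may
        # not die until the driver itself is unloaded or the host reboots.
        proc.kill()
        return False
```

A False result here indicates only that the tool did not answer in time; whether a driver reload is the right next step remains an operator or policy decision.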

How to monitor

After a driver reload, confirm that all expected GPUs enumerate via nvidia-smi and run dcgmi diag -r 1 against them. Verify that the GPU count, NVLink topology (nvidia-smi topo -m), and PCIe link widths match the node's expected configuration before returning the node to the scheduling pool. Factryze's SRE Agent uses driver reload as the second step in its remediation escalation ladder, automatically handling module unload sequencing and validating the full GPU inventory after reload.
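The GPU count and PCIe width checks can be sketched as a comparison against the node's expected configuration. This is a minimal example under stated assumptions: check_inventory and query_inventory are hypothetical helpers, the expected values are supplied by the caller, and the input is the CSV produced by nvidia-smi --query-gpu=index,pcie.link.width.current --format=csv,noheader. NVLink topology and the DCGM diagnostic are not covered here.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pcie.link.width.current",
    "--format=csv,noheader",
]


def query_inventory(timeout_s=30.0):
    """Run the nvidia-smi query and return its raw CSV output."""
    result = subprocess.run(
        QUERY, capture_output=True, text=True, check=True, timeout=timeout_s
    )
    return result.stdout


def check_inventory(csv_text, expected_gpus, expected_width):
    """Return a list of problems; an empty list means the node matches."""
    problems = []
    rows = [ln.split(",") for ln in csv_text.strip().splitlines() if ln.strip()]
    if len(rows) != expected_gpus:
        problems.append(f"expected {expected_gpus} GPUs, found {len(rows)}")
    for row in rows:
        index, width = row[0].strip(), row[1].strip()
        if width != str(expected_width):
            problems.append(
                f"GPU {index}: PCIe width x{width}, expected x{expected_width}"
            )
    return problems
```

For example, check_inventory(query_inventory(), 8, 16) on a hypothetical 8-GPU node with x16 links would return an empty list when the inventory is intact, and the node could be held out of the pool whenever the list is non-empty.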

