Driver Crash
GPU kernel driver panic or hang requiring intervention to recover.
What it is
A driver crash is a failure condition in which the NVIDIA GPU kernel driver (nvidia.ko) encounters a fatal error, hangs, or panics. Symptoms include Xid 31 (GPU memory page fault), Xid 13 (graphics engine exception), complete GPU unresponsiveness, or a full kernel panic that takes down the host. Root causes range from firmware bugs and hardware faults to driver version incompatibilities.
Why it matters
A driver crash takes down every GPU workload on the affected node at once, not just those on a single device. A full kernel panic brings down the host entirely and can corrupt in-flight checkpoint data. Crashes that recur on the same node after a driver reload strongly indicate an underlying hardware failure that requires physical investigation.
How to monitor
Watch dmesg for Xid 31 and Xid 13 events, and monitor for hung nvidia-smi processes, which indicate that the driver ioctl interface is deadlocked. DCGM telemetry timestamps that go stale on all GPUs of a node at the same time point to a driver-level hang rather than a per-device failure. Factryze detects driver-level anomalies by tracking DCGM collection continuity and escalates through driver reload and cold reboot as needed.
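The three checks above can be sketched in Python. This is a minimal illustration, not Factryze's implementation. The function names, the 30-second staleness threshold, the 10-second probe timeout, and the assumed kernel log shape ("NVRM: Xid (PCI:...): <code>, ...") are assumptions made for the example. The telemetry input is a plain dictionary of last-sample times rather than a real DCGM API call.

```python
import re
import subprocess

# Assumed kernel log shape: "NVRM: Xid (PCI:0000:3b:00): 13, pid=1234, ..."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")
WATCHED_XIDS = frozenset({13, 31})  # the Xid codes named in this entry


def find_xid_events(log_lines, watched=WATCHED_XIDS):
    """Return (pci_address, xid) pairs for watched Xid codes in kernel log lines."""
    events = []
    for line in log_lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) in watched:
            events.append((match.group(1), int(match.group(2))))
    return events


def nvidia_smi_responds(timeout_s=10.0, cmd=("nvidia-smi", "-L")):
    """Return True only if the probe command exits with status 0 within the timeout."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False  # binary missing or not executable
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill without waiting: a process blocked inside a
        # deadlocked driver call may never exit, so waiting could hang too.
        proc.kill()
        return False


def classify_staleness(last_sample_ts, now, max_age_s=30.0):
    """Classify telemetry gaps from a {gpu_id: last_sample_unix_time} mapping."""
    stale = sorted(
        gpu for gpu, ts in last_sample_ts.items() if now - ts > max_age_s
    )
    if not stale:
        return "healthy", stale
    if len(stale) == len(last_sample_ts):
        # Every GPU went quiet together: suspect the driver, not one device.
        return "driver_hang_suspected", stale
    return "per_device_issue", stale
```

In practice, find_xid_events could be fed the lines of dmesg output, and classify_staleness the most recent sample time per GPU from whatever collector is in use. On a single-GPU node the all-stale rule cannot tell a driver hang from a device failure, so the nvidia-smi probe is the more useful signal there.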
Related terms
Hardware GPU reset via nvidia-smi -r with escalation to ipmitool or cold reboot.
Reloading nvidia.ko via rmmod/modprobe to clear driver state without a full reboot.
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.