Errors & Failures

Driver Crash

GPU kernel driver panic or hang requiring intervention to recover.

What it is

A driver crash is a failure condition in which the NVIDIA GPU kernel driver (the nvidia.ko module) encounters a fatal error, hangs, or panics. Symptoms include Xid 31 (GPU memory page fault), Xid 13 (graphics engine exception), complete GPU unresponsiveness, or a full kernel panic that takes down the host. Both Xids can also be raised by a faulty application, so treat them as signals to correlate with the other symptoms rather than as proof of a driver crash on their own. Root causes range from firmware bugs and hardware faults to driver version incompatibilities.
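To make the Xid symptoms concrete, here is a minimal Python sketch that extracts Xid events from kernel log lines. It assumes the common `NVRM: Xid (PCI:<bus id>): <number>, ...` line shape; the exact text varies between driver versions, so the pattern may need adjusting. The names `XidEvent`, `parse_xid_events`, and `WATCHED_XIDS` are illustrative and not part of any NVIDIA tool.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 31, pid=..., ..."
# Some driver versions omit the "PCI:" prefix, so it is optional here.
XID_RE = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)

# The Xid codes named in this entry; extend the set as needed.
WATCHED_XIDS = {13, 31}


class XidEvent(NamedTuple):
    pci_bus: str  # PCI bus id of the reporting GPU, as printed in the log
    xid: int      # numeric Xid code


def parse_xid_events(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield one XidEvent per kernel log line that contains an Xid report."""
    for line in lines:
        match = XID_RE.search(line)
        if match:
            yield XidEvent(match.group("bus"), int(match.group("xid")))


def watched_events(lines: Iterable[str]) -> list:
    """Return only the events whose Xid code is in WATCHED_XIDS."""
    return [e for e in parse_xid_events(lines) if e.xid in WATCHED_XIDS]
```

The input can come from any source of kernel log text, for example the output of `dmesg` split into lines. Reading the kernel log may require elevated privileges on some systems.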

Why it matters

A driver crash takes down every GPU workload on the affected node at once, not just the workloads on a single device. A full kernel panic brings down the entire host and can corrupt checkpoints that are being written at that moment. Crashes that recur on the same node after a driver reload strongly indicate an underlying hardware failure that needs physical investigation.

How to monitor

Watch dmesg for Xid 31 and Xid 13 events, and monitor for hung nvidia-smi processes, which indicate that the driver ioctl interface is deadlocked. When DCGM telemetry timestamps go stale on all GPUs of a node at the same time, the cause is more likely a driver-level hang than a per-device failure. Factryze detects driver-level anomalies by tracking DCGM collection continuity and escalates through driver reload and cold reboot as needed.
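The two hang checks described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not how Factryze or DCGM implement detection: the function names, the 30 second and 60 second thresholds, and the per-GPU timestamp mapping are all hypothetical, and how the timestamps are obtained from DCGM is left to the caller.

```python
import subprocess
from typing import Mapping, Sequence


def nvidia_smi_responds(
    timeout_s: float = 30.0,
    cmd: Sequence[str] = ("nvidia-smi", "-L"),
) -> bool:
    """Return True if the probe command exits with status 0 within timeout_s.

    Popen is used instead of subprocess.run because run() waits for the
    child after killing it on timeout, and a process stuck inside a
    deadlocked driver call may never exit.
    """
    try:
        proc = subprocess.Popen(
            list(cmd), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False  # binary missing or not executable
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort only; deliberately do not wait for exit
        return False


def classify_staleness(
    last_sample_ts: Mapping[int, float],
    now: float,
    max_age_s: float = 60.0,
) -> str:
    """Classify telemetry freshness from per-GPU last-sample timestamps.

    last_sample_ts maps a GPU index to the Unix time of its newest sample
    (hypothetical input; the collection path is an assumption).
    """
    stale = {gpu for gpu, ts in last_sample_ts.items() if now - ts > max_age_s}
    if not stale:
        return "ok"
    if len(stale) == len(last_sample_ts):
        # Every GPU went quiet together: suspect the driver, not one device.
        # On a single-GPU node this case cannot be told apart from a
        # device failure.
        return "driver_suspect"
    return "device_suspect"
```

A probe that times out together with a `driver_suspect` result is a stronger signal than either check alone, which is why the sketch keeps them as separate functions that a caller can combine.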

Figure: Driver Crash - GPU Driver Crash and Recovery Sequence
