GPU Fallen Off Bus
Xid 79 error: GPU completely disconnects from the PCIe bus.
What it is
GPU fallen off bus is the failure condition reported as Xid 79, where the GPU becomes completely unresponsive to the host system over the PCIe bus. Root causes include PCIe link instability, power delivery issues, thermal damage, or hardware defects in the GPU or motherboard slot.
Why it matters
This is one of the most disruptive GPU failure modes: all running workloads on the affected device are killed instantly and nvidia-smi can no longer communicate with the GPU. Software-level reset via nvidia-smi -r is impossible because the device is unreachable on the bus. Repeated occurrences indicate a hardware fault requiring GPU or riser card replacement.
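The "repeated occurrences" rule can be turned into a simple check. The sketch below is illustrative only: it assumes you already collect Xid 79 events as (timestamp, PCI bus ID) pairs, and the function name `repeat_offenders`, the 7-day window, and the threshold of 2 are hypothetical choices, not values from any vendor guidance.

```python
from collections import Counter
from datetime import datetime, timedelta


def repeat_offenders(events, window=timedelta(days=7), threshold=2, now=None):
    """Return PCI bus IDs with at least `threshold` Xid 79 events inside `window`.

    `events` is an iterable of (timestamp, pci_bus_id) tuples.
    The window and threshold defaults are assumptions; tune them per fleet.
    """
    now = now or datetime.now()
    # Count only events that fall inside the look-back window.
    recent = Counter(bus_id for ts, bus_id in events if now - ts <= window)
    return sorted(bus_id for bus_id, count in recent.items() if count >= threshold)
```

A device returned by this check is a candidate for hardware inspection (GPU, riser card, or slot) rather than another reboot.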
How to monitor
Watch dmesg for Xid 79 events and confirm with DCGM: stale or missing telemetry from a specific GPU UUID while its peers continue reporting is a secondary signal. nvidia-smi typically reports an error for the device (for example "Unknown Error" or "GPU is lost") or fails to enumerate it entirely. Because a software reset usually cannot reach a device that has left the bus, recovery generally requires a reboot, and often a cold one; Factryze automatically escalates through GPU reset, driver reload, and BMC-level power cycle when Xid 79 is detected.
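Both signals can be checked with a few lines of code. The following is a minimal sketch, not a production monitor. It assumes the kernel log uses the common "NVRM: Xid (PCI:<address>): <code>, ..." line format (details vary by driver version), and that you keep a per-UUID timestamp of the last telemetry sample. The names `find_xid` and `stale_gpus` and the 30-second staleness limit are hypothetical.

```python
import re

# Assumed log format, e.g.:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def find_xid(log_text, code=79):
    """Return PCI addresses that logged the given Xid code, in log order, deduplicated."""
    seen = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match and int(match.group(2)) == code and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def stale_gpus(last_seen, now, max_age=30.0):
    """Return UUIDs whose last sample is older than `max_age` seconds.

    `last_seen` maps GPU UUID -> timestamp (seconds) of the latest sample.
    If every GPU is stale, the collector itself is the likelier culprit,
    so nothing is flagged.
    """
    stale = [uuid for uuid, ts in last_seen.items() if now - ts > max_age]
    peers_still_reporting = len(stale) < len(last_seen)
    return sorted(stale) if peers_still_reporting else []
```

In practice `log_text` would come from the output of dmesg (which may require elevated privileges), and `last_seen` from whatever telemetry pipeline records DCGM samples.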
Related terms
Xid errors: NVIDIA kernel-logged error codes identifying specific GPU failure modes.
PCIe: The host bus connecting GPUs to CPUs and other system devices.
GPU reset: Hardware GPU reset via nvidia-smi -r with escalation to ipmitool or cold reboot.