Xid Errors
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.
What it is
Xid errors are numeric diagnostic codes that the NVIDIA GPU kernel driver writes to the kernel log, where they are visible through dmesg. Each code identifies a class of hardware or software fault. Critical codes include Xid 48 (double-bit ECC error), Xid 64 (row remapping failure; the neighboring Xid 63 records a row remapping event), Xid 74 (NVLink error), Xid 79 (GPU has fallen off the PCIe bus), and Xid 94 (contained ECC error, reported on Ampere and later GPUs, including Hopper). Severity ranges from events that are often caused by the running application, such as Xid 13 (graphics engine exception), to fatal hardware failures such as Xid 79, which can require a cold reboot and possibly an RMA.
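As a rough illustration of how these codes might be organized for triage, the sketch below maps each code to a short description and a severity bucket. The descriptions follow NVIDIA's published Xid names; the Severity buckets, the XID_TABLE name, and the classify helper are hypothetical and would need to match your own remediation policy.

```python
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    # Hypothetical buckets for illustration; NVIDIA does not define these.
    APPLICATION = "application"  # often caused by the workload itself
    DEGRADING = "degrading"      # GPU still usable, but health is declining
    FATAL = "fatal"              # GPU unusable until reset, reboot, or repair


# Illustrative triage table. The severity assignments are assumptions.
XID_TABLE = {
    13: ("graphics engine exception", Severity.APPLICATION),
    48: ("double-bit ECC error", Severity.FATAL),
    63: ("row remapping event", Severity.DEGRADING),
    64: ("row remapping failure", Severity.FATAL),
    74: ("NVLink error", Severity.DEGRADING),
    79: ("GPU has fallen off the bus", Severity.FATAL),
    94: ("contained ECC error", Severity.DEGRADING),
}


def classify(xid: int) -> Tuple[str, Optional[Severity]]:
    """Return (description, severity); unknown codes get severity None."""
    return XID_TABLE.get(xid, ("unknown Xid", None))
```

Returning None for unknown codes, instead of guessing a severity, keeps unrecognized events visible so they can be routed to a human.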
Why it matters
Xid events surface seconds to minutes before failures become visible through DCGM counters or application-level errors, making them the fastest early-warning signal available. A burst of Xid 94 events on an H100 indicates that the GPU's containment mechanism is catching ECC errors: the underlying memory is degrading, and the GPU should be scheduled for replacement. Missing or misclassifying an Xid code can turn a quick automated reset into a 45-minute manual triage cycle.
How to monitor
Parse dmesg output in real time for the Xid event lines that the nvidia kernel module emits under its NVRM prefix. To determine root cause, correlate each Xid code with the DCGM ECC counters recorded at the same time (for example DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) and with the NVLink error fields. Factryze's NOC Agent captures and classifies every Xid event, correlates it with DCGM telemetry and NVLink health data, and automatically triggers the appropriate remediation runbook.
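A minimal sketch of the parsing step follows. It assumes the driver's log lines look like "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."; the exact text after the code varies by driver version, so the regular expression captures it only as free-form detail. XidEvent, parse_xid_line, parse_log, and follow_dmesg are hypothetical names, and this is not how any particular agent implements it.

```python
import re
import subprocess
from typing import Iterable, Iterator, NamedTuple, Optional


class XidEvent(NamedTuple):
    pci_bus_id: str  # e.g. "0000:3b:00"
    xid: int         # numeric Xid code
    detail: str      # remainder of the line, format varies by driver


# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, <free-form detail>"
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+),?\s*(.*)")


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line carries an Xid report, else None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m.group(1), int(m.group(2)), m.group(3).strip())


def parse_log(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield Xid events from any iterable of kernel log lines."""
    for line in lines:
        event = parse_xid_line(line)
        if event is not None:
            yield event


def follow_dmesg() -> Iterator[XidEvent]:
    """Stream events from `dmesg --follow`; may require elevated privileges."""
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    assert proc.stdout is not None
    yield from parse_log(proc.stdout)
```

Keeping parse_log independent of the log source means the same parser can be run against a saved log file for testing, or against a live stream in production. Each yielded event can then be joined by PCI bus ID and timestamp with DCGM readings collected separately.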
Related terms
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
Xid 79 error: GPU completely disconnects from the PCIe bus.
CRC errors and replay events on NVLink GPU-to-GPU connections.