Errors & Failures

Xid Errors

NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.

What it is

Xid errors are numeric diagnostic codes that the NVIDIA GPU kernel driver writes to the kernel log (visible in dmesg), each mapping to a specific hardware or software failure mode. Critical codes include Xid 48 (double-bit ECC error), Xid 63 and Xid 64 (row remapping event recorded, and row remapping failure, respectively), Xid 74 (NVLink error), Xid 79 (GPU has fallen off the PCIe bus), and Xid 94 (contained ECC error, reported on Ampere and later GPUs, including Hopper). Severity ranges from application-level faults such as Xid 13 (graphics engine exception, often triggered by application code) to fatal hardware failures such as Xid 79, which requires a cold reboot and may lead to an RMA.
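As a rough illustration of mapping codes to failure modes, the lookup table below covers a few of the codes named above. It is a minimal sketch: the `Severity` tiers, the `XID_TABLE` name, and the `classify_xid` helper are hypothetical, and the tier assigned to each code is an assumption for illustration, not an NVIDIA-defined classification.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity tiers, ordered from least to most severe."""
    LOW = 0       # low end of the spectrum, e.g. Xid 13
    CRITICAL = 1  # needs attention; the GPU may recover after a reset
    FATAL = 2     # GPU unusable until the node is cold-rebooted


# Descriptions follow the codes discussed in the text.
# The severity assignments are illustrative assumptions.
XID_TABLE = {
    13: ("Graphics engine exception", Severity.LOW),
    48: ("Double-bit ECC error", Severity.CRITICAL),
    74: ("NVLink error", Severity.CRITICAL),
    79: ("GPU has fallen off the bus", Severity.FATAL),
    94: ("Contained ECC error", Severity.CRITICAL),
}


def classify_xid(code: int) -> tuple[str, Severity]:
    """Return (description, severity) for an Xid code.

    Unknown codes default to CRITICAL so that they are reviewed
    by a person rather than silently ignored.
    """
    return XID_TABLE.get(code, ("Unknown Xid", Severity.CRITICAL))
```

Defaulting unknown codes to a reviewable tier is a design choice in this sketch: it avoids treating a code that is missing from the table as harmless.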

Why it matters

Xid events surface seconds to minutes before failures become visible through DCGM counters or application-level errors, making them the fastest early-warning signal available. A burst of Xid 94 events on an H100 indicates that the GPU's containment mechanism is catching ECC errors: the underlying memory is degrading, and the GPU should be scheduled for replacement. Missing or misclassifying Xid codes turns a quick automated reset into a 45-minute manual triage cycle.
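One way to turn "a burst of Xid 94 events" into an automated signal is a sliding-window counter per GPU. The sketch below is an assumption-laden example: the text does not define what counts as a burst, so the `threshold` and `window_s` defaults are placeholders, and the `BurstDetector` class is hypothetical.

```python
from collections import deque


class BurstDetector:
    """Flags a GPU once `threshold` events fall within `window_s` seconds.

    Both defaults are illustrative assumptions, not recommended values.
    """

    def __init__(self, threshold: int = 3, window_s: float = 600.0):
        self.threshold = threshold
        self.window_s = window_s
        self._events: dict[str, deque] = {}  # GPU id -> recent timestamps

    def record(self, gpu: str, ts: float) -> bool:
        """Record one event at time `ts` (seconds); return True on a burst."""
        q = self._events.setdefault(gpu, deque())
        q.append(ts)
        # Evict timestamps that have aged out of the window.
        while q and ts - q[0] > self.window_s:
            q.popleft()
        return len(q) >= self.threshold
```

A caller would feed `record` with the timestamp of each Xid 94 event for a given GPU and schedule that GPU for replacement review when it returns `True`.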

How to monitor

Parse the kernel log (dmesg) in real time for the Xid event strings emitted by the nvidia kernel module; each event appears on a line containing "NVRM: Xid". Correlate Xid codes with concurrent DCGM ECC counters (DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) and NVLink error fields to determine root cause. Factryze's NOC Agent captures and classifies every Xid event, correlates it with DCGM telemetry and NVLink health data, and triggers the appropriate remediation runbook automatically.
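The parsing step can be sketched as a small line parser. This example assumes the commonly seen log shape `NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.`; the exact format varies between driver versions, so the regular expression should be treated as a starting point. The `XidEvent` type and `parse_xid_line` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape: "NVRM: Xid (PCI:<bus id>): <code>, <detail>".
# Some drivers omit the "PCI:" prefix, so it is optional here.
XID_RE = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+)(?:, (?P<detail>.*))?"
)


class XidEvent(NamedTuple):
    pci: str     # PCI bus id of the reporting GPU
    code: int    # numeric Xid code
    detail: str  # remainder of the message, possibly empty


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Extract an XidEvent from one kernel log line, or None if absent."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m["pci"], int(m["code"]), (m["detail"] or "").strip())
```

In practice the lines could be streamed from `dmesg --follow` or `journalctl -k -f` and passed to `parse_xid_line` one at a time; the resulting code and PCI id can then be joined against DCGM fields for the same GPU.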

Figure: Xid Errors - GPU Driver Error Classification

Monitor this automatically

Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
