CUDA Errors
CUDA runtime and driver API error codes indicating GPU compute failures.
What it is
CUDA errors are enumerated error codes returned by the CUDA runtime API (cudaError_t) or driver API (CUresult) when GPU kernel launches, memory operations, or device management calls fail. Key codes include cudaErrorMemoryAllocation (error 2, out of memory), cudaErrorIllegalAddress (error 700, invalid device memory access), cudaErrorECCUncorrectable (error 214, hardware ECC failure), and cudaErrorSystemDriverMismatch (error 803, version incompatibility). CUDA errors are either non-sticky (the error clears after the failing call and later calls can succeed) or sticky (the error poisons the entire CUDA context, so every later call fails until the device is reset or the process is restarted).
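The codes above can be kept in a small lookup table so that application telemetry reports a name and a sticky flag rather than a bare integer. The following is a minimal sketch, not part of any CUDA library: `CudaErrorInfo`, `KNOWN_ERRORS`, and `poisons_context` are hypothetical names, and the sticky flags are an assumption based on the behaviour described above. Error 803 is marked `None` on the assumption that it is raised at initialization, before a context exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CudaErrorInfo:
    code: int
    name: str
    # True: sticky, False: non-sticky,
    # None: assumed to occur at initialization, before any context exists.
    sticky: Optional[bool]
    meaning: str


# Only the four codes named in this entry; the real enum is much larger.
KNOWN_ERRORS = {
    info.code: info
    for info in (
        CudaErrorInfo(2, "cudaErrorMemoryAllocation", False, "out of device memory"),
        CudaErrorInfo(214, "cudaErrorECCUncorrectable", True, "uncorrectable ECC error"),
        CudaErrorInfo(700, "cudaErrorIllegalAddress", True, "invalid device memory access"),
        CudaErrorInfo(803, "cudaErrorSystemDriverMismatch", None, "driver version mismatch"),
    )
}


def poisons_context(code: int) -> bool:
    """Return True if the code is known here and flagged as sticky."""
    info = KNOWN_ERRORS.get(code)
    return bool(info and info.sticky)
```

A caller might use `poisons_context(700)` to decide whether retrying inside the same process is pointless. Codes missing from the table return `False`, so a real tool would need to handle unknown codes separately.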
Why it matters
When an H100 experiences a double-bit ECC error (DBE) during a training step, the application receives cudaErrorECCUncorrectable, the CUDA context becomes invalid for the rest of the process, and all subsequent CUDA calls on that device fail regardless of whether they touch the faulty page. Distinguishing hardware-caused CUDA errors from software bugs is critical: cudaErrorECCUncorrectable indicates a hardware memory fault, while cudaErrorIllegalAddress may be either a kernel bug or a memory cell fault. Misclassification wastes engineering time on kernel debugging when the GPU itself needs replacement.
How to monitor
Correlate CUDA errors from application telemetry with concurrent Xid events in dmesg and DCGM ECC counters (DCGM_FI_DEV_ECC_DBE_VOL_TOTAL). A cudaErrorECCUncorrectable coinciding with Xid 48 confirms hardware root cause; cudaErrorIllegalAddress without concurrent Xid or ECC signals points to a software kernel bug. Factryze automatically classifies CUDA errors as hardware-rooted (triggering GPU drain) or software-rooted (flagging for developer investigation) by cross-referencing these signals.
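The correlation rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not Factryze's actual classifier: `Signals`, `RootCause`, and `classify` are hypothetical names, and the inputs are assumed to have already been collected for one time window (Xid codes parsed from dmesg, and the increase in DCGM_FI_DEV_ECC_DBE_VOL_TOTAL).

```python
from dataclasses import dataclass
from enum import Enum

ECC_UNCORRECTABLE = 214  # cudaErrorECCUncorrectable
ILLEGAL_ADDRESS = 700    # cudaErrorIllegalAddress
XID_DOUBLE_BIT_ECC = 48  # Xid reported for a double-bit ECC error


class RootCause(Enum):
    HARDWARE = "hardware"          # candidate for GPU drain
    SOFTWARE = "software"          # flag for developer investigation
    UNDETERMINED = "undetermined"  # signals do not settle it


@dataclass(frozen=True)
class Signals:
    cuda_error: int   # numeric CUDA error reported by the application
    xids: frozenset   # Xid codes seen in dmesg during the window
    dbe_delta: int    # increase in DCGM_FI_DEV_ECC_DBE_VOL_TOTAL


def classify(s: Signals) -> RootCause:
    ecc_evidence = XID_DOUBLE_BIT_ECC in s.xids or s.dbe_delta > 0
    # An uncorrectable ECC error is treated as hardware-rooted on its own;
    # concurrent Xid 48 or a rising DBE counter also implicates hardware.
    if s.cuda_error == ECC_UNCORRECTABLE or ecc_evidence:
        return RootCause.HARDWARE
    # Illegal address with no Xid and no ECC signal points to a kernel bug.
    if s.cuda_error == ILLEGAL_ADDRESS and not s.xids:
        return RootCause.SOFTWARE
    return RootCause.UNDETERMINED
```

For example, `classify(Signals(700, frozenset(), 0))` returns `RootCause.SOFTWARE`, while the same error alongside Xid 48 returns `RootCause.HARDWARE`. The third outcome is kept deliberately: an illegal address that coincides with some other Xid is left undetermined rather than guessed.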
Related terms
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
Hardware GPU reset via nvidia-smi -r with escalation to ipmitool or cold reboot.
GPU kernel driver panic or hang requiring intervention to recover.