
CUDA Errors

CUDA runtime and driver API error codes indicating GPU compute failures.

What it is

CUDA errors are enumerated codes returned by the CUDA runtime API (cudaError_t) or the driver API (CUresult) when a kernel launch, memory operation, or device management call fails. Key runtime codes include cudaErrorMemoryAllocation (error 2, out of memory), cudaErrorIllegalAddress (error 700, invalid device memory access), cudaErrorECCUncorrectable (error 214, uncorrectable hardware ECC failure), and cudaErrorSystemDriverMismatch (error 803, driver version incompatibility). CUDA errors are either non-sticky or sticky. A non-sticky error affects only the failing call, and the context remains usable afterwards. A sticky error poisons the entire CUDA context: subsequent calls keep failing until the device is reset or the process is restarted.
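As a rough illustration of the codes listed above, the following Python sketch maps the four numeric values from the text to readable labels. The `CudaError` enum, the `STICKY_CODES` set, and the helper names are hypothetical and are not part of any CUDA binding. Treating codes 214 and 700 as sticky is an assumption of this sketch based on the description in this entry.

```python
from enum import IntEnum


class CudaError(IntEnum):
    """Subset of cudaError_t values named in this entry (hypothetical enum)."""
    MEMORY_ALLOCATION = 2         # cudaErrorMemoryAllocation (out of memory)
    ECC_UNCORRECTABLE = 214       # cudaErrorECCUncorrectable
    ILLEGAL_ADDRESS = 700         # cudaErrorIllegalAddress
    SYSTEM_DRIVER_MISMATCH = 803  # cudaErrorSystemDriverMismatch


# Assumption of this sketch: codes treated as sticky (context-poisoning).
STICKY_CODES = frozenset({
    CudaError.ECC_UNCORRECTABLE.value,
    CudaError.ILLEGAL_ADDRESS.value,
})


def is_sticky(code: int) -> bool:
    """Return True if the raw code is in this sketch's sticky set."""
    return int(code) in STICKY_CODES


def describe(code: int) -> str:
    """Render a short label for a raw error code; unknown codes pass through."""
    try:
        name = CudaError(code).name
    except ValueError:
        return f"unknown CUDA error {code}"
    kind = "sticky" if is_sticky(code) else "non-sticky"
    return f"{name} ({code}, {kind})"
```

A telemetry pipeline could call `describe(700)` on a raw integer pulled from application logs to decide whether the process needs a restart before retrying.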

Why it matters

When an H100 experiences a double-bit ECC error (DBE) during a training step, the application receives cudaErrorECCUncorrectable, the CUDA context becomes permanently invalid, and all subsequent CUDA calls on that device fail regardless of whether they touch the faulty page. Distinguishing hardware-caused CUDA errors from software bugs is critical: cudaErrorECCUncorrectable always indicates hardware degradation, while cudaErrorIllegalAddress may be either a kernel bug or a memory cell fault. Misclassification wastes engineering time on kernel debugging when the GPU itself needs replacement.

How to monitor

Correlate CUDA errors from application telemetry with concurrent Xid events in dmesg and with the DCGM double-bit ECC counter (DCGM_FI_DEV_ECC_DBE_VOL_TOTAL). A cudaErrorECCUncorrectable coinciding with Xid 48 confirms a hardware root cause; a cudaErrorIllegalAddress with no concurrent Xid or ECC signal points to a bug in the CUDA kernel code. Factryze automatically classifies CUDA errors as hardware-rooted (triggering GPU drain) or software-rooted (flagging for developer investigation) by cross-referencing these signals.
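The correlation rules above can be sketched as a small decision function. This is a minimal illustration and not Factryze's actual classifier. The function name, its inputs (a raw CUDA error code, the Xid codes seen in the same time window, and the increase in the DBE counter over that window), and the returned labels are all hypothetical.

```python
from typing import Iterable

ECC_UNCORRECTABLE = 214  # cudaErrorECCUncorrectable
ILLEGAL_ADDRESS = 700    # cudaErrorIllegalAddress
XID_DOUBLE_BIT_ECC = 48  # Xid 48, double-bit ECC error


def classify(cuda_error: int, xids: Iterable[int], dbe_delta: int) -> str:
    """Classify a CUDA error using concurrent Xid codes and the DBE counter delta.

    Returned labels are hypothetical names chosen for this sketch.
    """
    xid_set = set(xids)
    ecc_signal = XID_DOUBLE_BIT_ECC in xid_set or dbe_delta > 0

    if cuda_error == ECC_UNCORRECTABLE:
        # Xid 48 alongside error 214 confirms the hardware root cause.
        if XID_DOUBLE_BIT_ECC in xid_set:
            return "hardware"
        return "hardware-unconfirmed"

    if cuda_error == ILLEGAL_ADDRESS:
        if ecc_signal:
            return "hardware-suspected"
        if not xid_set:
            # No Xid and no ECC signal: points to a kernel code bug.
            return "software"
        return "inconclusive"

    return "inconclusive"
```

For example, `classify(700, [], 0)` yields the software label, while `classify(214, [48], 1)` yields the hardware label. Cases that match neither rule from the text are left as inconclusive rather than guessed.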

Figure: CUDA Errors - Runtime Error Codes and Classification
