Health Check
DCGM diagnostic tests (Level 1/2/3) validating GPU hardware integrity between jobs.
What it is
A GPU health check is a scheduled or on-demand diagnostic that validates hardware integrity using DCGM's built-in diagnostic levels. Level 1 (dcgmi diag -r 1, ~30 seconds) validates driver responsiveness, basic GPU queries, and PCIe configuration. Level 2 (dcgmi diag -r 2, ~2 minutes) adds memory bandwidth verification, SM compute validation, and PCIe throughput measurement. Level 3 (dcgmi diag -r 3, ~12-15 minutes) performs exhaustive HBM bit-pattern testing, extended SM stress, and NVLink bandwidth validation, which can catch marginal faults that Levels 1 and 2 miss. Run times are approximate and vary with GPU model and DCGM version.
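The three levels can be captured in a small lookup table. The sketch below is illustrative only: it assumes dcgmi is on the PATH and that it exits non-zero when a diagnostic fails, which should be confirmed for the installed DCGM version. The names DIAG_LEVELS, TIMEOUT_FACTOR, diag_command, diag_timeout, and run_diag are hypothetical helpers, not part of DCGM.

```python
import subprocess

# Approximate run times (seconds) and coverage as quoted in the text above.
# Real durations depend on GPU model and DCGM version.
DIAG_LEVELS = {
    1: (30, "driver responsiveness, basic GPU queries, PCIe configuration"),
    2: (120, "adds memory bandwidth, SM compute, PCIe throughput"),
    3: (900, "adds exhaustive HBM patterns, extended SM stress, NVLink bandwidth"),
}

TIMEOUT_FACTOR = 2  # hypothetical safety margin, not a DCGM setting


def diag_command(level):
    """Build the dcgmi command line for one diagnostic level."""
    if level not in DIAG_LEVELS:
        raise ValueError(f"unsupported diagnostic level: {level}")
    return ["dcgmi", "diag", "-r", str(level)]


def diag_timeout(level):
    """Seconds to wait before giving up on a run (illustrative)."""
    approx_seconds, _coverage = DIAG_LEVELS[level]
    return approx_seconds * TIMEOUT_FACTOR


def run_diag(level):
    """Run one diagnostic level and return (passed, raw_output).

    Assumes dcgmi exits non-zero when the diagnostic fails; confirm this
    for your DCGM version before relying on it.
    """
    result = subprocess.run(
        diag_command(level),
        capture_output=True,
        text=True,
        timeout=diag_timeout(level),
    )
    return result.returncode == 0, result.stdout
```

Calling run_diag(1) on a GPU node would run the quick check and return whether it passed along with the raw dcgmi output for logging.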
Why it matters
A Level 1 health check in a Slurm prolog catches a GPU left in a bad state by the previous job's crashed CUDA context, so the next job is not allocated a non-functional device and hours of training time are not wasted. Without prolog health checks, degraded GPUs accumulate in the scheduling pool and are discovered only when a job fails. Level 3 diagnostics can detect marginal HBM faults that have not yet produced ECC errors but would fail under sustained load.
How to monitor
Run dcgmi diag -r 1 in Slurm prolog scripts and dcgmi diag -r 2 in epilog scripts between jobs. Reserve Level 3 for post-maintenance qualification and post-hardware-replacement validation. Factryze extends static schedules with predictive diagnostics -- the NOC Agent triggers Level 3 checks only on GPUs showing anomalous telemetry (rising DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, NVLink CRC upticks, or thermal trend deviations) to maximize coverage while minimizing GPU time lost to testing.
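The static schedule above can be sketched as a small helper that a prolog or epilog script calls. This is a minimal sketch under stated assumptions: dcgmi is on the PATH and exits non-zero on a failed diagnostic, and the script's exit status is passed back to Slurm, which drains a node whose prolog exits non-zero. The names SCHEDULE, select_level, and run_check are hypothetical, and the sbe_delta escalation is a crude stand-in for anomaly-triggered Level 3 checks, not Factryze's actual logic.

```python
import subprocess

# Static schedule from the text: Level 1 in the prolog, Level 2 in the
# epilog, Level 3 after maintenance or hardware replacement.
SCHEDULE = {"prolog": 1, "epilog": 2, "post_maintenance": 3}


def select_level(stage, sbe_delta=0, sbe_threshold=1):
    """Pick a diagnostic level for a scheduling stage.

    sbe_delta is a hypothetical count of new single-bit ECC errors
    (for example the change in DCGM_FI_DEV_ECC_SBE_VOL_TOTAL between two
    samples). The threshold is illustrative, not a recommended value.
    """
    if sbe_delta >= sbe_threshold:
        return 3  # escalate when the error counter is rising
    return SCHEDULE[stage]


def run_check(stage, sbe_delta=0, runner=subprocess.run):
    """Run the selected diagnostic and return its exit status.

    runner is injectable so the logic can be exercised without a GPU.
    """
    level = select_level(stage, sbe_delta)
    result = runner(["dcgmi", "diag", "-r", str(level)])
    return result.returncode
```

A prolog script could end with sys.exit(run_check("prolog")) so that a failed Level 1 check keeps the job off the node. Escalating to Level 3 inside a prolog would delay job start by several minutes, so in practice the escalation path is better suited to a separate maintenance job.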
Related terms
NVIDIA's GPU management toolkit exposing health metrics via field IDs.
Continuous tracking of GPU health, thermals, errors, and performance metrics.
Gracefully removing a node from scheduling via kubectl drain or Slurm DRAIN state.