Health Check

DCGM diagnostic tests (Level 1/2/3) validating GPU hardware integrity between jobs.

What it is

A GPU health check is a scheduled or on-demand diagnostic run that validates hardware integrity using DCGM's built-in diagnostic levels. Level 1 (dcgmi diag -r 1, about 30 seconds) validates driver responsiveness, basic GPU queries, and PCIe configuration. Level 2 (dcgmi diag -r 2, about 2 minutes) adds memory bandwidth verification, SM compute validation, and PCIe throughput measurement. Level 3 (dcgmi diag -r 3, about 12 to 15 minutes) performs exhaustive HBM bit-pattern testing, extended SM stress, and NVLink bandwidth validation, which can catch marginal faults that Levels 1 and 2 miss. The exact set of tests at each level and the run times vary with DCGM version and GPU model.
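The three levels can be written down as a small lookup table, which is handy when a wrapper script needs to pick a command and a timeout. The sketch below is illustrative only: the names DiagLevel, LEVELS, diag_command, and timeout_seconds are hypothetical, and the durations are the approximate figures quoted above rather than measured values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DiagLevel:
    run_level: int       # value passed to `dcgmi diag -r`
    approx_seconds: int  # rough duration quoted in this entry, not a guarantee


# Assumption: durations are the approximate numbers from the text above.
LEVELS: Dict[int, DiagLevel] = {
    1: DiagLevel(run_level=1, approx_seconds=30),
    2: DiagLevel(run_level=2, approx_seconds=2 * 60),
    3: DiagLevel(run_level=3, approx_seconds=15 * 60),
}


def diag_command(level: int) -> List[str]:
    """Build the dcgmi command line for a diagnostic level."""
    return ["dcgmi", "diag", "-r", str(LEVELS[level].run_level)]


def timeout_seconds(level: int, safety_factor: float = 2.0) -> int:
    """Timeout budget: the approximate duration padded by a safety factor."""
    return int(LEVELS[level].approx_seconds * safety_factor)
```

For example, diag_command(3) yields the argument list for a Level 3 run, and timeout_seconds(3) gives a 30-minute budget under the default safety factor.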

Why it matters

A Level 1 health check in a Slurm prolog catches a GPU left in a bad state by the previous job's CUDA context crash. This prevents the next job from being allocated a non-functional device and can save hours of wasted training time. Without prolog health checks, degraded GPUs accumulate in the scheduling pool and are discovered only when a job fails. Level 3 diagnostics can detect HBM faults that have not yet produced ECC errors but will fail under sustained load.
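A prolog check along these lines could look like the sketch below. It assumes that a failed diagnostic shows up either as a non-zero dcgmi exit code or as the word "Fail" in the text report; both checks are kept because behavior may differ across DCGM versions. The function names and the 60-second timeout are hypothetical. Slurm drains a node whose prolog exits non-zero, so returning 1 is enough to keep the next job off the device.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A runner takes a command and returns (exit_code, combined_output).
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(cmd: Sequence[str], timeout_s: int = 60) -> Tuple[int, str]:
    """Run a command, treating a missing binary or a timeout as a failure."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 1, f"health check could not complete: {exc}"
    return proc.returncode, proc.stdout + proc.stderr


def gpu_is_healthy(exit_code: int, output: str) -> bool:
    # Assumption: a failing test yields a non-zero exit code or a "Fail"
    # entry in the report. The substring match is deliberately conservative.
    return exit_code == 0 and "fail" not in output.lower()


def prolog_exit_code(runner: Runner = run_command) -> int:
    """Return 0 to let the job start, 1 to make Slurm drain the node."""
    code, output = runner(["dcgmi", "diag", "-r", "1"])
    if gpu_is_healthy(code, output):
        return 0
    print(output)  # leave the diagnostic report in the prolog log
    return 1
```

A prolog script built on this would end with raise SystemExit(prolog_exit_code()). The runner parameter exists only so the decision logic can be exercised without a GPU.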

How to monitor

Run dcgmi diag -r 1 in Slurm prolog scripts and dcgmi diag -r 2 in epilog scripts so that every GPU is checked between jobs. Reserve Level 3 for qualification after maintenance or hardware replacement. Factryze extends these static schedules with predictive diagnostics: the NOC Agent triggers Level 3 checks only on GPUs showing anomalous telemetry (a rising DCGM_FI_DEV_ECC_SBE_VOL_TOTAL count, NVLink CRC error upticks, or thermal trend deviations), which maximizes coverage while minimizing GPU time lost to testing.
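The idea of triggering Level 3 only on suspicious telemetry can be illustrated with a simple rule. This is a minimal sketch, not Factryze's implementation: the function names and every threshold are hypothetical, and the inputs are assumed to be time-ordered samples already collected from DCGM (for instance, successive readings of DCGM_FI_DEV_ECC_SBE_VOL_TOTAL).

```python
from typing import Sequence


def is_rising(samples: Sequence[int], min_increase: int) -> bool:
    """True when a counter grew by at least min_increase from first to last sample."""
    return len(samples) >= 2 and samples[-1] - samples[0] >= min_increase


def exceeds_baseline(temps_c: Sequence[float], baseline_c: float,
                     max_delta_c: float) -> bool:
    """True when the latest temperature is more than max_delta_c above baseline."""
    return len(temps_c) > 0 and temps_c[-1] - baseline_c > max_delta_c


def needs_level3(sbe_totals: Sequence[int],
                 nvlink_crc_totals: Sequence[int],
                 temps_c: Sequence[float] = (),
                 baseline_c: float = 0.0,
                 sbe_min_increase: int = 5,      # hypothetical threshold
                 crc_min_increase: int = 10,     # hypothetical threshold
                 max_delta_c: float = 10.0) -> bool:  # hypothetical threshold
    """Flag a GPU for a Level 3 run if any one signal looks anomalous."""
    return (
        is_rising(sbe_totals, sbe_min_increase)
        or is_rising(nvlink_crc_totals, crc_min_increase)
        or exceeds_baseline(temps_c, baseline_c, max_delta_c)
    )
```

A production system would likely use rates over time windows and per-model baselines rather than fixed deltas. The point here is only that the long diagnostic is gated on evidence instead of a fixed schedule.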
