Operations

GPU Reset

Hardware GPU reset via nvidia-smi -r with escalation to ipmitool or cold reboot.

What it is

A GPU reset is a hardware-level reinitialization of a single GPU device without rebooting the host, triggered via nvidia-smi -r (targeting a specific GPU by index or UUID) or programmatically through the DCGM and NVML APIs. The reset clears all CUDA contexts, resets volatile ECC counters, activates pending page retirements, and restores the GPU to a clean state in seconds. Prerequisites: all GPU processes must be terminated first, the GPU must be PCIe-responsive, and MIG instances must be destroyed before reset.
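The prerequisite check and the reset itself can be sketched as thin Python wrappers around nvidia-smi (a minimal sketch; the helper names are illustrative, not a real API):

```python
import subprocess

def gpu_process_pids(index: int) -> list[int]:
    """PIDs of compute processes still on the GPU; must be empty before reset."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]

def reset_command(index: int) -> list[str]:
    """argv for a software reset of a single GPU, targeted by index."""
    return ["nvidia-smi", "-i", str(index), "-r"]

def reset_gpu(index: int) -> bool:
    """Reset one GPU; returns True if the driver accepted the reset."""
    if gpu_process_pids(index):
        # Reset is refused while any CUDA context holds the device.
        raise RuntimeError(f"GPU {index} still has active compute processes")
    return subprocess.run(reset_command(index)).returncode == 0
```

On MIG-enabled GPUs, the MIG instances would additionally need to be destroyed (e.g. via nvidia-smi mig -dgi) before reset_gpu is called.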

Why it matters

A GPU reset resolves sticky CUDA errors, hung compute engines, and driver-level state corruption in 5-10 seconds, versus the 3-5 minutes required for a full node reboot, a critical difference when a single stuck GPU is blocking a 512-GPU training job. However, if nvidia-smi -r fails because the GPU is unresponsive on the PCIe bus (Xid 79), a software reset is impossible and escalation to a BMC-level power cycle or a cold reboot is required. The escalation ladder is: GPU reset, then driver reload, then ipmitool power cycle, then cold reboot.
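The escalation ladder can be encoded as an ordered list, with each failed attempt promoting remediation to the next rung (a sketch; the step labels and function name are illustrative):

```python
ESCALATION_LADDER = [
    "nvidia-smi -r",         # software GPU reset (~5-10 s)
    "driver reload",         # unload and reload the nvidia kernel modules
    "ipmitool power cycle",  # BMC-level power cycle of the node
    "cold reboot",           # full power-off and restart
]

def next_step(failed_attempts: int) -> str:
    """Return the next remediation to try after N failed attempts."""
    if failed_attempts >= len(ESCALATION_LADDER):
        raise RuntimeError("Escalation exhausted: flag node for hardware service")
    return ESCALATION_LADDER[failed_attempts]
```

For example, next_step(0) starts with the cheapest action, the software reset, and only a GPU that survives every rung unrecovered gets flagged for hardware service.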

How to monitor

Confirm reset success by querying GPU state with nvidia-smi immediately afterward and running dcgmi diag -r 1 for validation. Check that DCGM_FI_DEV_ECC_SBE_VOL_TOTAL has been cleared and that DCGM_FI_DEV_RETIRED_PENDING shows pending page retirements were activated. Factryze's SRE Agent automatically executes the full escalation ladder, validating GPU health with a DCGM Level 1 diagnostic after each attempt before returning the GPU to the production scheduling pool.
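Post-reset validation reduces to two checks on the DCGM fields above (a sketch; in practice the field values would come from dcgmi dmon or the DCGM bindings):

```python
def reset_validated(ecc_sbe_volatile: int, retired_pages_pending: int) -> bool:
    """True if the reset left the GPU in the expected clean state."""
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL: volatile ECC counters reset to 0 on GPU reset.
    # DCGM_FI_DEV_RETIRED_PENDING: drops to 0 once pending retirements are activated.
    return ecc_sbe_volatile == 0 and retired_pages_pending == 0
```

A nonzero volatile ECC count or a lingering pending retirement after reset suggests the reset did not complete cleanly and the next rung of the escalation ladder is warranted.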

Figure: GPU Reset - Escalation Decision Tree

Monitor this automatically

Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
