GPU Reset
Hardware-level GPU reset via nvidia-smi -r, with escalation to an ipmitool power cycle or a cold reboot when the software reset fails.
What it is
A GPU reset is a hardware-level reinitialization of a single GPU device without rebooting the host, triggered via nvidia-smi -r (targeting a specific GPU by index or UUID) or programmatically through the DCGM and NVML APIs. The reset clears all CUDA contexts, resets volatile ECC counters, activates pending page retirements, and restores the GPU to a clean state in seconds. Prerequisites: all GPU processes must be terminated first, the GPU must be PCIe-responsive, and MIG instances must be destroyed before reset.
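A minimal sketch of that reset sequence, assuming a single target GPU at index 0; the index, the exact MIG layout, and how forcefully you stop processes will differ per system (a production workflow would drain the job rather than kill it).

```bash
# Find and stop any compute processes still holding the GPU (index 0 assumed).
nvidia-smi --query-compute-apps=pid --format=csv,noheader -i 0 | xargs -r sudo kill

# If MIG is enabled on this GPU, destroy compute instances and GPU instances first.
sudo nvidia-smi mig -dci -i 0
sudo nvidia-smi mig -dgi -i 0

# Reset the single GPU; requires root and no remaining clients on the device.
sudo nvidia-smi -r -i 0
```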
Why it matters
A GPU reset resolves sticky CUDA errors, hung compute engines, and driver-level state corruption in 5-10 seconds, versus the 3-5 minutes required for a full node reboot; that difference is critical when a single stuck GPU is blocking a 512-GPU training job. However, if nvidia-smi -r fails because the GPU is unresponsive on the PCIe bus (Xid 79), a software reset is impossible and escalation to a BMC-level power cycle or a cold reboot is required. The escalation ladder is: GPU reset, driver reload, ipmitool power cycle, cold reboot.
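A hedged sketch of that escalation ladder as a shell script. The GPU index, the kernel module list, and the BMC host and credentials are illustrative placeholders; in practice the BMC power cycle and reboot rungs are usually issued by an external controller after workloads are drained, not by the affected node itself.

```bash
#!/usr/bin/env bash
# Escalation ladder sketch: GPU reset -> driver reload -> BMC power cycle -> cold reboot.
set -u
GPU=0

# Rung 1: in-band GPU reset.
if sudo nvidia-smi -r -i "$GPU"; then
    exit 0
fi

# Rung 2: reload the driver stack (module list varies; fails if any module is still in use).
if sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia && sudo modprobe nvidia; then
    exit 0
fi

# Rung 3: out-of-band power cycle via the BMC (host, user, password are placeholders).
if ipmitool -I lanplus -H bmc.example.com -U admin -P secret chassis power cycle; then
    exit 0
fi

# Rung 4: last resort, cold reboot of the node.
sudo systemctl reboot
```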
How to monitor
Confirm reset success by querying GPU state via nvidia-smi immediately after the reset and by running dcgmi diag -r 1 for validation. Check that DCGM_FI_DEV_ECC_SBE_VOL_TOTAL has been cleared and that DCGM_FI_DEV_RETIRED_PENDING has returned to zero, confirming the pending retirements were activated. Factryze's SRE Agent automatically executes the full escalation ladder, validating GPU health via a DCGM Level 1 diagnostic after each attempt before returning the GPU to the production scheduling pool.
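A sketch of that post-reset validation, assuming GPU index 0 and a running DCGM host engine for dcgmi; the query fields shown are standard nvidia-smi options, and the diagnostic level matches the Level 1 run described above.

```bash
# Confirm the GPU answers basic queries after the reset.
nvidia-smi -i 0 --query-gpu=name,pcie.link.gen.current,temperature.gpu --format=csv

# Volatile ECC counters should read zero and no page retirements should remain pending.
nvidia-smi -i 0 -q -d ECC,PAGE_RETIREMENT

# Quick DCGM Level 1 diagnostic before returning the GPU to the scheduling pool.
dcgmi diag -r 1
```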
Related terms
Driver reload: Reloading nvidia.ko via rmmod/modprobe to clear driver state without a full reboot.
Xid 79: The GPU has completely disconnected from the PCIe bus.
Page retirement: GPU firmware permanently disabling faulty memory pages after ECC errors.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.