Firmware Update
Updating GPU InfoROM, VBIOS, and NVSwitch firmware during scheduled maintenance windows.
What it is
A firmware update flashes new firmware onto GPU components (InfoROM, GPU VBIOS, HBM controller microcode) or supporting infrastructure (NVSwitch ASICs, InfiniBand switch firmware) to apply bug fixes, ECC handling improvements, or performance optimizations. InfoROM and GPU VBIOS images are typically flashed with nvflash or an OEM-supplied firmware package rather than with nvidia-smi, whose -f option only redirects query output to a file; NVSwitch firmware is updated via nvswitchctl or NVIDIA Fabric Manager tools. Exact tool names and flags vary by platform and release, so confirm them against the release notes for the firmware package being applied. Activation typically requires a cold power cycle (full power off, not a warm reboot).
Why it matters
Firmware updates carry higher operational risk than driver upgrades: a failed or interrupted flash can brick the device, requiring a physical RMA rather than a software fix. A node cannot run workloads during the flash operation (2-10 minutes per device) or the subsequent cold reboot. Skipping a critical InfoROM update that improves row-remapping behavior on H100 GPUs means the fleet keeps accumulating more row remaps than necessary under the same workload conditions (H100 uses row remapping, not the page retirement of earlier architectures).
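To make the downtime figure concrete, the following is a minimal sketch that turns the 2-10 minutes per device range into a per-node window. It assumes devices are flashed one after another, and the 10-minute cold-reboot allowance is a placeholder rather than a measured value; the function name is hypothetical.

```python
def flash_window_minutes(devices, per_device_min=(2, 10), reboot_min=10):
    """Estimate (best, worst) node downtime in minutes for one flash cycle.

    Assumes sequential flashing. reboot_min is an illustrative allowance
    for the cold power cycle, not a vendor-specified figure.
    """
    low, high = per_device_min
    return (devices * low + reboot_min, devices * high + reboot_min)


if __name__ == "__main__":
    # An 8-GPU node under the assumptions above
    print(flash_window_minutes(8))  # -> (26, 90)
```

Under these assumptions an 8-GPU node is out of service for roughly 26 to 90 minutes, which is why the flash needs a scheduled window rather than a gap between jobs.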
How to monitor
Track firmware versions on every GPU (nvidia-smi --query-gpu=vbios_version --format=csv) and NVSwitch to identify stale devices. After each post-flash cold reboot, run dcgmi diag -r 2 and an NVLink bandwidth validation before returning the node to production. Factryze tracks firmware versions fleet-wide, schedules updates during natural job boundaries, and automatically quarantines devices that fail post-flash validation for manual inspection.
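A per-node version audit could be sketched as follows. This is an illustration, not Factryze's implementation: the helper names are hypothetical, the expected VBIOS string is a made-up placeholder, and the query assumes the index, name, vbios_version, and inforom.img fields are available on the installed driver (check nvidia-smi --help-query-gpu).

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; assumed to be supported by the local driver.
QUERY_FIELDS = ["index", "name", "vbios_version", "inforom.img"]


def query_firmware():
    """Run nvidia-smi and return its CSV output as text."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_firmware(csv_text):
    """Parse nvidia-smi CSV text into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (field.strip() for field in record))))
    return rows


def find_stale(rows, expected_vbios):
    """Return the GPUs whose VBIOS version differs from the expected one."""
    return [row for row in rows if row["vbios_version"] != expected_vbios]


if __name__ == "__main__":
    EXPECTED_VBIOS = "96.00.74.00.01"  # placeholder; use your fleet's target version
    for gpu in find_stale(parse_firmware(query_firmware()), EXPECTED_VBIOS):
        print(f"GPU {gpu['index']} ({gpu['name']}): VBIOS {gpu['vbios_version']} is stale")
```

Keeping the parsing separate from the nvidia-smi call lets the same logic run against output collected from many nodes.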
Related terms
NVIDIA's NVLink switch enabling all-to-all GPU communication.
DCGM diagnostic tests (Level 1/2/3) validating GPU hardware integrity between jobs.
Sequential node restarts for kernel updates and driver upgrades while maintaining cluster capacity.