Operations

Firmware Update

Updating GPU InfoROM, VBIOS, and NVSwitch firmware during scheduled maintenance windows.

What it is

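A firmware update flashes new images onto GPU components (InfoROM, GPU VBIOS, HBM controller microcode) or supporting infrastructure (NVSwitch ASICs, InfiniBand switch firmware) to apply bug fixes, ECC handling improvements, or performance optimizations. The flashing tool depends on the component and platform: VBIOS images are typically written with nvflash or a vendor-supplied firmware package, while InfoROM and NVSwitch firmware usually arrive in the system vendor's update bundle, often applied through the BMC on HGX- and DGX-class systems. nvidia-smi reports firmware versions but does not flash them (its -f flag only redirects query output to a file). New firmware typically takes effect only after a cold power cycle, meaning a full power-off rather than a warm reboot.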

Why it matters

Firmware updates carry higher operational risk than driver upgrades: a failed or interrupted flash can brick the device, and recovery then means a physical RMA rather than a software rollback. A node cannot run workloads during the flash operation (2 to 10 minutes per device) or during the subsequent cold reboot. Skipping an update has a cost too: if a critical InfoROM update improves row remapping behavior on H100 GPUs, a fleet that misses it keeps consuming more row remaps than necessary under the same workload conditions.
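As a rough planning aid, the per-node maintenance window can be estimated from the 2 to 10 minute per-device flash time. The sketch below assumes devices are flashed one after another, and the reboot and validation durations are hypothetical inputs you would measure on your own hardware; none of these names come from a real tool.

```python
def downtime_window_minutes(num_devices, reboot_minutes, validation_minutes,
                            per_device=(2, 10)):
    """Estimate (best case, worst case) node downtime in minutes.

    Assumes sequential flashing; per_device is the (min, max) flash time
    for a single device. reboot_minutes and validation_minutes are
    site-specific assumptions, not vendor figures.
    """
    low, high = per_device
    overhead = reboot_minutes + validation_minutes
    return (num_devices * low + overhead, num_devices * high + overhead)


# Example: 8 GPUs, an assumed 10-minute cold power cycle and 15 minutes of validation.
best, worst = downtime_window_minutes(8, reboot_minutes=10, validation_minutes=15)
print(f"Plan for {best} to {worst} minutes of downtime")
```

Sizing the window to the worst case helps avoid cutting power or rescheduling while a flash is still in progress.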

How to monitor

Track the firmware version of every GPU (nvidia-smi --query-gpu=vbios_version --format=csv) and every NVSwitch so that stale devices stand out. After each post-flash cold reboot, run dcgmi diag -r 2 and an NVLink bandwidth check before returning the node to production. Factryze tracks firmware versions fleet-wide, schedules updates at natural job boundaries, and automatically quarantines devices that fail post-flash validation so they can be inspected manually.
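A minimal sketch of the version-tracking step, assuming a per-model baseline that you maintain yourself. The query fields (index, uuid, name, vbios_version) are standard nvidia-smi properties; the function names, the baseline dictionary, and the version strings are hypothetical placeholders.

```python
import csv
import io
import subprocess

QUERY_FIELDS = "index,uuid,name,vbios_version"


def query_firmware() -> str:
    """Return raw CSV from nvidia-smi (requires an NVIDIA driver on the node)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_stale(csv_text: str, baseline: dict) -> list:
    """Return GPUs whose VBIOS version differs from the baseline for their model.

    baseline maps a GPU product name to the expected VBIOS version string.
    GPUs whose model is absent from the baseline are skipped, not flagged.
    """
    stale = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 4:
            continue  # ignore blank or malformed lines
        index, uuid, name, vbios = (field.strip() for field in row)
        expected = baseline.get(name)
        if expected is not None and vbios != expected:
            stale.append({"index": index, "uuid": uuid, "name": name,
                          "vbios": vbios, "expected": expected})
    return stale
```

On a GPU node this would be called as find_stale(query_firmware(), baseline). Keeping the parsing separate from the nvidia-smi call lets the comparison logic be checked against captured output without GPU hardware.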

Figure: GPU firmware update process.
