GPU Monitoring Glossary
48 terms across GPU errors, networking, cluster management, monitoring metrics, and operations.
Errors & Failures
10 terms: GPU error types, failure modes, and diagnostic codes
CUDA Errors
CUDA runtime and driver API error codes indicating GPU compute failures.
Driver Crash
GPU kernel driver panic or hang requiring intervention to recover.
ECC Errors (Error-Correcting Code)
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
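A minimal sketch of reading volatile ECC counters with the NVML Python bindings (pynvml, shipped as nvidia-ml-py); the device index and the policy of alerting only on double-bit errors are assumptions for illustration, and GPUs without ECC will raise a not-supported error.

```python
# Sketch: read volatile single-bit and double-bit ECC counts via NVML (pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0

# Volatile counters reset on reboot/driver reload; aggregate counters persist.
sbe = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED, pynvml.NVML_VOLATILE_ECC)
dbe = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)

print(f"volatile SBE={sbe} DBE={dbe}")
if dbe > 0:
    print("double-bit errors present: drain node and investigate")  # assumed policy
pynvml.nvmlShutdown()
```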
GPU Fallen Off Bus
Xid 79 error: GPU completely disconnects from the PCIe bus.
NCCL Errors
Collective communication failures in NVIDIA NCCL stalling distributed training.
NVLink Errors
CRC errors and replay events on NVLink GPU-to-GPU connections.
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
Page Retirement
GPU firmware permanently disabling faulty memory pages after ECC errors.
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
Row Remapping
Dynamic HBM repair mechanism replacing faulty memory rows on the fly.
DCGM_FI_DEV_ROW_REMAP_FAILURE / DCGM_FI_DEV_ROW_REMAP_PENDING
Uncorrectable Errors (DBE)
Double-bit ECC errors that corrupt data and halt computation.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
Xid Errors
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.
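As a rough illustration, Xid events can be pulled out of the kernel log and mapped to a few well-known codes; the code-to-meaning table below is a small assumed subset, and parsing dmesg this way is a sketch rather than a robust detector.

```python
# Sketch: scan the kernel log for NVRM Xid entries and flag known codes.
import re
import subprocess

# Assumed subset of Xid meanings; consult NVIDIA's Xid documentation for the full list.
KNOWN_XIDS = {
    13: "graphics engine exception",
    48: "double-bit ECC error",
    63: "page retirement / row remapping event",
    79: "GPU fallen off the bus",
}

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    m = re.search(r"NVRM: Xid \(([^)]+)\): (\d+)", line)
    if m:
        xid = int(m.group(2))
        meaning = KNOWN_XIDS.get(xid, "unrecognized code")
        print(f"Xid {xid} on {m.group(1)}: {meaning}")
```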
Networking
10 terms: GPU interconnects, fabric, and communication protocols
Adaptive Routing
Dynamic path selection in network switches to avoid congestion.
GPUDirect RDMA
Direct GPU memory access across the network, bypassing CPU copies.
InfiniBand
High-bandwidth, low-latency network fabric for GPU clusters.
Network Fabric
The physical interconnect topology connecting all nodes in a cluster.
NVLink
NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
NVSwitch
NVIDIA's NVLink switch enabling all-to-all GPU communication.
Packet Drops
Lost network packets indicating congestion or hardware errors.
PCIe (PCI Express)
The host bus connecting GPUs to CPUs and other system devices.
DCGM_FI_DEV_PCIE_TX_THROUGHPUT / DCGM_FI_DEV_PCIE_RX_THROUGHPUT
RoCE (RDMA over Converged Ethernet)
RDMA networking over Ethernet for GPU cluster communication.
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)
In-network compute for accelerating collective operations.
Cluster Management
8 terms: Scheduling, partitioning, and orchestration
Gang Scheduling
Atomic co-scheduling of all GPUs for distributed training requiring synchronized start.
GPU Partitioning
Sharing a single GPU across workloads via MIG, MPS, or time-slicing mechanisms.
Job Scheduling
Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
MIG (Multi-Instance GPU)
Hardware partitioning on A100/H100 GPUs creating up to seven isolated GPU instances.
Node Draining
Gracefully removing a node from scheduling via kubectl drain or Slurm DRAIN state.
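A minimal sketch of draining a Kubernetes node before GPU maintenance by calling kubectl from Python; the node name and the exact flags used here are assumptions, and production drains typically go through the cluster's own automation.

```python
# Sketch: cordon and drain a node ahead of GPU maintenance using kubectl.
import subprocess

node = "gpu-node-042"  # hypothetical node name

# Cordon first so no new pods land while existing ones are evicted.
subprocess.run(["kubectl", "cordon", node], check=True)
subprocess.run(
    ["kubectl", "drain", node,
     "--ignore-daemonsets",        # leave monitoring/CNI daemonsets in place
     "--delete-emptydir-data",     # allow eviction of pods using emptyDir
     "--timeout", "10m"],
    check=True,
)
print(f"{node} drained; safe to start maintenance")
```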
Preemption
Forcibly stopping lower-priority GPU jobs with checkpoint/restart to free resources.
Slurm
Open-source HPC workload manager scheduling GPU cluster jobs via srun, sbatch, and squeue.
Topology-Aware Placement
Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
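A quick way to see the information a topology-aware scheduler reasons about is the interconnect and affinity matrix from nvidia-smi topo -m; the sketch below only dumps it, and how a scheduler consumes it is left out.

```python
# Sketch: print the GPU interconnect/affinity matrix a scheduler would consult.
import subprocess

# Output shows NVLink (NVx) and PCIe (PIX/PXB/PHB/SYS) relationships
# between GPUs, plus CPU and NUMA affinity per device.
matrix = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True, check=True).stdout
print(matrix)
```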
Monitoring Metrics
12 terms: GPU health metrics, thresholds, and telemetry
DCGM (Data Center GPU Manager)
NVIDIA's GPU management toolkit exposing health metrics via field IDs.
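A hedged sketch of sampling a few DCGM fields from the command line through dcgmi dmon; the numeric field IDs below (150 for GPU temperature, 155 for power usage, 203 for GPU utilization) are assumptions from memory and should be checked against dcgm_fields.h for the installed DCGM release.

```python
# Sketch: sample a few DCGM fields via the dcgmi CLI and stream them to stdout.
import subprocess

# Field IDs are assumptions (150 = GPU temp, 155 = power usage, 203 = GPU util);
# verify against dcgm_fields.h for your DCGM version.
fields = "150,155,203"

subprocess.run(
    ["dcgmi", "dmon", "-e", fields, "-d", "1000", "-c", "10"],  # 10 samples, 1 s apart
    check=True,
)
```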
Fan Speed
GPU or chassis cooling fan speed as a percentage of maximum RPM.
DCGM_FI_DEV_FAN_SPEED
GPU Monitoring
Continuous tracking of GPU health, thermals, errors, and performance metrics.
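A minimal polling loop over NVML as one lightweight way to do this; the 5-second interval and the fields sampled are arbitrary choices, and DCGM is the more common path at cluster scale.

```python
# Sketch: poll basic GPU health via NVML (pynvml) every few seconds.
import time
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)      # .gpu / .memory in %
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0   # mW -> W
            print(f"gpu{i} util={util.gpu}% temp={temp}C power={power:.0f}W")
        time.sleep(5)  # arbitrary polling interval
finally:
    pynvml.nvmlShutdown()
```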
GPU Utilization
Percentage of time GPU streaming multiprocessors are actively executing kernels.
DCGM_FI_DEV_GPU_UTIL
Memory Clock
GPU HBM/GDDR memory frequency in MHz that determines memory bandwidth.
DCGM_FI_DEV_MEM_CLOCK
Memory Utilization
Percentage of GPU framebuffer memory allocated by active workloads.
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE
PCIe Bandwidth
Measured GPU-to-host data transfer rate over PCI Express in GB/s.
DCGM_FI_DEV_PCIE_TX_THROUGHPUT / DCGM_FI_DEV_PCIE_RX_THROUGHPUT
Power Capping
Limiting GPU power draw below TDP to control thermals and rack density.
DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_ENFORCED_POWER_LIMIT
Retired Pages
Cumulative count of permanently disabled GPU memory pages in InfoROM.
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
SM Clock (Streaming Multiprocessor Clock)
GPU core compute clock frequency in MHz, scaling between base and boost.
DCGM_FI_DEV_SM_CLOCK
TDP (Thermal Design Power)
Maximum sustained GPU power dissipation rating, measured in watts.
DCGM_FI_DEV_ENFORCED_POWER_LIMIT
Thermal Throttling
Automatic GPU clock reduction when die temperature exceeds safe limits, typically 83–90 °C.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS
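A sketch of checking active throttle reasons through NVML; the bitmask constants used are the ones exposed by pynvml, and singling out thermal and power causes is an assumed alerting policy.

```python
# Sketch: decode the current clock throttle reasons bitmask via NVML (pynvml).
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0

reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
if reasons & pynvml.nvmlClocksThrottleReasonSwThermalSlowdown:
    print("software thermal slowdown active")
if reasons & pynvml.nvmlClocksThrottleReasonHwThermalSlowdown:
    print("hardware thermal slowdown active")
if reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap:
    print("power cap is limiting clocks")
pynvml.nvmlShutdown()
```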
Operations
8 terms: Maintenance, remediation, and operational procedures
AIOps (AI for IT Operations)
AI-driven GPU infrastructure operations moving beyond traditional alerting to autonomous remediation.
Driver Reload
Reloading nvidia.ko via rmmod/modprobe to clear driver state without a full reboot.
Firmware Update
Updating GPU InfoROM, VBIOS, and NVSwitch firmware during scheduled maintenance windows.
GPU Reset
Hardware GPU reset via nvidia-smi -r with escalation to ipmitool or cold reboot.
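A sketch of that escalation logic, assuming a maintenance context where the GPU is already idle; the GPU index and the fallback to a BMC power cycle via ipmitool are illustrative choices.

```python
# Sketch: try an in-band GPU reset, escalating to an out-of-band power cycle if it fails.
import subprocess

def reset_gpu(gpu_id: str = "0") -> bool:
    """Attempt nvidia-smi reset; the GPU must be idle (no processes attached)."""
    r = subprocess.run(["nvidia-smi", "-r", "-i", gpu_id])
    return r.returncode == 0

if not reset_gpu():
    # Escalation path (assumed): power-cycle the node over the BMC.
    subprocess.run(["ipmitool", "chassis", "power", "cycle"])
    print("in-band reset failed; issued BMC power cycle")
else:
    print("GPU reset succeeded")
```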
Health Check
DCGM diagnostic tests (Level 1/2/3) validating GPU hardware integrity between jobs.
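A sketch of running a DCGM diagnostic between jobs via the dcgmi CLI; level 1 is the quick check, and gating scheduling on a non-zero exit code is an assumed policy rather than part of DCGM itself.

```python
# Sketch: run a quick DCGM diagnostic and report pass/fail before admitting the next job.
import subprocess

# -r selects the diagnostic level: 1 (quick), 2 (medium), 3 (long/stress).
result = subprocess.run(["dcgmi", "diag", "-r", "1"], capture_output=True, text=True)

print(result.stdout)
if result.returncode != 0:
    print("health check failed; keep node out of scheduling")  # assumed gating policy
else:
    print("health check passed")
```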
MTTR (Mean Time to Resolution)
Average time to resolve a GPU issue, around 47 minutes across detection, diagnosis, and repair.
Rolling Restart
Sequential node restarts for kernel updates and driver upgrades while maintaining cluster capacity.
Runbook
Executable remediation procedures with conditional logic and approval gates for GPU issues.
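A toy sketch of what an executable runbook with conditional logic and an approval gate might look like in code; the step names, thresholds, and the require_approval prompt are all invented for illustration.

```python
# Sketch: a tiny runbook for an ECC-error alert, with a manual approval gate
# before the disruptive step. All names and thresholds are illustrative.

def require_approval(action: str) -> bool:
    """Approval gate: in practice this would page an operator or open a ticket."""
    answer = input(f"approve '{action}'? [y/N] ")
    return answer.strip().lower() == "y"

def run_ecc_runbook(dbe_count: int, node: str) -> None:
    if dbe_count == 0:
        print(f"{node}: no double-bit errors, nothing to do")
        return
    print(f"{node}: {dbe_count} double-bit ECC errors detected")
    # Non-disruptive steps run automatically.
    print("step 1: capture nvidia-smi -q and dmesg for the ticket")
    # Disruptive steps are gated on human approval.
    if require_approval(f"drain and reset GPU on {node}"):
        print("step 2: drain node, reset GPU, re-run health check")
    else:
        print("approval denied; escalating to on-call")

run_ecc_runbook(dbe_count=2, node="gpu-node-042")  # hypothetical invocation
```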