GPU Monitoring Glossary
48 terms across GPU errors, networking, cluster management, monitoring metrics, and operations.
Errors & Failures
10 terms: GPU error types, failure modes, and diagnostic codes
CUDA Errors
CUDA runtime and driver API error codes indicating GPU compute failures.
Driver Crash
GPU kernel driver panic or hang requiring intervention to recover.
ECC Errors (Error-Correcting Code)
GPU memory bit-flip errors detected via hardware ECC, signaling degradation.
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
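A minimal sketch of reading volatile ECC counters with the NVML Python bindings (pynvml, shipped as nvidia-ml-py); the device index and the policy of alerting only on double-bit errors are assumptions for illustration, and GPUs without ECC will raise a not-supported error.

```python
# Sketch: read volatile single-bit and double-bit ECC counts via NVML (pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0

# Volatile counters reset on reboot/driver reload; aggregate counters persist.
sbe = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED, pynvml.NVML_VOLATILE_ECC)
dbe = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)

print(f"volatile SBE={sbe} DBE={dbe}")
if dbe > 0:
    print("double-bit errors present: drain node and investigate")  # assumed policy
pynvml.nvmlShutdown()
```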
GPU Fallen Off Bus
Xid 79 error: GPU completely disconnects from the PCIe bus.
NCCL Errors
Collective communication failures in NVIDIA NCCL stalling distributed training.
NVLink Errors
CRC errors and replay events on NVLink GPU-to-GPU connections.
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
Page Retirement
GPU firmware permanently disabling faulty memory pages after ECC errors.
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
Row Remapping
Dynamic HBM repair mechanism replacing faulty memory rows on the fly.
DCGM_FI_DEV_ROW_REMAP_FAILURE / DCGM_FI_DEV_ROW_REMAP_PENDING
Uncorrectable Errors (DBE)
Double-bit ECC errors that corrupt data and halt computation.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
Xid Errors
NVIDIA kernel-logged Xid error codes identifying specific GPU failure modes.
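As a rough illustration, Xid events can be pulled out of the kernel log and mapped to a few well-known codes; the code-to-meaning table below is a small assumed subset, and parsing dmesg this way is a sketch rather than a robust detector.

```python
# Sketch: scan the kernel log for NVRM Xid entries and flag known codes.
import re
import subprocess

# Assumed subset of Xid meanings; consult NVIDIA's Xid documentation for the full list.
KNOWN_XIDS = {
    13: "graphics engine exception",
    48: "double-bit ECC error",
    63: "page retirement / row remapping event",
    79: "GPU fallen off the bus",
}

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    m = re.search(r"NVRM: Xid \(([^)]+)\): (\d+)", line)
    if m:
        xid = int(m.group(2))
        meaning = KNOWN_XIDS.get(xid, "unrecognized code")
        print(f"Xid {xid} on {m.group(1)}: {meaning}")
```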
Networking
10 terms: GPU interconnects, fabric, and communication protocols
Adaptive Routing
Dynamic path selection in network switches to avoid congestion.
GPUDirect RDMA
Direct GPU memory access across the network, bypassing CPU copies.
InfiniBand
High-bandwidth, low-latency network fabric for GPU clusters.
Network Fabric
The physical interconnect topology connecting all nodes in a cluster.
NVLink
NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
NVSwitch
NVIDIA's NVLink switch enabling all-to-all GPU communication.
Packet Drops
Lost network packets indicating congestion or hardware errors.
PCIe (PCI Express)
The host bus connecting GPUs to CPUs and other system devices.
DCGM_FI_DEV_PCIE_TX_THROUGHPUT / DCGM_FI_DEV_PCIE_RX_THROUGHPUT
RoCE (RDMA over Converged Ethernet)
RDMA networking over Ethernet for GPU cluster communication.
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)
In-network compute for accelerating collective operations.
Cluster Management
8 terms: Scheduling, partitioning, and orchestration
Gang Scheduling
Atomic co-scheduling of all GPUs for distributed training requiring synchronized start.
GPU Partitioning
Sharing a single GPU across workloads via MIG, MPS, or time-slicing mechanisms.
Job Scheduling
Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
MIG (Multi-Instance GPU)
Hardware partitioning on A100/H100 GPUs creating up to seven isolated GPU instances.
Node Draining
Gracefully removing a node from scheduling via kubectl drain or Slurm DRAIN state.
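A minimal sketch of draining a Kubernetes node before GPU maintenance by calling kubectl from Python; the node name and the exact flags used here are assumptions, and production drains typically go through the cluster's own automation.

```python
# Sketch: cordon and drain a node ahead of GPU maintenance using kubectl.
import subprocess

node = "gpu-node-042"  # hypothetical node name

# Cordon first so no new pods land while existing ones are evicted.
subprocess.run(["kubectl", "cordon", node], check=True)
subprocess.run(
    ["kubectl", "drain", node,
     "--ignore-daemonsets",        # leave monitoring/CNI daemonsets in place
     "--delete-emptydir-data",     # allow eviction of pods using emptyDir
     "--timeout", "10m"],
    check=True,
)
print(f"{node} drained; safe to start maintenance")
```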
Preemption
Forcibly stopping lower-priority GPU jobs with checkpoint/restart to free resources.
Slurm
Open-source HPC workload manager scheduling GPU cluster jobs via srun, sbatch, and squeue.
Topology-Aware Placement
Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
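A quick way to see the information a topology-aware scheduler reasons about is the interconnect and affinity matrix from nvidia-smi topo -m; the sketch below only dumps it, and how a scheduler consumes it is left out.

```python
# Sketch: print the GPU interconnect/affinity matrix a scheduler would consult.
import subprocess

# Output shows NVLink (NVx) and PCIe (PIX/PXB/PHB/SYS) relationships
# between GPUs, plus CPU and NUMA affinity per device.
matrix = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True, check=True).stdout
print(matrix)
```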
Monitoring Metrics
12 terms: GPU health metrics, thresholds, and telemetry
DCGM (Data Center GPU Manager)
NVIDIA's GPU management toolkit exposing health metrics via field IDs.
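A hedged sketch of sampling a few DCGM fields from the command line through dcgmi dmon; the numeric field IDs below (150 for GPU temperature, 155 for power usage, 203 for GPU utilization) are assumptions from memory and should be checked against dcgm_fields.h for the installed DCGM release.

```python
# Sketch: sample a few DCGM fields via the dcgmi CLI and stream them to stdout.
import subprocess

# Field IDs are assumptions (150 = GPU temp, 155 = power usage, 203 = GPU util);
# verify against dcgm_fields.h for your DCGM version.
fields = "150,155,203"

subprocess.run(
    ["dcgmi", "dmon", "-e", fields, "-d", "1000", "-c", "10"],  # 10 samples, 1 s apart
    check=True,
)
```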
Fan Speed
GPU or chassis cooling fan speed as a percentage of maximum RPM.
DCGM_FI_DEV_FAN_SPEED
GPU Monitoring
Continuous tracking of GPU health, thermals, errors, and performance metrics.
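A minimal polling loop over NVML as one lightweight way to do this; the 5-second interval and the fields sampled are arbitrary choices, and DCGM is the more common path at cluster scale.

```python
# Sketch: poll basic GPU health via NVML (pynvml) every few seconds.
import time
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)      # .gpu / .memory in %
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0   # mW -> W
            print(f"gpu{i} util={util.gpu}% temp={temp}C power={power:.0f}W")
        time.sleep(5)  # arbitrary polling interval
finally:
    pynvml.nvmlShutdown()
```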
GPU Utilization
Percentage of time GPU streaming multiprocessors are actively executing kernels.
DCGM_FI_DEV_GPU_UTIL
Memory Clock
GPU HBM/GDDR memory frequency in MHz that determines memory bandwidth.
DCGM_FI_DEV_MEM_CLOCK
Memory Utilization
Percentage of GPU framebuffer memory allocated by active workloads.
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE
PCIe Bandwidth
Measured GPU-to-host data transfer rate over PCI Express in GB/s.
DCGM_FI_DEV_PCIE_TX_THROUGHPUT / DCGM_FI_DEV_PCIE_RX_THROUGHPUT
Power Capping
Limiting GPU power draw below TDP to control thermals and rack density.
DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_ENFORCED_POWER_LIMIT
Retired Pages
Cumulative count of permanently disabled GPU memory pages in InfoROM.
DCGM_FI_DEV_RETIRED_SBE / DCGM_FI_DEV_RETIRED_DBE
SM Clock (Streaming Multiprocessor Clock)
GPU core compute clock frequency in MHz, scaling between base and boost.
DCGM_FI_DEV_SM_CLOCK
TDP (Thermal Design Power)
Maximum sustained GPU power dissipation rating, measured in watts.
DCGM_FI_DEV_ENFORCED_POWER_LIMIT
Thermal Throttling
Automatic GPU clock reduction when die temperature exceeds safe limits, typically 83–90 °C.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS
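A sketch of checking active throttle reasons through NVML; the bitmask constants used are the ones exposed by pynvml, and singling out thermal and power causes is an assumed alerting policy.

```python
# Sketch: decode the current clock throttle reasons bitmask via NVML (pynvml).
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0

reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
if reasons & pynvml.nvmlClocksThrottleReasonSwThermalSlowdown:
    print("software thermal slowdown active")
if reasons & pynvml.nvmlClocksThrottleReasonHwThermalSlowdown:
    print("hardware thermal slowdown active")
if reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap:
    print("power cap is limiting clocks")
pynvml.nvmlShutdown()
```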
Operations
8 terms: Maintenance, remediation, and operational procedures
AIOps (AI for IT Operations)
AI-driven GPU infrastructure operations moving beyond traditional alerting to autonomous remediation.
Driver Reload
Reloading nvidia.ko via rmmod/modprobe to clear driver state without a full reboot.
Firmware Update
Updating GPU InfoROM, VBIOS, and NVSwitch firmware during scheduled maintenance windows.
GPU Reset
Hardware GPU reset via nvidia-smi -r with escalation to ipmitool or cold reboot.
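A sketch of that escalation logic, assuming a maintenance context where the GPU is already idle; the GPU index and the fallback to a BMC power cycle via ipmitool are illustrative choices.

```python
# Sketch: try an in-band GPU reset, escalating to an out-of-band power cycle if it fails.
import subprocess

def reset_gpu(gpu_id: str = "0") -> bool:
    """Attempt nvidia-smi reset; the GPU must be idle (no processes attached)."""
    r = subprocess.run(["nvidia-smi", "-r", "-i", gpu_id])
    return r.returncode == 0

if not reset_gpu():
    # Escalation path (assumed): power-cycle the node over the BMC.
    subprocess.run(["ipmitool", "chassis", "power", "cycle"])
    print("in-band reset failed; issued BMC power cycle")
else:
    print("GPU reset succeeded")
```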
Health Check
DCGM diagnostic tests (Level 1/2/3) validating GPU hardware integrity between jobs.
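A sketch of running a DCGM diagnostic between jobs via the dcgmi CLI; level 1 is the quick check, and gating scheduling on a non-zero exit code is an assumed policy rather than part of DCGM itself.

```python
# Sketch: run a quick DCGM diagnostic and report pass/fail before admitting the next job.
import subprocess

# -r selects the diagnostic level: 1 (quick), 2 (medium), 3 (long/stress).
result = subprocess.run(["dcgmi", "diag", "-r", "1"], capture_output=True, text=True)

print(result.stdout)
if result.returncode != 0:
    print("health check failed; keep node out of scheduling")  # assumed gating policy
else:
    print("health check passed")
```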
MTTR (Mean Time to Resolution)
Average time to resolve a GPU issue, around 47 minutes across detection, diagnosis, and repair.
Rolling Restart
Sequential node restarts for kernel updates and driver upgrades while maintaining cluster capacity.
Runbook
Executable remediation procedures with conditional logic and approval gates for GPU issues.
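A toy sketch of what an executable runbook with conditional logic and an approval gate might look like in code; the step names, thresholds, and the require_approval prompt are all invented for illustration.

```python
# Sketch: a tiny runbook for an ECC-error alert, with a manual approval gate
# before the disruptive step. All names and thresholds are illustrative.

def require_approval(action: str) -> bool:
    """Approval gate: in practice this would page an operator or open a ticket."""
    answer = input(f"approve '{action}'? [y/N] ")
    return answer.strip().lower() == "y"

def run_ecc_runbook(dbe_count: int, node: str) -> None:
    if dbe_count == 0:
        print(f"{node}: no double-bit errors, nothing to do")
        return
    print(f"{node}: {dbe_count} double-bit ECC errors detected")
    # Non-disruptive steps run automatically.
    print("step 1: capture nvidia-smi -q and dmesg for the ticket")
    # Disruptive steps are gated on human approval.
    if require_approval(f"drain and reset GPU on {node}"):
        print("step 2: drain node, reset GPU, re-run health check")
    else:
        print("approval denied; escalating to on-call")

run_ecc_runbook(dbe_count=2, node="gpu-node-042")  # hypothetical invocation
```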