GPU Utilization
Percentage of time GPU streaming multiprocessors are actively executing kernels.
What it is
GPU utilization measures the percentage of time during a sampling window in which the GPU's streaming multiprocessors are actively executing at least one kernel, reported as a 0-100% value via DCGM_FI_DEV_GPU_UTIL. It measures temporal occupancy, not computational efficiency -- a single kernel occupying one SM counts the same as a kernel saturating all of them, and a memory-bound kernel can show 100% utilization while leaving the majority of CUDA cores idle. Average GPU utilization across production data center clusters typically ranges from 30% to 60%.
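As an illustration, the underlying counter can be sampled directly through NVML, which backs the same device-level utilization figure DCGM exports as DCGM_FI_DEV_GPU_UTIL. This is a minimal Python sketch, assuming a host with NVIDIA GPUs and the nvidia-ml-py bindings installed; the one-second loop is an arbitrary choice of sampling window.

```python
# Minimal sketch: sample device-level GPU utilization via NVML, the
# counter that DCGM surfaces as DCGM_FI_DEV_GPU_UTIL.
# Assumes the nvidia-ml-py package: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    for _ in range(10):                      # ten one-second samples
        for idx, handle in enumerate(handles):
            rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
            # rates.gpu is temporal occupancy (0-100): the share of the
            # window in which any kernel was resident, not how many
            # CUDA cores that kernel actually kept busy.
            print(f"gpu{idx}: util={rates.gpu}%")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```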
Why it matters
Sustained GPU utilization below 15% on an allocated GPU indicates workload misconfiguration, a stalled training process, or a GPU that has silently dropped out of a distributed job while its process remains alive. During a 256-GPU training run, one GPU falling from 95% to 8% while its peers hold at 95% signals a data pipeline stall or NCCL hang -- and because collective operations block on the slowest rank, every second the stall goes undetected wastes the time of all 256 GPUs. High utilization does not guarantee efficiency, either: a throttled GPU can report 95-100% utilization while delivering 30% less throughput than a healthy peer.
How to monitor
Track DCGM_FI_DEV_GPU_UTIL continuously and alert on intra-job utilization divergence -- a single rank dropping 20+ percentage points below its peers is a reliable straggler signal. Correlate with DCGM_FI_DEV_SM_CLOCK to distinguish true idle from throttled states: low utilization at a healthy clock suggests a stalled data pipeline, while high utilization at a depressed clock suggests thermal or power throttling. Factryze's Performance Agent monitors utilization in real time, correlates it with SM clock and memory bandwidth, and alerts teams to anomalies that indicate wasted capacity or degraded jobs.
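As a concrete illustration, that rule can be written as a small classifier. The Python sketch below is assumption-laden: the classify_ranks function, the 20-point divergence threshold, and the 90%-of-median clock cutoff are hypothetical choices for illustration, not part of DCGM or of Factryze's alerting logic.

```python
# Hedged sketch of the divergence rule above -- not Factryze's actual
# logic. Thresholds are illustrative assumptions; inputs are per-rank
# samples of DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_SM_CLOCK.
from statistics import median

DIVERGENCE_PTS = 20        # flag a rank 20+ points below the peer median
THROTTLE_CLOCK_FRAC = 0.9  # flag a clock below 90% of the peer median

def classify_ranks(util_pct: dict[int, float],
                   sm_clock_mhz: dict[int, float]) -> dict[int, str]:
    """Label each rank 'ok', 'stalled', or 'throttled'."""
    util_med = median(util_pct.values())
    clock_med = median(sm_clock_mhz.values())
    labels = {}
    for rank in util_pct:
        if util_pct[rank] < util_med - DIVERGENCE_PTS:
            labels[rank] = "stalled"    # idle: data stall or NCCL hang
        elif sm_clock_mhz[rank] < THROTTLE_CLOCK_FRAC * clock_med:
            labels[rank] = "throttled"  # busy but down-clocked
        else:
            labels[rank] = "ok"
    return labels

# Rank 3 has stalled; rank 1 is throttled despite high utilization.
print(classify_ranks(
    {0: 95, 1: 96, 2: 94, 3: 8},
    {0: 1980, 1: 1400, 2: 1980, 3: 1975},
))
# -> {0: 'ok', 1: 'throttled', 2: 'ok', 3: 'stalled'}
```

In production the same logic is more naturally expressed as an alerting rule over exported DCGM metrics than as application code, but the decision structure is the same.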
Related terms
SM clock -- GPU core compute clock frequency in MHz, scaling between base and boost.
GPU memory utilization -- percentage of GPU framebuffer memory allocated by active workloads.
GPU monitoring -- continuous tracking of GPU health, thermals, errors, and performance metrics.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.