GPU Utilization Optimization: How to Push from 50% to 90%
Practical guide to improving GPU cluster utilization. Identify scheduling gaps, memory fragmentation, thermal headroom, and power capping tradeoffs to maximize throughput and reduce waste.
The average GPU cluster runs at 40-60% utilization. On a 1,000-GPU cluster of H100 SXMs, that means 400-600 GPUs' worth of compute sits idle at any given moment - roughly $15-25 million per year in wasted capital expenditure. GPU utilization optimization is not about squeezing marginal gains from individual kernels. It is about identifying and eliminating the systemic gaps - scheduling inefficiencies, memory fragmentation, thermal constraints, and power limits - that keep your cluster running at half capacity.
This guide breaks down the five most common causes of low GPU utilization in production clusters and provides concrete steps to address each one.
What Does GPU Utilization Actually Measure?
Before optimizing utilization, you need to understand what DCGM_FI_DEV_GPU_UTIL actually reports. This metric measures temporal occupancy - the percentage of time during a sampling window in which at least one GPU kernel is executing on the streaming multiprocessors. It does not measure how efficiently those kernels use the GPU's compute resources.
A memory-bound kernel that stalls waiting for HBM data can show 100% GPU utilization while leaving the majority of CUDA cores idle. Conversely, a highly optimized kernel that finishes quickly and then waits for the next batch from the data pipeline shows 40% utilization despite being computationally efficient.
This distinction matters because different utilization problems require different fixes:
- Low temporal utilization (GPU idle between kernels): scheduling, pipeline, and orchestration problems
- Low compute efficiency (GPU busy but underperforming): kernel optimization, memory bandwidth, and hardware constraints
- Low allocation utilization (GPUs allocated but not used): capacity planning and workload management problems
The metric you see in Grafana or your GPU monitoring dashboard is temporal utilization. The other two require correlating utilization with SM clock, memory bandwidth, and workload-level metrics.
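As a rough illustration of that correlation, the sketch below polls utilization and SM clock together through the nvidia-ml-py (pynvml) bindings as a stand-in for the DCGM fields; the 60% and 90% thresholds are illustrative assumptions, not fixed rules.

```python
# Separate "GPU idle between kernels" from "GPU busy but running slow".
# Requires the nvidia-ml-py package (imported as pynvml) and driver access on the node.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)                     # .gpu and .memory, percent
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)  # current SM clock, MHz
    sm_max = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)

    if util.gpu < 60:
        # Temporal gap: look at scheduling, data pipeline, or orchestration
        print(f"GPU {i}: kernels active only {util.gpu}% of the sample window")
    elif sm_clock < 0.9 * sm_max:
        # Busy but below boost clock: look at thermal or power constraints
        print(f"GPU {i}: busy but running at {sm_clock} MHz vs {sm_max} MHz max clock")
pynvml.nvmlShutdown()
```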
How Do Scheduling Gaps Kill GPU Utilization?
Scheduling inefficiency is the single largest contributor to low cluster utilization. In a typical Slurm or Kubernetes cluster, GPUs sit idle in three phases: between jobs (queue gaps), during job startup (initialization overhead), and during job teardown (cleanup and checkpointing).
Queue Gaps and Fragmentation
When a 64-GPU training job finishes but the next queued job needs 128 GPUs, those 64 GPUs sit idle until enough additional GPUs free up to satisfy the larger request. This is resource fragmentation - available GPUs are scattered across the cluster in chunks too small for the next pending job.
Research from 2025 shows that resource fragmentation alone can reduce cluster utilization by 15-25%. Dynamic multi-objective scheduling approaches like Hybrid Priority Scheduling have demonstrated the ability to push utilization to 78% by combining priority-based allocation with gap-filling backfill that places smaller jobs into the idle fragments.
What to do about it:
- Enable backfill scheduling in Slurm (SchedulerType=sched/backfill) so smaller jobs can fill gaps while large jobs wait for resources
- Set realistic walltime limits on all jobs - overly generous time limits prevent the backfill scheduler from identifying usable gaps
- Consider gang scheduling with elastic job sizing, where frameworks like PyTorch Elastic can scale to whatever GPU count is currently available
- Monitor the ratio of pending-to-running jobs by GPU count requested - a persistent backlog of large jobs with available small fragments signals a fragmentation problem
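As a rough illustration of the last point, the sketch below builds a histogram of requested GPU counts for pending versus running jobs from squeue output; the %b field format varies across Slurm versions, so treat the parsing as an assumption to adapt.

```python
# Histogram of requested GPU counts for pending vs running jobs.
# A persistent backlog of large requests while small ones flow through freely
# is a fragmentation signal. Assumes Slurm's squeue CLI; the %b (tres-per-node)
# output format differs across Slurm versions.
import re
import subprocess
from collections import Counter

def gpu_request_histogram(state: str) -> Counter:
    out = subprocess.run(
        ["squeue", "-h", "-t", state, "-o", "%b"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Typical values look like "gres:gpu:8" or "gres/gpu:8"; non-GPU jobs show "N/A".
    return Counter(int(m.group(1)) for m in re.finditer(r"gpu[:=](\d+)", out))

pending = gpu_request_histogram("PENDING")
running = gpu_request_histogram("RUNNING")
for size in sorted(set(pending) | set(running)):
    print(f"{size:>4} GPUs/job: pending={pending[size]:>3}  running={running[size]:>3}")
```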
Data Loading and Pipeline Stalls
A training job that shows 95% GPU utilization during compute steps but drops to 0% for 200ms between steps has a data pipeline stall. The GPU finishes processing a batch faster than the CPU can prepare the next one. Across thousands of training steps, these micro-gaps accumulate into significant utilization loss.
Common causes include:
- Insufficient DataLoader workers (PyTorch's default of num_workers=0 loads batches in the main process, which is almost never enough for multi-GPU training)
- Data stored on network filesystems with high latency
- CPU preprocessing bottlenecks (image augmentation, tokenization)
- Synchronous data loading instead of prefetching
What to do about it:
- Increase DataLoader num_workers to 4-8 per GPU and enable pin_memory=True
- Use prefetch_factor=2 or higher to overlap data loading with GPU compute
- Move training data to local NVMe storage or a high-throughput parallel filesystem
- Profile with PyTorch Profiler or Nsight Systems to visualize the gap between data loading and kernel execution
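A minimal sketch of those DataLoader settings is below; the dataset is a stand-in, and the worker count and prefetch depth are starting points to tune against available CPU cores rather than universal values.

```python
# DataLoader configured to overlap host-side batch preparation with GPU compute.
# Worker count and prefetch depth are illustrative starting points.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; replace with the real training dataset.
train_dataset = TensorDataset(
    torch.randn(2_048, 3, 224, 224),
    torch.randint(0, 1000, (2_048,)),
)

loader = DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,             # roughly 4-8 per GPU; bounded by CPU cores and RAM
    pin_memory=True,           # page-locked host memory speeds up host-to-device copies
    prefetch_factor=4,         # batches each worker keeps prepared ahead of the GPU
    persistent_workers=True,   # avoid re-forking workers at every epoch boundary
)

for images, labels in loader:
    images = images.to("cuda", non_blocking=True)   # async copy overlaps with compute
    labels = labels.to("cuda", non_blocking=True)
    # forward / backward / optimizer step here
```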
Synchronization Barriers in Distributed Training
In data-parallel training, every GPU must complete its forward pass, backward pass, and gradient AllReduce before any GPU can start the next step. The slowest GPU in the group gates the entire job. If one GPU is 5% slower due to thermal throttling, power constraints, or a degraded NVLink, every other GPU wastes 5% of its time waiting.
This straggler effect compounds with scale. In a 256-GPU job, the probability that at least one GPU is slightly degraded approaches certainty.
What to do about it:
- Monitor per-GPU step time variance to identify stragglers before they accumulate hours of wasted compute
- Use gradient accumulation to reduce how often gradients are synchronized, and rely on bucketed gradient AllReduce (PyTorch DDP's default behavior) to overlap communication with the backward pass
- Enable NCCL async error handling so a hung or failed rank surfaces as a timeout error instead of silently stalling the entire communicator
- Implement health-based GPU selection that excludes GPUs showing early degradation signals from large training jobs
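A minimal sketch of the per-GPU step-time check from the first point above, assuming torch.distributed is already initialized (for example via torchrun) with one GPU per rank; the 3% threshold and the sampling cadence are illustrative assumptions.

```python
# Flag straggler ranks by comparing each rank's step time to the fastest rank.
# Call this every N steps rather than every step to keep overhead negligible.
import torch.distributed as dist

def check_stragglers(step_seconds: float, threshold: float = 0.03) -> None:
    times = [None] * dist.get_world_size()
    dist.all_gather_object(times, step_seconds)      # collect every rank's step time
    if dist.get_rank() == 0:
        fastest = min(times)
        for rank, t in enumerate(times):
            if t > fastest * (1 + threshold):
                print(f"rank {rank}: {100 * (t / fastest - 1):.1f}% slower than the fastest rank")

# Inside the training loop (sampled every N steps):
#   start = time.perf_counter()
#   ... forward / backward / optimizer step ...
#   torch.cuda.synchronize()                          # make queued GPU work visible to the timer
#   check_stragglers(time.perf_counter() - start)
```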
How Does Memory Fragmentation Waste GPU Capacity?
Memory utilization is the second dimension of GPU efficiency. GPUs allocated to workloads frequently have significant unused memory, representing capacity that could serve additional work.
The Over-Allocation Problem
Operators request GPU memory based on peak requirements plus a safety margin. A model that peaks at 60 GB during training might be allocated a full 80 GB H100, leaving 20 GB permanently unused. Across a 500-GPU cluster, this over-allocation pattern wastes 10 TB of aggregate GPU memory.
The situation is worse for inference workloads. A 7B parameter model running inference in FP16 needs roughly 14 GB of memory. Allocating an entire 80 GB H100 to this workload wastes 82% of the GPU's memory capacity.
What to do about it:
- Use MIG (Multi-Instance GPU) partitioning on A100 and H100 GPUs to carve them into right-sized instances. A single H100 can serve seven 1g.10gb inference workloads that would otherwise each consume a whole GPU.
- Track DCGM_FI_DEV_FB_USED alongside DCGM_FI_DEV_FB_FREE to identify GPUs where less than 50% of framebuffer is allocated - these are candidates for workload consolidation.
- Implement memory-aware scheduling that bins workloads by memory requirements and packs them onto appropriately sized GPU partitions.
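A rough sketch of the framebuffer check from the second bullet, again using pynvml as a stand-in for the DCGM fields; the 50% threshold mirrors the bullet above.

```python
# Flag GPUs where less than half the framebuffer is in use - candidates
# for consolidation onto fewer GPUs or for MIG partitioning.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)       # .used / .free / .total, in bytes
    used_frac = mem.used / mem.total
    if used_frac < 0.5:
        print(f"GPU {i}: {used_frac:.0%} of {mem.total / 2**30:.0f} GiB in use - consolidation candidate")
pynvml.nvmlShutdown()
```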
Memory Leaks and Fragmentation Over Time
Long-running inference servers and Jupyter notebooks often exhibit GPU memory leaks - gradual increases in DCGM_FI_DEV_FB_USED over hours or days without corresponding workload increases. Memory leaks eventually trigger OOM kills, which not only crash the workload but leave the GPU in a state that may require a reset to fully reclaim memory.
What to do about it:
- Monitor the derivative of DCGM_FI_DEV_FB_USED over time - a steady upward trend in a steady-state workload indicates a leak
- Set memory utilization alerts at 90% to catch leaks before they cause OOM kills
- Implement automatic workload recycling for inference servers (rolling restart every 24-48 hours)
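A minimal sketch of the trend check from the first bullet: sample used framebuffer at a fixed interval and fit a slope. The one-minute interval, one-hour window, and 1 MiB/minute threshold are illustrative assumptions.

```python
# Detect a slow GPU memory leak by fitting a linear trend to used framebuffer.
# Sampling interval, window, and slope threshold are illustrative; tune per workload.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
for _ in range(60):                                    # one hour at one sample per minute
    samples.append(pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20)   # MiB
    time.sleep(60)
pynvml.nvmlShutdown()

# Least-squares slope in MiB per minute, with no external dependencies.
n = len(samples)
x_mean, y_mean = (n - 1) / 2, sum(samples) / n
num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
den = sum((i - x_mean) ** 2 for i in range(n))
slope = num / den

if slope > 1.0:                                        # steady growth of more than 1 MiB/minute
    print(f"possible leak: framebuffer growing at roughly {slope:.1f} MiB/minute")
```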
How Does Thermal Headroom Cap GPU Utilization?
Thermal throttling silently reduces GPU throughput by capping clock frequencies when die temperature exceeds safe limits. The effect is progressive: a GPU at 85 degrees Celsius loses 5-10% of its boost clock, while a GPU approaching 90 degrees can lose 30-40%. Because DCGM_FI_DEV_GPU_UTIL still reads 95-100% during throttling (the GPU is busy, just slower), thermal throttling is invisible to basic utilization monitoring.
Identifying Thermal Constraints
The key metric is DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, a bitmask that shows exactly why a GPU's clocks are reduced. The HW Slowdown flag (0x8) indicates a hardware-enforced slowdown (triggered by thermal or power-brake events), HW Thermal Slowdown (0x40) signals a critical thermal event, and SW Thermal Slowdown (0x20) shows driver-enforced thermal policy.
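A minimal sketch of decoding that bitmask through pynvml, which exposes the same flags NVML and DCGM report; the constant names below are the long-standing pynvml ones (newer releases also expose equivalent "clocks event reason" names).

```python
# Decode why a GPU's clocks are reduced from the NVML throttle-reason bitmask,
# the same flags surfaced by DCGM_FI_DEV_CLOCK_THROTTLE_REASONS.
import pynvml

REASONS = {
    "SW power cap":            pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "HW slowdown":             pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "SW thermal slowdown":     pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "HW thermal slowdown":     pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
    "HW power brake slowdown": pynvml.nvmlClocksThrottleReasonHwPowerBrakeSlowdown,
}

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    active = [name for name, bit in REASONS.items() if mask & bit]
    if active:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {i} at {temp} C throttled by: {', '.join(active)}")
pynvml.nvmlShutdown()
```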
Common thermal patterns in data center GPU clusters:
- Time-of-day correlation: GPUs throttle more during afternoon hours when ambient temperature peaks and CRAC units struggle to maintain setpoint. This can cause a 5-15% throughput swing between morning and afternoon.
- Position-dependent hotspots: GPUs in the rear of a rack or in higher rack positions run hotter due to exhaust recirculation. In an 8-GPU DGX node, GPUs 4-7 consistently run 3-5 degrees hotter than GPUs 0-3.
- Workload-dependent thermal profiles: Mixed-precision training and large GEMM operations generate more heat than memory-bound inference workloads. A GPU that runs cool during inference may throttle immediately when switched to training.
Recovering Thermal Headroom
Short-term fixes:
- Apply power caps to thermally constrained GPUs. Reducing an H100 from 700W to 600W drops throughput by only 5-8% for memory-bound workloads but reduces heat output by 14%, often enough to eliminate throttling entirely.
- Redistribute workloads away from hot positions to GPUs with more thermal headroom.
Long-term fixes:
- Work with data center facilities to improve airflow management (hot aisle/cold aisle containment, blanking panels, cable management).
- Upgrade cooling infrastructure for GPU-dense deployments - an 8x H100 node at full power draws 10.2 kW, which legacy cooling systems were not designed to handle.
- Consider liquid-cooled GPU deployments for new builds, which eliminate thermal throttling as a concern.
What Are the Tradeoffs of GPU Power Capping?
Power capping is the deliberate practice of setting a GPU's power limit below its default TDP (Thermal Design Power) to control heat output, manage rack power density, or fit more GPUs into a power-constrained facility. The tradeoff is straightforward: lower power means lower clock frequencies, which means lower throughput. But the relationship is non-linear, and understanding that non-linearity is the key to optimizing the tradeoff.
The Power-Performance Curve
Reducing an H100 SXM from its default 700W TDP to 600W (a 14% power reduction) typically reduces throughput by only 5-8% for memory-bandwidth-bound workloads like LLM inference and attention computation. This is because memory-bound kernels are limited by HBM bandwidth, not compute frequency, so reducing the SM clock has a proportionally smaller impact.
For compute-bound workloads like large GEMM operations in training, the same 100W power reduction causes a 12-15% throughput drop because the workload is directly gated by SM clock frequency.
This means power capping is not uniformly good or bad. It depends on your workload mix:
- Inference-heavy clusters: Aggressive power capping (600-650W on H100) yields significant power and cooling savings with minimal throughput impact
- Training-heavy clusters: Conservative power capping (650-680W on H100) provides thermal stability without materially slowing training
- Mixed workloads: Dynamic power management that adjusts caps based on the currently running workload
Power Capping for Density
In power-constrained data centers, power capping enables higher GPU density. Running 1,000 H100 GPUs at 600W instead of 700W saves 100 kW - enough headroom to deploy an additional 140+ GPUs at the capped power level. If your facility power is the binding constraint, running 1,140 GPUs at 600W delivers more aggregate throughput than 1,000 GPUs at 700W, even accounting for the per-GPU throughput reduction.
Monitor DCGM_FI_DEV_POWER_USAGE against DCGM_FI_DEV_ENFORCED_POWER_LIMIT to identify GPUs where the power cap is actively constraining performance. If actual power draw consistently equals the enforced limit, the cap is binding and any further reduction will have a proportionally larger throughput impact.
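A minimal sketch of that check with pynvml; NVML reports power in milliwatts, and the 98% threshold for calling a cap "binding" is an illustrative assumption.

```python
# Identify GPUs where the enforced power limit is actively binding.
# NVML reports power in milliwatts; the 98% threshold is illustrative.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000
    limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
    if draw_w >= 0.98 * limit_w:
        print(f"GPU {i}: drawing {draw_w:.0f} W against a {limit_w:.0f} W cap - cap is binding")
pynvml.nvmlShutdown()
```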
How Does the Performance Agent Optimize GPU Utilization?
Optimizing GPU utilization manually requires correlating dozens of metrics across hundreds of GPUs, comparing workload profiles, adjusting scheduling parameters, and continuously rebalancing as workloads change. This is exactly the kind of multi-variable, continuous optimization problem that autonomous agents handle well.
Factryze's Performance Agent addresses each utilization dimension:
Scheduling optimization: The agent monitors queue depth, job wait times, and resource fragmentation across the cluster. It identifies when backfill opportunities are being missed, when job sizing creates persistent fragmentation, and recommends scheduling parameter adjustments to close utilization gaps.
Pipeline stall detection: By correlating per-GPU utilization patterns with step timing, the agent detects data loading bottlenecks at the job level. When a GPU oscillates between 95% and 0% utilization at a regular interval, the agent flags it as a pipeline stall and identifies the bottleneck (DataLoader workers, storage throughput, CPU preprocessing).
Straggler identification: The agent compares utilization, SM clock, and step time across all GPUs in a distributed training job. A GPU that consistently runs 3-5% slower than its peers is flagged, and the agent correlates the slowdown with thermal, power, or hardware signals to determine whether the straggler can be remediated or should be replaced.
Thermal and power management: Dynamic power capping based on real-time thermal conditions, workload type, and facility constraints. Rather than static power caps applied uniformly, the agent adjusts per-GPU power limits to maximize aggregate throughput within thermal and power envelopes.
To see how these optimizations apply to your cluster, review our plans and pricing or contact us for a fleet analysis.
Frequently Asked Questions
Why does my GPU show 95% utilization but training is slow?
High DCGM_FI_DEV_GPU_UTIL does not mean high throughput. Utilization measures temporal occupancy (is the GPU executing something?), not computational efficiency. Check DCGM_FI_DEV_SM_CLOCK to see if the GPU is thermally or power throttled - a GPU running at base clock (1620 MHz on H100) instead of boost clock (1980 MHz) is delivering 18% less compute while still showing near-100% utilization. Also check DCGM_FI_DEV_CLOCK_THROTTLE_REASONS to identify the specific constraint. Read more about thermal throttling and how to detect it in our post on silent GPU failures.
What is a good target for GPU cluster utilization?
For training-focused clusters, 75-85% average utilization is achievable with proper scheduling, backfill, and workload management. For inference clusters, utilization targets depend on latency SLAs - you need headroom for burst traffic, so 60-70% sustained utilization is often appropriate. Getting above 90% sustained utilization usually requires some combination of elastic job sizing, aggressive backfill scheduling, and GPU partitioning with MIG.
Does power capping always reduce utilization?
No. Power capping reduces throughput (work done per unit time), not utilization (time the GPU is active). A power-capped GPU running a memory-bound workload may show the same utilization percentage at lower throughput. In some cases, power capping can actually improve effective utilization by preventing thermal throttling, which would otherwise cause more severe performance degradation than a controlled power cap.
How much utilization improvement can I expect from scheduling optimization alone?
Industry research shows that advanced scheduling approaches (backfill, gap-filling, dynamic priority) can improve cluster utilization by 15-25 percentage points compared to simple FIFO scheduling. The biggest gains come from enabling backfill scheduling with accurate job walltime estimates, which allows the scheduler to fit small jobs into gaps that would otherwise remain idle while large jobs wait for resources.
Monitor your GPU cluster with Factryze
Deploy autonomous agents that detect, diagnose, and optimize GPU infrastructure - in under 5 minutes.