SM Occupancy
A streaming multiprocessor on an H100 has room for 64 resident warps. Each of the SM's four warp schedulers picks a ready warp per cycle and issues an instruction from it. Occupancy is the fraction of those 64 slots that are actually filled. At 75% the SM has more than enough warps to swap in when the current one stalls on HBM; at 20% it sits idle every time a load misses the cache.
What occupancy actually measures
Occupancy is not "how busy is the GPU." It is "how many warps does the scheduler have to choose from on this SM." The distinction matters because a kernel can keep a single warp 100% busy and still sit at 5% occupancy on the SM. The headline metric gpu-utilization reports SM activity, not occupancy, so a kernel showing 90% utilization can still be occupancy-starved.
The cap on warps per SM comes from three resources: the active warp count itself (hardware limit, 64 on H100), register file (65,536 32-bit registers per SM, divided across resident warps), and shared memory (228 KB per SM on H100, divided across resident blocks). If your kernel uses 128 registers per thread and 96 KB of shared memory per block, the register file alone limits you to 16 warps resident, which is 25% occupancy. The compiler reports register usage with -Xptxas -v; the CUDA Occupancy Calculator (now built into Nsight Compute) tells you what cap is binding.
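The same arithmetic is exposed programmatically through the CUDA runtime's occupancy API. Below is a minimal sketch under stated assumptions: my_kernel, the 256-thread block, and the 96 KB dynamic shared memory request are placeholders standing in for the example above, not a real workload.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; its compiled register count plus the shared memory
// requested at launch are what determine the numbers below.
__global__ void my_kernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main() {
    const int block_size = 256;             // 8 warps per block
    const int smem_per_block = 96 * 1024;   // the 96 KB example above

    // Requesting more than 48 KB of dynamic shared memory needs an explicit opt-in.
    cudaFuncSetAttribute(my_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_per_block);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, block_size, smem_per_block);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int warps_per_block = block_size / prop.warpSize;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // 64 on H100

    printf("resident blocks/SM: %d, theoretical occupancy: %.0f%%\n",
           blocks_per_sm,
           100.0 * blocks_per_sm * warps_per_block / max_warps);
    return 0;
}
```

With the 96 KB request, the shared memory cap alone already limits this placeholder to two blocks per SM (16 warps, 25%); swap in the real kernel and whichever of registers or shared memory binds first shows up the same way.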
Why occupancy is what hides HBM latency
HBM3 on H100 delivers about 3.35 TB/s of memory bandwidth, which sounds like a lot until you remember each load has a latency of roughly 400 to 600 cycles. If only 12 warps are resident, the scheduler exhausts its ready warps in well under that latency, and the SM sits idle until the load returns. With 56 warps resident, the scheduler always has a ready warp to issue while others wait, and the SM never stops working. This is exactly the trick that makes GPUs throughput-oriented: massive thread-level parallelism hides per-instruction latency.
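A back-of-envelope version of that claim, using round H100 SXM numbers (132 SMs, ~3.35 TB/s, ~1.8 GHz, ~500-cycle load latency) that should be read as illustrative assumptions rather than spec-sheet values:

```cuda
// Little's law: bytes that must be in flight = bandwidth * latency.
constexpr double bw_per_sm   = 3.35e12 / 132;        // ~25 GB/s of HBM bandwidth per SM
constexpr double bytes_cycle = bw_per_sm / 1.8e9;    // ~14 bytes per cycle per SM
constexpr double in_flight   = bytes_cycle * 500.0;  // ~7 KB in flight per SM
constexpr double lines       = in_flight / 128.0;    // ~55 outstanding 128-byte lines
// If each warp keeps roughly one 128-byte load outstanding, it takes on the
// order of 50 warps per SM to keep HBM saturated -- which is why ~56 resident
// warps hides the latency and ~12 leaves the SM idle.
```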
The implication is non-obvious: maximizing per-thread efficiency by jamming registers full of cached values can hurt total throughput, because it drops occupancy below the latency-hiding threshold. The CUDA performance lore "reduce register pressure" is really "do not let register usage cap occupancy below where the scheduler runs out of ready warps." For modern GPUs that threshold is roughly 50%; below that, HBM stalls become visible in the trace.
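One concrete lever here is __launch_bounds__, which tells the compiler the residency you intend so it holds register usage down to fit. The kernel below is a hypothetical placeholder; the qualifier and the arithmetic in the comment are the point.

```cuda
// Asking for 256 threads per block and at least 4 resident blocks per SM caps
// registers at roughly 65,536 / (256 * 4) = 64 per thread, so 32 of the 64
// warp slots (50% occupancy) can stay filled.
__global__ void __launch_bounds__(256, 4)
scaled_copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // placeholder body
}
```

The trade-off runs both ways: force registers too low and the compiler starts spilling, which is the first pattern in the next section.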
What kills occupancy in real kernels
Three patterns dominate:
- Register spills. The compiler needs more registers per thread than the hardware can provide, spills the excess to local memory (which is HBM with a fancy name), and now every spill is a memory access. Spills show up as "stack frame" entries in ptxas output.
- Shared memory hogs. Each block reserves shared memory at launch. If a block needs 96 KB of smem on a 228 KB SM, only two blocks fit per SM. Two blocks of 8 warps each is 16 warps, or 25% occupancy.
- Block size mismatch. Launch with 32 threads per block on a kernel that scales to 1024 and you run into the hardware limit on resident blocks per SM (32 on H100): the best case is 32 single-warp blocks, or 50% occupancy, and any per-block shared memory reservation drags it lower still. The classic "I forgot to set blockDim" mistake; a sketch of letting the runtime pick the block size follows this list.
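For the third pattern, a minimal sketch of asking the runtime for a block size instead of hard-coding one; my_kernel and launch are placeholder names, not part of any real codebase.

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float* data, int n) {   // stand-in for the real kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void launch(float* data, int n) {
    int min_grid = 0, block = 0;
    // Returns the block size that maximizes theoretical occupancy given this
    // kernel's register and shared memory footprint.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, my_kernel);
    int grid = (n + block - 1) / block;
    my_kernel<<<grid, block>>>(data, n);
}
```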
FlashAttention and recent Triton-compiled kernels deliberately target high occupancy by keeping per-thread state minimal and using shared memory as a working set rather than a cache. The numbers move with hardware (H100 has more shared memory and more registers than A100, so kernels tuned for A100 occupancy may not be tuned for H100), and re-tuning every generation is part of the cost of running on the latest silicon.
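When moving between generations, the per-SM budgets can at least be queried at runtime rather than hard-coded for one chip. A small sketch; device 0 and the output format are arbitrary choices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    // The three budgets that cap occupancy, as reported by the driver.
    printf("registers/SM: %d, shared memory/SM: %zu KB, max warps/SM: %d\n",
           p.regsPerMultiprocessor,
           p.sharedMemPerMultiprocessor / 1024,
           p.maxThreadsPerMultiProcessor / p.warpSize);
    return 0;
}
```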
Practical guidance
- Profile with Nsight Compute. The "Achieved Occupancy" metric is the truth; the theoretical occupancy from the calculator is the ceiling.
- For attention and matmul kernels, target occupancy at or above 50%.
- Watch for occupancy drops when you change block size, when you add a new tensor to shared memory, or when you upgrade CUDA versions (compiler register allocation changes between versions).
- Do not chase 100%. The relationship is non-linear; going from 25% to 50% is huge, going from 50% to 75% is small, going to 100% is often impossible without giving up performance elsewhere.
The scheduler is the GPU's secret weapon. Occupancy is what feeds it. See warp-level throughput for the layer below this one and tensor-core throughput for what the warps are actually doing on Hopper-class chips.
Updated 2026-05-10