Warp-level Throughput
The warp is the unit of work an NVIDIA GPU actually executes: 32 threads, one program counter, one instruction issued at a time. Every CUDA kernel you write, every PyTorch op you call, every cuBLAS GEMM decomposes into warps that the SM schedules. If your warps are not full, your GPU is not full, and no amount of HBM bandwidth or NVLink width fixes it.
What a warp is and why 32
A warp is 32 threads bundled together by the hardware. They share a program counter, an active mask, and an issue slot on the SM. When you launch a kernel with 256 threads per block, the hardware groups them into 8 warps of 32 and schedules each one as a single SIMT instruction stream. The warp is the smallest unit the scheduler reasons about. Threads inside a warp do not have independent control flow; they have an active mask that selects which of the 32 lanes participate in the current instruction. (Volta and later give each thread its own program counter to guarantee forward progress, but divergent paths still serialize, so the performance model is unchanged.)
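A minimal sketch of that decomposition, assuming nothing beyond stock CUDA: each thread derives its warp and lane index from threadIdx.x, and lane 0 of each warp reports where its warp starts.

```cuda
#include <cstdio>

// One block of 256 threads decomposes into 8 warps of 32.
// warpSize is a CUDA built-in device variable (32 on current NVIDIA GPUs).
__global__ void warp_ids() {
    int lane = threadIdx.x % warpSize;   // position within the warp (0..31)
    int warp = threadIdx.x / warpSize;   // which of the block's warps (0..7)
    if (lane == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp, threadIdx.x);
}

int main() {
    warp_ids<<<1, 256>>>();   // 256 threads -> 8 warps
    cudaDeviceSynchronize();
    return 0;
}
```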
The choice of 32 is a hardware decision. AMD GPUs have traditionally used 64-lane "wavefronts" (RDNA added a native 32-wide mode); NVIDIA picked 32 in the late 2000s and has not moved off it since. The implication: if your problem decomposes naturally into multiples of 32, you are fine; if it decomposes into 7 or 9 or 17, you waste lanes on every warp.
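As a concrete illustration of sizing to whole warps, here is a common launch idiom (hypothetical helper, not tied to any library): ceil division confines the masked-off lanes to the final warp of the grid.

```cuda
// Tail-guarded kernel: only the last warp of the grid carries inactive lanes.
__global__ void scale(float* x, int n, float a) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) x[tid] *= a;   // lanes past n are masked off
}

// With n = 1000: blocks = 8, so 1024 threads = 32 warps are launched.
// Warps 0..30 run fully populated; warp 31 runs with 8 of its 32 lanes active.
void launch_scale(float* d_x, int n, float a) {
    int threads = 128;                         // 4 full warps per block
    int blocks  = (n + threads - 1) / threads; // ceil division
    scale<<<blocks, threads>>>(d_x, n, a);
}
```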
Divergence: where warp throughput dies
The active mask is the trap. When a warp hits an if (threadIdx.x % 2 == 0) branch, the hardware does not magically run two streams. It runs the if side with a mask of 16 lanes active and 16 masked off, then runs the else side with the mask flipped. Each pass consumes a full instruction issue slot. Two-way divergence: 2x cost. Eight-way divergence on something like a switch statement: 8x cost. The lanes that were masked off still consumed their share of the cycle; you just got no useful work out of them.
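A sketch of both versions, with hypothetical even_work/odd_work helpers standing in for real per-lane code. Note that the convergent rewrite changes which data takes which path, so it is only valid if you can also regroup the data to match.

```cuda
// Hypothetical per-lane work; stand-ins for illustration only.
__device__ float even_work(float v) { return v * 2.0f; }
__device__ float odd_work(float v)  { return v + 1.0f; }

// Divergent: adjacent lanes alternate paths, so every warp executes BOTH
// branches, each pass with half its 32 lanes masked off. ~2x issue cost.
__global__ void divergent(float* x) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) x[tid] = even_work(x[tid]);
    else              x[tid] = odd_work(x[tid]);
}

// Convergent: branch on the warp index instead. All 32 lanes of a warp take
// the same path, so each warp issues only one side of the branch.
__global__ void convergent(float* x) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;
    if (warp % 2 == 0) x[tid] = even_work(x[tid]);
    else               x[tid] = odd_work(x[tid]);
}
```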
The pattern that bites people: indexing into a lookup table where the index depends on data, result = table[input[tid]]. If input[tid] lands on different cache lines for adjacent threads, you get memory divergence (the warp's single load splits into multiple memory transactions), and if downstream control flow depends on the loaded value, you get branch divergence on top. Profile the kernel with Nsight Compute; the "Warp Execution Efficiency" metric tells you what fraction of your active lanes did real work. Below 80% is a smell; below 50% means you have a structural problem in the kernel.
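Below is a sketch of that gather plus the standard mitigation when the table is small: stage it in shared memory once per block, so the scattered per-lane reads hit on-chip storage instead of scattered cache lines. TABLE_SIZE and the kernel shape are assumptions for illustration.

```cuda
#define TABLE_SIZE 256  // assumed: table fits comfortably in shared memory

__global__ void lut_gather(const float* __restrict__ table,
                           const int*   __restrict__ input,
                           float* __restrict__ out, int n) {
    __shared__ float s_table[TABLE_SIZE];
    // Cooperative, coalesced load of the table, once per block.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = table[i];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = s_table[input[tid]];  // assumes input[tid] < TABLE_SIZE;
                                         // scattered reads now stay on-chip
}
```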
Why this matters at fleet scale
A single low-efficiency kernel costs you maybe 10% of one GPU. At fleet scale, that 10% is the difference between fitting your training run in the budget you have and renting another rack. NVIDIA's tensor-core-bound kernels (cuBLAS, cuDNN, FlashAttention) are tuned to within a few percent of full warp execution efficiency. Custom kernels written by ML engineers under deadline pressure are not. The cost shows up as "we are getting 60% of peak BF16, why?" and the answer is almost always warp divergence, low SM occupancy, or HBM bandwidth saturation.
The runbook for diagnosing it:
- Run the kernel under Nsight Compute with --metrics smsp__average_warp_execution_efficiency.pct.
- If it comes back under 80%, look at the source-level divergence map and find the offending branch.
- If it is above 90%, the kernel is warp-clean. Look elsewhere (occupancy, memory, collectives).
Practical guidance
Do not write custom CUDA when a tuned library exists. cuBLAS, cuDNN, CUTLASS, FlashAttention, and Triton-compiled kernels all hit warp efficiency that hand-rolled code rarely matches. Your time is better spent on the layer above (kernel selection, fusion, layout) than on lane-level optimization.
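For instance, a plain FP32 GEMM is one cuBLAS call. A minimal sketch, assuming column-major device matrices d_A (m x k), d_B (k x n), and d_C (m x n) already resident on the GPU; error checks and handle reuse are omitted for brevity:

```cuda
#include <cublas_v2.h>

// C = alpha * A * B + beta * C, delegated to NVIDIA's tuned kernels.
// In real code, create the handle once and reuse it across calls.
void gemm(const float* d_A, const float* d_B, float* d_C,
          int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
    cublasDestroy(handle);
}
```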
When you must write custom kernels, design for the warp first. Group threads that take the same control-flow path into the same warp. Sort or partition data so that adjacent threads see similar values. Use __ballot_sync and __shfl_sync to communicate inside a warp before falling back to shared memory. The warp is the cheapest synchronization domain on the GPU; do not waste it.
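A minimal sketch of that advice: a warp-level sum reduction built on __shfl_down_sync, producing one partial sum per warp with no shared memory or __syncthreads() inside the warp. The kernel shape and warp_sums layout are illustrative assumptions; the block size must be a multiple of 32 so the full-warp mask is valid.

```cuda
// Butterfly-style warp reduction: after 5 shuffle steps (16, 8, 4, 2, 1),
// lane 0 holds the sum of all 32 lanes. 0xffffffff is the full-warp mask.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// One partial sum per warp, written by lane 0; tail lanes contribute 0.
__global__ void block_sums(const float* x, float* warp_sums, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (tid < n) ? x[tid] : 0.0f;
    v = warp_reduce_sum(v);
    if (threadIdx.x % 32 == 0)
        warp_sums[tid / 32] = v;
}
```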
The takeaway: tensor cores get the marketing, warps get the throughput. If your warps are not full and not convergent, the GPU's headline TFLOPS number is a fiction. Read it as "this is the ceiling assuming I never write the bug that empties my warps." See SM occupancy for the next layer up: how many warps the SM is actually running at once.
See also
- SM occupancy
Updated 2026-05-10