Tensor Core Throughput at Scale
The number on the H100 spec sheet is 989 BF16 TFLOPS or 1979 FP8 TFLOPS. The number on your training throughput dashboard is usually less than half of that. The gap is not a defect; it is the cost of getting work onto the tensor cores in the first place.
Where the peak number comes from
A Hopper SM has 4 fourth-generation tensor cores, each sustaining 512 BF16 MACs per cycle (a warp-level 16x8x16 mma instruction retires over several cycles, not one). With 132 SMs at the roughly 1.83 GHz clock the spec number implies, the math is 132 * 4 * 512 * 2 * 1.83e9 / 1e12 ≈ 989 TFLOPS. The 2 in the formula is the multiply-add (one MAC counts as 2 FLOPs). NVIDIA's marketing peak number is exactly this calculation, and it is real: cuBLAS can come close to it on a synthetic GEMM at the right sizes.
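The same arithmetic as a quick sanity check; the 512 MACs per tensor core per cycle and the ~1.83 GHz clock are the assumptions stated above, not measured values:

```python
# Back-of-the-envelope reconstruction of the H100 SXM5 BF16 dense peak.
sms = 132                      # streaming multiprocessors
tcs_per_sm = 4                 # fourth-gen tensor cores per SM
macs_per_tc_cycle = 512        # assumed BF16 MACs per tensor core per cycle
clock_hz = 1.83e9              # clock implied by the 989 TFLOPS spec number

peak_tflops = sms * tcs_per_sm * macs_per_tc_cycle * 2 * clock_hz / 1e12
print(f"peak ~ {peak_tflops:.0f} BF16 TFLOPS")   # ~989
```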
What it does not include: the cost of getting the operands into registers, the cost of moving the result out, the cost of launching the kernel, the cost of any collective the result feeds into, the cost of the host-side Python that orchestrates the sequence. Each of those is a deduction.
The deductions in order of impact
HBM bandwidth. Every matmul has a compute-to-memory ratio, its arithmetic intensity. Sustaining peak BF16 needs 989 TFLOPS / 3.35 TB/s ≈ 295 FLOPs per byte of HBM traffic to be compute-bound. Many of the matmuls in a real training step are not that arithmetic-dense: a square 8192x8192x8192 BF16 GEMM has an ideal intensity around 2,700 FLOP/byte and sits comfortably above the ridge, but the skinnier shapes do not; an attention QK^T at sequence 4096 with head dimension 64 is closer to 64 FLOP/byte. Below the ridge point on the roofline, you are HBM-bound, not tensor-core-bound, and the peak number is irrelevant. Push the kernel into compute-bound territory by increasing arithmetic intensity (larger tile sizes, fused ops, FlashAttention-style tiling) or accept the bandwidth ceiling.
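A small roofline check makes the ridge-point comparison concrete. This is a sketch that assumes BF16 operands, each read from HBM once, and the output written once; the two shapes are the examples above:

```python
# Rough roofline check: is a GEMM compute-bound or HBM-bound on H100 SXM5?
PEAK_TFLOPS = 989.0             # BF16 dense peak
HBM_TBPS = 3.35                 # HBM3 bandwidth
RIDGE = PEAK_TFLOPS / HBM_TBPS  # ~295 FLOP/byte

def gemm_intensity(m, n, k, bytes_per_elem=2):
    """Ideal arithmetic intensity: each operand read once, output written once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

for m, n, k in [(8192, 8192, 8192),   # big square GEMM
                (4096, 4096, 64)]:    # QK^T, sequence 4096, head dim 64
    ai = gemm_intensity(m, n, k)
    side = "compute" if ai > RIDGE else "HBM"
    print(f"{m}x{n}x{k}: {ai:.0f} FLOP/byte -> {side}-bound")
```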
Kernel launch overhead. Each cudaLaunchKernel costs roughly 5 to 20 microseconds of host-side and driver work before the kernel starts. A training step calls hundreds of kernels. If your kernels run in 50 microseconds each, launch overhead is 10-40% of wall time. CUDA Graphs cut this by orders of magnitude by replaying a captured launch sequence, which is why modern training stacks (PyTorch, JAX/XLA, NeMo) capture the inner loop as a graph. Without graphs, kernel launch is often a bigger throughput killer than HBM bandwidth on small-batch training.
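A minimal sketch of graph capture in PyTorch, following the standard capture-and-replay pattern; the two-linear-layer block and the shapes are placeholder stand-ins, not taken from any real model:

```python
import torch

# Toy stand-in for one transformer block; sizes are placeholders.
block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().to(torch.bfloat16)

static_x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream so capture sees steady-state allocator behavior.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        block(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once; replay() re-issues all captured kernels
# with a single launch instead of one cudaLaunchKernel per op.
g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_y = block(static_x)

static_x.copy_(torch.randn_like(static_x))   # new data into the captured buffer
g.replay()
print(static_y.shape)
```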
Collective sync wait. Once the matmul is done, the result has to participate in an all-reduce (DDP) or reduce-scatter (FSDP). Until that collective is overlapped with the next matmul's compute, every step pays the collective's wall time on top of the compute time. Compute-communication overlap is the technique that hides this: without it, on multi-node setups the collective wait can be 30% of the step; with it, closer to 5%.
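The overlap idea in its smallest form, sketched with torch.distributed; this assumes an already-initialized NCCL process group (e.g. launched with torchrun) and uses placeholder tensor sizes:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already run, one GPU per rank.
grad = torch.randn(4096, 4096, device="cuda")                     # stand-in gradient bucket
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

work = dist.all_reduce(grad, async_op=True)   # NCCL kernel runs on its own stream
c = a @ b                                     # tensor-core work overlaps with the reduction
work.wait()                                   # block only when the reduced gradient is needed
```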
FP precision overhead. FP8 doubles the peak number (1979 vs 989 TFLOPS), but the FP8 GEMM still accumulates in FP32, the output has to be cast back to BF16 for the next layer, and Transformer Engine tracks per-tensor amax history so the scaling factors keep values inside the format's range. Each cast and amax update is a few percent. Profitable on Hopper, marginal on Ada, painful without TE.
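A minimal sketch of what the TE path looks like in PyTorch, assuming Transformer Engine is installed and the run is on Hopper; the layer size and recipe settings are placeholders:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID = E4M3 for forward tensors, E5M2 for gradients; the amax history
# feeds the per-tensor scaling factors mentioned above.
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)          # FP8 GEMM, FP32 accumulate, BF16 output
y.sum().backward()        # amax histories update as part of the step
```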
What achieved throughput actually looks like
A well-tuned 70B parameter pretraining run on H100 SXM5 lands in the 50-60% of peak BF16 range, end to end. That is "model FLOPS utilization" or MFU in the literature. The breakdown looks roughly like:
- 100% of theoretical peak (the spec sheet number)
- minus 15% for HBM bandwidth (the matmul shapes are not all square)
- minus 10% for kernel launch and host-side orchestration overhead that graphs did not eliminate
- minus 15% for collective sync that did not overlap perfectly
- minus 10% for precision casts and amax tracking
- = roughly 50% of peak achieved
If you are below 30% MFU on a transformer pretraining run, something is structurally wrong: probably either the batch size is too small to hide collective overhead, or the model has a layer that is HBM-bound and dragging the average down (look at attention with short sequences, or a projection layer with the wrong tile size).
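A quick per-op profile is usually enough to find the layer that is dragging the average down; a minimal sketch with torch.profiler, where the toy block is a placeholder for your model and batch:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch; substitute your own module and input.
block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().to(torch.bfloat16)
batch = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    block(batch).sum().backward()

# Sort by GPU time to see which kernels (and therefore which layers) dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```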
Practical guidance
- Track MFU as a first-class metric. Compute it as (FLOPs per step) / (step wall time * peak FLOPS); see the sketch after this list.
- If MFU is below 40%, profile before optimizing. Nsight Systems will show you whether the bottleneck is HBM, launch, or collective.
- Use CUDA Graphs and torch.compile (or TE wrappers) to remove launch overhead.
- Use FP8 with Transformer Engine when on H100 or B200; the format ceiling is real.
- Profile per-layer; the layer that is dragging MFU down is rarely the one you would guess.
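A minimal sketch of the MFU computation from the first bullet, using the common 6 * parameters * tokens approximation for a decoder-only transformer's training FLOPs; every input value below is a placeholder, not a measurement:

```python
# MFU = achieved FLOPS / aggregate peak FLOPS.
params = 70e9                  # model parameters (placeholder)
tokens_per_step = 4e6          # global batch size in tokens (placeholder)
step_time_s = 6.6              # measured wall time per step (placeholder)
peak_tflops = 989.0            # H100 SXM5 BF16 dense peak
num_gpus = 512                 # placeholder cluster size

flops_per_step = 6 * params * tokens_per_step      # forward + backward approximation
achieved_tflops = flops_per_step / step_time_s / 1e12
mfu = achieved_tflops / (peak_tflops * num_gpus)
print(f"MFU = {mfu:.1%}")                          # ~50% with these numbers
```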
The takeaway: tensor cores are the headline number; achieving them is the engineering work. Every chapter in this atlas is, in some sense, about closing the gap between marketed peak and what your training step delivers.
Updated 2026-05-10