
Roofline Analysis at Cluster Scale

Bounds achievable performance by min(peak compute, peak bandwidth times arithmetic intensity), extended across HBM, NVLink, and InfiniBand. Tells you which bandwidth tier bounds your throughput.
  • Original: Williams, Waterman, Patterson (2009)
  • Single-GPU axes: AI vs. achieved FLOPS
  • Cluster axes: adds NVLink, IB, and GDS slopes

The roofline model, in one sentence: achievable performance equals the minimum of (peak compute) and (peak bandwidth times arithmetic intensity). Drawn on a log-log plot with arithmetic intensity (FLOP per byte) on the x-axis and achieved TFLOPS on the y-axis, you get a horizontal ceiling for compute and a sloped line for bandwidth. Workloads sit on whichever one binds them. The model was published in 2009 by Williams, Waterman, and Patterson; it is the most useful single picture in performance engineering.
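
As a back-of-envelope sketch of that sentence, using the H100 SXM figures this article relies on later (989 TF BF16, 3.35 TB/s HBM) as assumed constants:

```python
def attainable_tflops(ai_flop_per_byte, peak_tflops=989.0, bw_tb_per_s=3.35):
    """Classic single-tier roofline: min(compute ceiling, bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bw_tb_per_s * ai_flop_per_byte)

# The ridge point is where the bandwidth slope meets the compute ceiling: peak / bandwidth.
ridge_flop_per_byte = 989.0 / 3.35   # ~295 FLOP/byte for H100 BF16 vs. HBM3

print(attainable_tflops(50))    # 167.5 TFLOPS -- bandwidth-bound at AI = 50
print(attainable_tflops(500))   # 989.0 TFLOPS -- compute-bound at AI = 500
```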

Why it matters at cluster scale

The original roofline assumes a single bandwidth tier (HBM, the GPU's local memory). At cluster scale, you have at least three tiers, each with its own slope:

  1. HBM at roughly 3.35 TB/s on an H100 SXM. Steep slope; it meets the compute ceiling at low arithmetic intensity.
  2. NVLink + NVSwitch at roughly 900 GB/s per GPU. Medium slope.
  3. InfiniBand NDR at roughly 50 GB/s per GPU (one 400 Gb/s link). Shallow slope.

A workload that fits inside one GPU's HBM rides the HBM slope. A workload that has to hit NVLink (e.g., model parallelism within a node) rides the NVLink slope, which intersects the compute ceiling at higher arithmetic intensity. A workload that crosses InfiniBand (multi-node collectives) rides the IB slope, which does not reach the compute ceiling until an arithmetic intensity of roughly 20,000 FLOP/byte; in practice it stays bandwidth-bound.

Cluster-scale roofline is what tells you whether your bottleneck is compute, HBM, or interconnect. The answer dictates the optimization. If you are HBM-bound, more flops do nothing; you need fusion or higher arithmetic intensity. If you are NVLink-bound, you need topology-aware placement. If you are IB-bound, you need compute-comm overlap or a smarter parallelism strategy.
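
One simple way to extend the model to multiple tiers is to time each resource separately and take the slowest term; whichever term dominates is the binding resource. A minimal sketch, with bandwidths as assumed constants matching the numbers above:

```python
# Assumed per-GPU bandwidths (bytes/s) and compute peak, from the tiers listed above.
TIERS = {"hbm": 3.35e12, "nvlink": 0.9e12, "ib": 50e9}
PEAK_FLOPS = 989e12  # H100 SXM BF16 dense

def binding_resource(flops, bytes_by_tier):
    """Return (resource, seconds) for the slowest term of a simple multi-tier roofline.

    flops         -- total FLOPs the step performs
    bytes_by_tier -- e.g. {"hbm": ..., "nvlink": ..., "ib": ...}, bytes moved per tier
    """
    times = {"compute": flops / PEAK_FLOPS}
    for tier, nbytes in bytes_by_tier.items():
        times[tier] = nbytes / TIERS[tier]
    worst = max(times, key=times.get)
    return worst, times[worst]

# Example: a step doing 1 PFLOP, moving 200 GB through HBM and 20 GB across InfiniBand.
print(binding_resource(1e15, {"hbm": 200e9, "ib": 20e9}))   # -> ('ib', 0.4)
```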

[Figure: cluster roofline. X-axis: FLOP/byte; y-axis: TFLOPS. Compute ceiling at 989 TF BF16, with slopes for HBM (3.35 TB/s), NVLink (0.9 TB/s), and InfiniBand; matmul, attention, and all-reduce plotted. Each workload sits on the slope of whichever bandwidth bounds it.]

Reading workloads on the chart

Three reference points anchor the plot:

Matmul. A square 8192x8192 BF16 matmul has arithmetic intensity around 2,700 FLOP/byte (roughly N/3 for square matrices in BF16, counting one HBM trip per matrix). The H100 ridge point is about 295 FLOP/byte, so a large matmul sits far to the right of it, just below the compute ceiling: compute-bound to within a few percent. This is why cuBLAS hits 800-900 TFLOPS on the right shapes; the math says it should.
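
The arithmetic behind that number, as a sketch (BF16 is 2 bytes per element; assume each matrix crosses HBM exactly once, i.e. perfect reuse inside the cache hierarchy):

```python
def square_matmul_ai(n, bytes_per_elem=2):
    """Arithmetic intensity of an N x N x N matmul, assuming each matrix touches HBM once."""
    flops = 2 * n**3                          # one multiply + one add per inner-product term
    bytes_moved = 3 * n**2 * bytes_per_elem   # read A, read B, write C
    return flops / bytes_moved                # = n / 3 for BF16

print(square_matmul_ai(8192))   # ~2731 FLOP/byte, far right of the ~295 ridge point
```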

Attention. Standard attention without FlashAttention is heavily memory-bound. The QK^T matmul has good arithmetic intensity (similar to a regular matmul), but the softmax and dropout produce intermediate tensors that have to round-trip through HBM. Effective AI is closer to 30-60 FLOP/byte, well below the HBM ridge. Result: attention runs at maybe 30-40% of peak. FlashAttention fuses the kernels and rewrites the algorithm to keep softmax in shared memory, raising effective AI to 100+ FLOP/byte and pulling attention into the compute-bound region.
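
A back-of-envelope estimate of why unfused attention lands so low, under loose assumptions (single head, sequence length much larger than head dimension, and the s x s score matrix round-tripping through HBM a few times for softmax and dropout); the round-trip count is an assumption, not a measurement:

```python
def unfused_attention_ai(seq_len, head_dim, score_roundtrips=4, bytes_per_elem=2):
    """Rough effective AI for naive attention; the s*s intermediates dominate HBM traffic.

    score_roundtrips -- assumed number of times the s x s score/probability matrix
                        crosses HBM (write after QK^T, read/write for softmax, read for PV).
    """
    flops = 4 * seq_len**2 * head_dim                        # QK^T and PV matmuls
    bytes_moved = score_roundtrips * seq_len**2 * bytes_per_elem
    return flops / bytes_moved                               # ~ 2 * head_dim / score_roundtrips

print(unfused_attention_ai(8192, 128))   # ~64 FLOP/byte, the same ballpark as the 30-60 above
```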

All-reduce. Pure communication, zero arithmetic per byte. AI is essentially zero; the workload sits on the InfiniBand slope at the far left of the plot. There is no compute-bound regime for all-reduce; it is bandwidth-bound by definition. The only way to make it faster is to send fewer bytes (FP16 instead of FP32, gradient bucketing to amortize fixed cost), or hide the transfer behind compute (overlap).
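
To see why "send fewer bytes" is the only lever, here is a sketch of the bandwidth term of the standard ring all-reduce cost model (latency ignored; the NDR figure of ~50 GB/s per GPU from above is an assumed constant):

```python
def ring_allreduce_seconds(buffer_bytes, n_ranks, bw_bytes_per_s=50e9):
    """Bandwidth term of a ring all-reduce: each rank sends/receives ~2*(n-1)/n of the buffer."""
    wire_bytes = 2 * (n_ranks - 1) / n_ranks * buffer_bytes
    return wire_bytes / bw_bytes_per_s

# 7B parameters of gradients across 64 ranks over NDR InfiniBand:
print(ring_allreduce_seconds(7e9 * 4, 64))   # FP32: ~1.10 s
print(ring_allreduce_seconds(7e9 * 2, 64))   # FP16: ~0.55 s -- halving the bytes halves the time
```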

What the model does not capture

Roofline is a ceiling, not a prediction. Three things it does not model:

  1. Latency. The model assumes steady-state throughput. A workload with many small kernels can be far below the roofline because of kernel-launch overhead, even though each individual kernel is well-tuned. CUDA Graphs and torch.compile exist to fix this.

  2. Imperfect overlap. Real pipelines have stalls where compute and communication do not overlap perfectly. The roofline assumes you saturate one resource at a time; a poorly scheduled training step can underperform both ceilings simultaneously (a rough model of this and the launch-overhead effect follows this list).

  3. Multiple resources at once. Modern Hopper SMs can do BF16 matmul and HBM reads concurrently (the tensor cores and the LD/ST units are separate hardware). The simple roofline does not capture this; the "hierarchical roofline" extension does, but few teams build the chart.
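
A hedged back-of-envelope for the first two gaps: launch overhead on many small kernels, plus a communication phase that is only partly hidden. All constants here are illustrative assumptions, not measurements.

```python
def effective_tflops(flops, n_kernels, kernel_launch_us=5.0,
                     roofline_tflops=989.0, comm_s=0.0, overlap_frac=0.0):
    """Achieved TFLOPS after adding launch overhead and the un-overlapped part of comm.

    kernel_launch_us -- assumed per-kernel launch/dispatch overhead
    overlap_frac     -- fraction of communication hidden behind compute (1.0 = perfect)
    """
    compute_s = flops / (roofline_tflops * 1e12)
    launch_s = n_kernels * kernel_launch_us * 1e-6
    exposed_comm_s = comm_s * (1.0 - overlap_frac)
    return flops / 1e12 / (compute_s + launch_s + exposed_comm_s)

# 10 TFLOP of work split across 5,000 tiny kernels, plus 20 ms of half-overlapped comm:
print(effective_tflops(10e12, 5000, comm_s=0.020, overlap_frac=0.5))
# ~220 TFLOPS: well under both ceilings, even if every individual kernel is tuned.
```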

The honest use of the model: draw the chart, plot your workloads, and ask "which slope binds me right now?" If the answer is HBM, do not optimize tensor cores. If the answer is IB, do not optimize HBM. The roofline is most useful as a forcing function for asking the right question, not as a precise predictor.

Practical guidance

  • Build the roofline chart for your specific GPU and parallelism strategy (a plotting sketch follows this list). The slopes change with hardware (H200 has more HBM bandwidth than H100, B200 more again) and with topology.
  • Profile with Nsight Compute to get measured arithmetic intensity per kernel. Plot the dots; do not estimate.
  • Optimize against the binding ceiling, not against a guess. The headline TFLOPS number on the spec sheet is the compute ceiling; the achieved TFLOPS for your workload is wherever roofline puts you.
  • Use FP8 with Transformer Engine to raise the compute ceiling (1979 TF instead of 989); on workloads that were close to compute-bound on BF16, this is the single biggest knob you have.
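
A minimal plotting sketch for the first two bullets above, using matplotlib. The ceilings and slopes are the H100 numbers used throughout and the plotted dots are placeholders; substitute your hardware's figures and the (AI, TFLOPS) pairs Nsight Compute reports for your kernels.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed H100 SXM figures (TFLOPS ceilings, TB/s slopes); swap in your own hardware.
CEILINGS = {"BF16": 989.0, "FP8": 1979.0}
SLOPES = {"HBM 3.35 TB/s": 3.35, "NVLink 0.9 TB/s": 0.9, "IB 0.05 TB/s": 0.05}

ai = np.logspace(-2, 4, 200)                     # FLOP/byte, log-spaced x-axis
fig, ax = plt.subplots()
for name, bw in SLOPES.items():                  # bandwidth slopes, capped at the BF16 ceiling
    ax.plot(ai, np.minimum(bw * ai, CEILINGS["BF16"]), label=name)
for name, tf in CEILINGS.items():                # horizontal compute ceilings
    ax.axhline(tf, linestyle="--", linewidth=1, label=f"{name} peak {tf:.0f} TF")

# Placeholder measured kernels: (arithmetic intensity, achieved TFLOPS) from profiling.
measured = {"matmul": (2700, 850), "attention (unfused)": (45, 140), "all-reduce": (0.05, 0.002)}
for name, (x, y) in measured.items():
    ax.plot(x, y, "o")
    ax.annotate(name, (x, y))

ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Arithmetic intensity (FLOP/byte)")
ax.set_ylabel("Achieved TFLOPS")
ax.legend()
plt.show()
```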

The takeaway: roofline is the chart that tells you which bottleneck is real. Every term in this chapter is a story about closing the gap between peak and achieved; roofline is the picture that puts them all on one axis.

Updated 2026-05-10