Scale Atlas · Chapter 1 of 8 · 8 terms · Updated 2026-05-10
Compute
Peak FLOPs are easy. Achieved FLOPs are everything. The compute story starts at the 32-thread warp, climbs through SM occupancy and tensor-core utilization, and ends at the cluster-scale roofline. FP8 doubles throughput when scaling holds. MIG and MPS slice one GPU when a single job cannot fill it. Every term in this chapter names a gap between the marketed peak and what your training step actually delivers.
FP8 Numerics
8-bit floating point in two flavors that halves memory traffic and compute cost vs FP16, when scaling is correct: E4M3 trades range for precision (weights, activations); E5M2 trades precision for range (gradients).
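A minimal sketch of per-tensor scaling, assuming only the standard format constants (E4M3 tops out at 448, E5M2 at 57344). `fp8_quantize` and `fp8_dequantize` are illustrative helpers, not a library API, and only the range behavior is modeled, not mantissa rounding:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value; E5M2 would be 57344.0

def fp8_quantize(x: np.ndarray, amax: float) -> tuple[np.ndarray, float]:
    """Illustrative per-tensor FP8 quantization: scale the tensor into
    the E4M3 range and clip. Real kernels round to actual E4M3 bit
    patterns; here we model only the range, not the mantissa."""
    scale = E4M3_MAX / max(amax, 1e-12)
    x_fp8 = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    return x_fp8, scale

def fp8_dequantize(x_fp8: np.ndarray, scale: float) -> np.ndarray:
    return x_fp8 / scale

x = np.random.randn(1024).astype(np.float32) * 3.0
amax = float(np.abs(x).max())   # in practice, a history of amax values is tracked
x_q, scale = fp8_quantize(x, amax)
x_back = fp8_dequantize(x_q, scale)
print("max abs error:", float(np.abs(x - x_back).max()))
```

Getting `amax` right is the whole game: scale from a stale or wrong amax and values either clip (gradients explode) or underflow to zero (training silently stalls).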
MIG Partitioning
Multi-Instance GPU divides one A100 or H100 into up to 7 fully isolated GPU slices, each with its own SMs and HBM partition. The right answer when one job cannot fill a whole GPU.
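A sketch of enumerating MIG slices with NVIDIA's NVML Python bindings (pynvml). The calls follow the documented NVML MIG API, but return shapes vary by binding version, so treat this as illustrative:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# MIG mode is a device-level switch with a current and a pending state.
current, pending = pynvml.nvmlDeviceGetMigMode(handle)
print("MIG enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)

# Walk the MIG slots; unpopulated slots raise an NVML error.
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
    except pynvml.NVMLError:
        continue
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG slice {i}: {mem.total / 2**30:.1f} GiB HBM")

pynvml.nvmlShutdown()
```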
MPS (Multi-Process Service)
NVIDIA's cooperative GPU-sharing layer. Multiple CUDA processes submit work through one shared scheduler on the GPU, avoiding per-process context switches. Faster than default time-sliced sharing, but with no fault isolation: one misbehaving client can take down its peers.
Roofline Analysis at Cluster Scale
Bounds achievable performance by min(peak compute, peak bandwidth times arithmetic intensity), extended across HBM, NVLink, and InfiniBand. Tells you which bandwidth tier bounds your throughput.
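A sketch with illustrative H100-class numbers (the peak and bandwidths below are ballpark spec figures, not measurements, and the per-tier intensities are invented for illustration; a real analysis derives them from tensor shapes and the parallelism plan):

```python
PEAK_FLOPS = 1979e12           # FP8 dense tensor-core peak, FLOP/s
BW = {                         # bytes/s each data path delivers
    "HBM":        3.35e12,
    "NVLink":     900e9,
    "InfiniBand": 50e9,        # one 400 Gb/s NIC
}
# Arithmetic intensity per tier: FLOPs executed per byte that actually
# crosses that tier. Far more FLOPs amortize each InfiniBand byte than
# each HBM byte, which is why the tiers have different intensities.
INTENSITY = {"HBM": 300.0, "NVLink": 4_000.0, "InfiniBand": 60_000.0}

# Roofline bound per tier: min(peak compute, bandwidth * intensity).
roofs = {t: min(PEAK_FLOPS, BW[t] * INTENSITY[t]) for t in BW}
for t, r in roofs.items():
    print(f"{t:>10}: {r / 1e12:7.0f} TFLOPS roof")

tier, roof = min(roofs.items(), key=lambda kv: kv[1])
print(f"bound by {tier}: {roof / 1e12:.0f} TFLOPS achievable")
```

With these numbers the HBM roof sits below peak while NVLink and InfiniBand clear it, so the step is HBM-bound; raise the per-byte work at that tier (bigger tiles, fused kernels) or accept the roof.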
SM Occupancy
Fraction of an SM's warp slots filled by resident warps. Low occupancy means the scheduler has too few warps to hide HBM latency. The first number NVIDIA's profilers point you at.
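A simplified occupancy model, assuming A100-like per-SM limits (64 warp slots, 65,536 registers, 164 KB shared memory); real calculators such as CUDA's cudaOccupancyMaxActiveBlocksPerMultiprocessor also account for allocation granularity and block-count limits:

```python
def occupancy(threads_per_block: int,
              regs_per_thread: int,
              smem_per_block: int,
              max_warps: int = 64,
              regs_per_sm: int = 65_536,
              smem_per_sm: int = 164 * 1024) -> float:
    """Resident blocks per SM are capped by whichever resource runs out
    first; occupancy is resident warps over the SM's warp slots."""
    warps_per_block = -(-threads_per_block // 32)  # ceil division
    by_warps = max_warps // warps_per_block
    by_regs  = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem  = smem_per_sm // smem_per_block if smem_per_block else by_warps
    blocks = min(by_warps, by_regs, by_smem)
    return blocks * warps_per_block / max_warps

# A register-hungry kernel: 256 threads, 128 regs/thread, 48 KB smem/block.
print(f"{occupancy(256, 128, 48 * 1024):.0%}")  # registers cap it at 25%
```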
Tensor Core Throughput at Scale
Peak tensor-core TFLOPS rarely match real training throughput once HBM bandwidth, kernel-launch overhead, and collective sync are accounted for. The gap is where the engineering happens.
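One way to measure that gap is model FLOPs utilization (MFU): achieved training FLOP/s from the common ~6 FLOPs per parameter per token estimate, divided by the hardware peak. A sketch with illustrative numbers:

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs utilization: achieved training FLOP/s over peak.
    Uses the common 6 * params FLOPs-per-token estimate (forward plus
    backward), ignoring attention FLOPs; coarse, but it shows the gap."""
    achieved = 6 * params * tokens_per_sec
    return achieved / peak_flops

# Illustrative: a 70B-parameter model at 250k tokens/s on 256 GPUs,
# each with a 989 TFLOPS BF16 dense peak (H100-class spec number).
cluster_peak = 256 * 989e12
print(f"MFU: {mfu(70e9, 250_000, cluster_peak):.1%}")  # ~41%
```

The remaining ~59% is not waste you can wish away: it is HBM stalls, launch gaps, and collectives, which is exactly where the engineering happens.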
Transformer Engine
NVIDIA's library that handles FP8 per-tensor scaling and amax-history tracking inside transformer layers. Without that scale management, FP8 training can silently diverge within hundreds of steps.
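A minimal sketch of the usage pattern from Transformer Engine's documentation; `te.Linear`, `te.fp8_autocast`, and `DelayedScaling` are Transformer Engine names, but exact constructor arguments vary across versions:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed scaling: scales come from a history of amax values rather
# than the current tensor, so kernels stay fused and fast. HYBRID
# means E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)          # matmul runs through FP8 tensor cores
y.sum().backward()        # gradients tracked with E5M2 scaling
```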
Warp-level Throughput
A warp is 32 threads executing in lockstep on one SM. Divergence and memory serialization waste cycles you already paid for. Warp efficiency is the single most ignored metric in GPU profiling.
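A toy model of the cost, assuming classic lockstep semantics (post-Volta independent thread scheduling complicates the picture but not the intuition): each distinct branch path taken within a warp issues serially with the other threads masked off.

```python
def warp_efficiency(branch_taken: list[bool]) -> float:
    """Toy divergence model: if both branch outcomes are present in a
    warp, each path executes as a separate pass with only its threads
    active, so efficiency is active threads per issued pass over 32."""
    assert len(branch_taken) == 32
    taken = sum(branch_taken)
    paths = [n for n in (taken, 32 - taken) if n > 0]
    return sum(paths) / (32 * len(paths))

print(f"{warp_efficiency([True] * 32):.0%}")                 # uniform: 100%
print(f"{warp_efficiency([True] * 16 + [False] * 16):.0%}")  # even split: 50%
print(f"{warp_efficiency([True] + [False] * 31):.0%}")       # one stray thread: 50%
```

Note the last case: a single divergent thread costs the same as a 16/16 split, because the warp still issues both paths in full.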