Scale Atlas · Chapter 1 of 8 · 8 terms · Updated 2026-05-10
Compute
Peak FLOPs are easy. Achieved FLOPs are everything. The compute story starts at the 32-thread warp, climbs through SM occupancy and tensor-core utilization, and ends at the cluster-scale roofline. FP8 doubles throughput when scaling holds. MIG and MPS slice one GPU when a single job cannot fill it. Every term in this chapter names a gap between the marketed peak and what your training step actually delivers.
FP8 Numerics
8-bit floating point in two flavors that halves memory traffic and compute cost vs FP16, when scaling is correct: E4M3 trades range for precision (weights, activations); E5M2 trades precision for range (gradients).
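A minimal sketch of per-tensor scaling, assuming only the standard format constants (E4M3 tops out at 448, E5M2 at 57344). `fp8_quantize` and `fp8_dequantize` are illustrative helpers, not a library API, and only the range behavior is modeled, not mantissa rounding:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value; E5M2 would be 57344.0

def fp8_quantize(x: np.ndarray, amax: float) -> tuple[np.ndarray, float]:
    """Illustrative per-tensor FP8 quantization: scale the tensor into
    the E4M3 range and clip. Real kernels round to actual E4M3 bit
    patterns; here we model only the range, not the mantissa."""
    scale = E4M3_MAX / max(amax, 1e-12)
    x_fp8 = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    return x_fp8, scale

def fp8_dequantize(x_fp8: np.ndarray, scale: float) -> np.ndarray:
    return x_fp8 / scale

x = np.random.randn(1024).astype(np.float32) * 3.0
amax = float(np.abs(x).max())   # in practice, a history of amax values is tracked
x_q, scale = fp8_quantize(x, amax)
x_back = fp8_dequantize(x_q, scale)
print("max abs error:", float(np.abs(x - x_back).max()))
```

Getting `amax` right is the whole game: scale from a stale or wrong amax and values either clip (gradients explode) or underflow to zero (training silently stalls).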
MIG Partitioning
Multi-Instance GPU divides one A100 or H100 into up to 7 fully isolated GPU slices, each with its own SMs and HBM partition. The right answer when one job cannot fill a whole GPU.
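A sketch of enumerating MIG slices with NVIDIA's NVML Python bindings (pynvml). The calls follow the documented NVML MIG API, but return shapes vary by binding version, so treat this as illustrative:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# MIG mode is a device-level switch with a current and a pending state.
current, pending = pynvml.nvmlDeviceGetMigMode(handle)
print("MIG enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)

# Walk the MIG slots; unpopulated slots raise an NVML error.
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
    except pynvml.NVMLError:
        continue
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG slice {i}: {mem.total / 2**30:.1f} GiB HBM")

pynvml.nvmlShutdown()
```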
MPS (Multi-Process Service)
NVIDIA's cooperative GPU-sharing layer. Multiple CUDA processes submit work through one shared scheduler on the GPU, avoiding per-process context switches. Faster than default time-sliced sharing, but with no fault isolation: one misbehaving client can take down its peers.
Roofline Analysis at Cluster Scale
Bounds achievable performance by min(peak compute, peak bandwidth times arithmetic intensity), extended across HBM, NVLink, and InfiniBand. Tells you which bandwidth tier bounds your throughput.
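A sketch with illustrative H100-class numbers (the peak and bandwidths below are ballpark spec figures, not measurements, and the per-tier intensities are invented for illustration; a real analysis derives them from tensor shapes and the parallelism plan):

```python
PEAK_FLOPS = 1979e12           # FP8 dense tensor-core peak, FLOP/s
BW = {                         # bytes/s each data path delivers
    "HBM":        3.35e12,
    "NVLink":     900e9,
    "InfiniBand": 50e9,        # one 400 Gb/s NIC
}
# Arithmetic intensity per tier: FLOPs executed per byte that actually
# crosses that tier. Far more FLOPs amortize each InfiniBand byte than
# each HBM byte, which is why the tiers have different intensities.
INTENSITY = {"HBM": 300.0, "NVLink": 4_000.0, "InfiniBand": 60_000.0}

# Roofline bound per tier: min(peak compute, bandwidth * intensity).
roofs = {t: min(PEAK_FLOPS, BW[t] * INTENSITY[t]) for t in BW}
for t, r in roofs.items():
    print(f"{t:>10}: {r / 1e12:7.0f} TFLOPS roof")

tier, roof = min(roofs.items(), key=lambda kv: kv[1])
print(f"bound by {tier}: {roof / 1e12:.0f} TFLOPS achievable")
```

With these numbers the HBM roof sits below peak while NVLink and InfiniBand clear it, so the step is HBM-bound; raise the per-byte work at that tier (bigger tiles, fused kernels) or accept the roof.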
SM Occupancy
Fraction of an SM's warp slots filled by resident warps. Low occupancy means the scheduler has too few warps to hide HBM latency. The first number NVIDIA's profilers point you at.
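A simplified occupancy model, assuming A100-like per-SM limits (64 warp slots, 65,536 registers, 164 KB shared memory); real calculators such as CUDA's cudaOccupancyMaxActiveBlocksPerMultiprocessor also account for allocation granularity and block-count limits:

```python
def occupancy(threads_per_block: int,
              regs_per_thread: int,
              smem_per_block: int,
              max_warps: int = 64,
              regs_per_sm: int = 65_536,
              smem_per_sm: int = 164 * 1024) -> float:
    """Resident blocks per SM are capped by whichever resource runs out
    first; occupancy is resident warps over the SM's warp slots."""
    warps_per_block = -(-threads_per_block // 32)  # ceil division
    by_warps = max_warps // warps_per_block
    by_regs  = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem  = smem_per_sm // smem_per_block if smem_per_block else by_warps
    blocks = min(by_warps, by_regs, by_smem)
    return blocks * warps_per_block / max_warps

# A register-hungry kernel: 256 threads, 128 regs/thread, 48 KB smem/block.
print(f"{occupancy(256, 128, 48 * 1024):.0%}")  # registers cap it at 25%
```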
Tensor Core Throughput at Scale
Peak tensor-core TFLOPS rarely match real training throughput once HBM bandwidth, kernel-launch overhead, and collective sync are accounted for. The gap is where the engineering happens.
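One way to measure that gap is model FLOPs utilization (MFU): achieved training FLOP/s from the common ~6 FLOPs per parameter per token estimate, divided by the hardware peak. A sketch with illustrative numbers:

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs utilization: achieved training FLOP/s over peak.
    Uses the common 6 * params FLOPs-per-token estimate (forward plus
    backward), ignoring attention FLOPs; coarse, but it shows the gap."""
    achieved = 6 * params * tokens_per_sec
    return achieved / peak_flops

# Illustrative: a 70B-parameter model at 250k tokens/s on 256 GPUs,
# each with a 989 TFLOPS BF16 dense peak (H100-class spec number).
cluster_peak = 256 * 989e12
print(f"MFU: {mfu(70e9, 250_000, cluster_peak):.1%}")  # ~41%
```

The remaining ~59% is not waste you can wish away: it is HBM stalls, launch gaps, and collectives, which is exactly where the engineering happens.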
Transformer Engine
NVIDIA's library that handles FP8 per-tensor scaling and amax-history tracking inside transformer layers. Without that scale management, FP8 training can silently diverge within hundreds of steps.
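A minimal sketch of the usage pattern from Transformer Engine's documentation; `te.Linear`, `te.fp8_autocast`, and `DelayedScaling` are Transformer Engine names, but exact constructor arguments vary across versions:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed scaling: scales come from a history of amax values rather
# than the current tensor, so kernels stay fused and fast. HYBRID
# means E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)          # matmul runs through FP8 tensor cores
y.sum().backward()        # gradients tracked with E5M2 scaling
```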
Warp-level Throughput
A warp is 32 threads executing in lockstep on one SM. Divergence and memory serialization waste cycles you already paid for. Warp efficiency is the single most ignored metric in GPU profiling.
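A toy model of the cost, assuming classic lockstep semantics (post-Volta independent thread scheduling complicates the picture but not the intuition): each distinct branch path taken within a warp issues serially with the other threads masked off.

```python
def warp_efficiency(branch_taken: list[bool]) -> float:
    """Toy divergence model: if both branch outcomes are present in a
    warp, each path executes as a separate pass with only its threads
    active, so efficiency is active threads per issued pass over 32."""
    assert len(branch_taken) == 32
    taken = sum(branch_taken)
    paths = [n for n in (taken, 32 - taken) if n > 0]
    return sum(paths) / (32 * len(paths))

print(f"{warp_efficiency([True] * 32):.0%}")                 # uniform: 100%
print(f"{warp_efficiency([True] * 16 + [False] * 16):.0%}")  # even split: 50%
print(f"{warp_efficiency([True] + [False] * 31):.0%}")       # one stray thread: 50%
```

Note the last case: a single divergent thread costs the same as a 16/16 split, because the warp still issues both paths in full.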