FP8 Numerics

An 8-bit floating-point format in two flavors (E4M3 and E5M2) that halves memory and compute cost vs FP16, provided scaling is handled correctly.
Bytes: 1 / element
Hardware: H100, B200
vs FP16: ~2x perf

FP8 is a pair of 8-bit floating-point formats (E4M3 and E5M2) introduced with NVIDIA's Hopper generation. They cut memory bandwidth and compute cost in half versus FP16 while preserving enough numeric range to train modern transformers, provided the surrounding software stack handles per-tensor scaling correctly.

The two formats

E4M3 has 4 exponent bits and 3 mantissa bits, trading dynamic range for precision. E5M2 has 5 exponent bits and 2 mantissa bits, trading precision for dynamic range. NVIDIA's Transformer Engine uses E4M3 for forward activations and E5M2 for gradients during backprop, because gradients span a wider dynamic range than activations.

[Diagram: bit layouts of the two formats. E4M3: 1 sign, 4 exponent, 3 mantissa bits (precision-biased). E5M2: 1 sign, 5 exponent, 2 mantissa bits (dynamic-range-biased).]
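
The difference in representable range is easy to check with PyTorch's native float8 dtypes; a minimal sketch, assuming PyTorch 2.1 or newer:

# Largest finite values of the two FP8 formats (PyTorch float8 dtypes)
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0   -- E4M3: more precision, less range
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0 -- E5M2: more range, less precision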

Why it matters at fleet scale

A 70B parameter model in FP8 fits in roughly 70 GB of HBM versus 140 GB in FP16. That difference is the gap between fitting on a single H100 SXM5 (80 GB HBM3) and needing two GPUs plus an NVLink domain to span them. Across a 128-GPU pretraining run, the saved HBM compounds: more batch fits per step, fewer gradient-accumulation cycles, less collective traffic.
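
A back-of-the-envelope version of that arithmetic (weights only; optimizer state, activations, and KV cache excluded), as a minimal sketch:

# Weight memory = parameter count * bytes per element
def weight_gb(params: float, bytes_per_elem: int) -> float:
    return params * bytes_per_elem / 1e9

print(weight_gb(70e9, 1))  # ~70 GB in FP8
print(weight_gb(70e9, 2))  # ~140 GB in FP16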

The catch: per-tensor scaling

Without scaling, gradients underflow in E5M2 and training quietly diverges. The Transformer Engine handles this by tracking amax per tensor and rescaling on the fly:

# Transformer Engine, simplified
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

# HYBRID recipe: E4M3 for forward activations and weights, E5M2 for backward gradients
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

linear = te.Linear(in_features=4096, out_features=4096)
x = torch.randn(32, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = linear(x)
# amax history is tracked per tensor; weights keep a higher-precision master copy

If you run FP8 without TE or an equivalent scaling library, expect loss spikes within hundreds of steps. The failure looks like a numerics issue rather than a hardware fault, which makes it harder to diagnose: the same loss-spike pattern can come from a flaky GPU, a bad data shard, or an FP8 amax overflow. See Stragglers and Blast Radius for the operational angle.
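
One hypothetical way to surface the failure early is a rolling check of the training loss against its recent median; the window and ratio below are illustrative choices, not part of Transformer Engine:

# Flag a step whose loss jumps well above the recent median
from collections import deque
from statistics import median

def make_spike_detector(window: int = 100, ratio: float = 1.5):
    history = deque(maxlen=window)
    def check(loss: float) -> bool:
        spiked = len(history) == window and loss > ratio * median(history)
        history.append(loss)
        return spiked
    return check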

Practical guidance

  1. Use Transformer Engine on H100 and B200 for transformer workloads. The win is real and the integration cost is small.
  2. Validate the loss curve against the FP16 baseline for the first thousand steps before scaling out.
  3. Keep weights in an FP16 or BF16 master copy. FP8 is for activations and matmul throughput, not for storage of trained parameters.
  4. Profile the actual HBM occupancy (see the sketch after this list). If activations are not the bottleneck, FP8 will not buy you much.
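
A minimal sketch for the occupancy check in item 4, using PyTorch's CUDA memory counters (the ~90% threshold mentioned in the comment is an illustrative assumption):

# Peak HBM usage after a few warmup steps; if well below ~90% of capacity, FP8's memory savings matter less
import torch

def report_hbm_usage(device: int = 0) -> float:
    total = torch.cuda.get_device_properties(device).total_memory
    peak = torch.cuda.max_memory_allocated(device)
    frac = peak / total
    print(f"peak {peak / 1e9:.1f} GB of {total / 1e9:.1f} GB ({frac:.0%})")
    return frac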

Updated 2026-05-09