NVMe over Fabrics

NVMe-oF exposes remote NVMe drives over RDMA so that applications use the same NVMe queue interface they would for a local SSD. Latency overhead is roughly +1-2 us; bandwidth is fabric-limited.
Protocol: NVMe over RDMA (RoCE or InfiniBand) or TCP
Latency: local NVMe ~10 us; NVMe-oF ~11-12 us
Use: shared NVMe pools, ephemeral scratch

A GPU server with 8 H100s and 8 NVMe drives is a sensible starting point until you realize that some workloads need 100 TB of fast storage and others need 1 TB. Local NVMe is too coarse-grained for elastic provisioning. NVMe over Fabrics (NVMe-oF) decouples the drive from the server: the drive lives in a separate JBOF (just a bunch of flash) chassis, the GPU server reads it over RDMA, and the application sees the same NVMe queue interface it would for local SSD.

What it actually does

NVMe-oF is a thin layer that wraps NVMe submission and completion queues in RDMA messages. The application calls pread() or uses cuFile, the kernel's NVMe-oF host driver packages the submission queue entry as a command capsule, the NIC carries the capsule to the target as an RDMA send, the target's NIC hands it to the NVMe controller, and the SSD performs the I/O. Data moves by RDMA read or write initiated by the target, and the completion queue entry comes back over the same connection. From the application's perspective, this is indistinguishable from a local NVMe read; from the fabric's perspective, it is just more RDMA traffic.
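
To make "same interface" concrete, here is a minimal Python sketch of a 4 KiB direct-I/O read against an NVMe namespace. The device paths are placeholders, reading raw block devices needs root, and the call does not change when the namespace behind the path is mounted over NVMe-oF rather than local PCIe.

```python
import mmap
import os

def read_4k_direct(device: str, offset: int) -> bytes:
    # O_DIRECT bypasses the page cache, so the request reaches the NVMe driver
    # (and, for a remote namespace, the RDMA transport) as a real 4 KiB I/O.
    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, 4096)
    fd = os.open(device, os.O_RDONLY | os.O_DIRECT)
    try:
        os.preadv(fd, [buf], offset)
        return bytes(buf)
    finally:
        os.close(fd)
        buf.close()

# Identical call for a local PCIe namespace and one attached over NVMe-oF.
data_local = read_4k_direct("/dev/nvme0n1", 0)
data_remote = read_4k_direct("/dev/nvme1n1", 0)
```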

[Diagram: local NVMe (PCIe): GPU server to NVMe SSD over PCIe Gen5, ~10 us latency, ~14 GB/s per drive. NVMe-oF (RDMA): GPU server to NIC, over RDMA to the target NIC and NVMe SSD, ~11-12 us latency, fabric-limited bandwidth. Same NVMe queue interface at the application layer, different transport in between.]

NVMe-oF lets you build pools of remote SSD that look local to the application.

The wire protocol comes in three flavors. NVMe-oF over RDMA is the dominant production form: it runs on InfiniBand or RoCE and uses the same RDMA verbs as GPUDirect RDMA. NVMe-oF over Fibre Channel serves storage networks that already speak FC. NVMe-oF over TCP exists but trades RDMA offload for the kernel TCP stack, adds latency, and is rarely used for AI training.
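
Before any of this is visible to an application, the host has to attach to the target; with the standard nvme-cli tooling that is a discover followed by a connect. A hedged sketch, wrapped in Python for consistency with the other examples; the address, port, and subsystem NQN are placeholders, and it assumes nvme-cli and the nvme-rdma transport module are present.

```python
import subprocess

TARGET_ADDR = "192.168.100.8"                   # hypothetical JBOF target address
SUBSYS_NQN = "nqn.2024-01.com.example:jbof1"    # hypothetical subsystem NQN

# Ask the target which subsystems it exports.
subprocess.run(
    ["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", "4420"],
    check=True,
)

# Connect; the remote namespaces then show up as ordinary /dev/nvmeXnY devices.
subprocess.run(
    ["nvme", "connect", "-t", "rdma", "-n", SUBSYS_NQN, "-a", TARGET_ADDR, "-s", "4420"],
    check=True,
)
```

Swapping -t rdma for -t tcp selects the TCP transport; the command shape stays the same.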

The latency story

Local NVMe inside the same PCIe domain runs roughly 10 microseconds for a 4K random read, dominated by the SSD controller's processing time. NVMe-oF adds the round-trip time of one RDMA exchange, which on a tuned IB or RoCE fabric is roughly 1-2 microseconds. So NVMe-oF latency is roughly 11-12 microseconds, a 10-20% increase over local. Sequential bandwidth tops out at the fabric line rate scaled by RDMA protocol efficiency, which on NDR (400 Gb/s, about 50 GB/s per port) gives roughly 45 GB/s sustained per connection.
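
A back-of-envelope restatement of those numbers (the round-trip midpoint and the efficiency factor are assumptions, not measurements):

```python
# Latency: local controller time plus one RDMA round trip.
local_latency_us = 10.0          # 4K random read, local PCIe NVMe
rdma_round_trip_us = 1.5         # tuned IB/RoCE fabric, midpoint of 1-2 us
nvmeof_latency_us = local_latency_us + rdma_round_trip_us
overhead = rdma_round_trip_us / local_latency_us
print(f"NVMe-oF 4K read: ~{nvmeof_latency_us:.1f} us ({overhead:.0%} over local)")

# Bandwidth: line rate scaled by an assumed protocol efficiency.
ndr_line_rate_gbs = 50.0         # NDR InfiniBand: 400 Gb/s ~= 50 GB/s per port
rdma_efficiency = 0.90           # assumed header/protocol overhead factor
sustained_gbs = ndr_line_rate_gbs * rdma_efficiency
print(f"Sustained sequential bandwidth: ~{sustained_gbs:.0f} GB/s per connection")
```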

The latency increase is small enough that for most AI training workloads (which are bandwidth-bound, not latency-bound), NVMe-oF is indistinguishable from local. The bandwidth is what changes: if your storage fabric has more bandwidth per server than your local PCIe lanes provide, NVMe-oF can actually be faster than local for some workloads.

Why it changes how clusters are designed

The big win of NVMe-oF is provisioning flexibility. With local NVMe, the storage capacity per GPU server is fixed at build time (e.g., 8x 7.68 TB ≈ 61 TB per server). With NVMe-oF, the drives live in JBOF chassis whose pooled capacity can be carved up unevenly: one GPU server might mount 100 TB for a large dataset, another might mount 5 TB for a small one. The total fleet capacity stays the same; the per-server slice is elastic.
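
A toy model of that carving, using the capacities above; Jbof and carve are illustrative names, not a real provisioning API.

```python
from dataclasses import dataclass, field

@dataclass
class Jbof:
    """Illustrative capacity model for one NVMe-oF pool."""
    total_tb: float
    namespaces: dict[str, float] = field(default_factory=dict)

    def carve(self, host: str, size_tb: float) -> None:
        # Hand a slice of the pool to one host; reject over-allocation.
        used = sum(self.namespaces.values())
        if used + size_tb > self.total_tb:
            raise ValueError("JBOF capacity exhausted")
        self.namespaces[host] = size_tb

# A pool built from eight 61 TB chassis can hand out uneven slices:
pool = Jbof(total_tb=8 * 61)
pool.carve("gpu-server-01", 100)   # large dataset
pool.carve("gpu-server-02", 5)     # small dataset
print(pool.namespaces)
```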

This is what enables shared NVMe pools backing parallel filesystems. WekaFS, for example, builds its global namespace on top of NVMe-oF: every storage node exports its NVMe drives as NVMe-oF targets, and every client mounts them as a single pool. Lustre vs WekaFS covers the architectural differences in detail.

When NVMe-oF replaces local

The decision tree is roughly as follows (sketched in code after the list):

  • For ephemeral scratch space (training intermediate buffers, cache for object-store reads): local NVMe is simpler and just as fast.
  • For shared dataset storage that multiple jobs need: NVMe-oF as the backing layer for a parallel filesystem is the standard pattern.
  • For checkpoints that need to outlive the GPU server: NVMe-oF with replication, or a parallel filesystem on top of it.
  • For long-term archival: object store (S3 tier for training), not NVMe.
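
The same tree as a literal lookup, purely illustrative; the keys are shorthand for the bullet categories above.

```python
def storage_choice(workload: str) -> str:
    # Maps the workload categories above to the storage tier named in the text.
    return {
        "ephemeral-scratch": "local NVMe",
        "shared-dataset": "NVMe-oF behind a parallel filesystem",
        "checkpoints": "NVMe-oF with replication, or a parallel FS on top",
        "archival": "object store (S3 tier), not NVMe",
    }[workload]

print(storage_choice("shared-dataset"))
```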

What this means in practice

  • NVMe-oF over RDMA is the production form. Use IB or RoCE, not TCP, unless you have a specific reason to avoid RDMA.
  • The fabric tier matters. NVMe-oF on a rail-optimized fat-tree will share the IB fabric with collective traffic; provision the bisection bandwidth accordingly.
  • Pair NVMe-oF with GPUDirect Storage for full bypass: the NIC DMAs straight into HBM, and the JBOF NVMe controller DMAs straight onto the wire. No host memory anywhere on the path (see the sketch after this list).
  • Failure modes look different. Local NVMe failure takes down one GPU server; NVMe-oF target failure can affect every client mounting that target. Replication and pool-level redundancy matter.
  • For provisioning: think about NVMe-oF as software-defined block storage. The same JBOF can support 10 different namespaces with different sizes, IOPS guarantees, and replication levels. Most production deployments put a thin orchestration layer (namespace management, NVMe multipathing, a vendor-specific provisioner) on top of the raw protocol.
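
For the GPUDirect Storage bullet, a hedged sketch of what the application-level read can look like, assuming the kvikio Python bindings for cuFile, a CUDA GPU, and a GDS-supported filesystem mounted from the NVMe-oF pool; the path and buffer size are placeholders.

```python
import cupy
import kvikio

# Destination buffer lives in GPU memory (HBM), not host memory.
buf = cupy.empty(1 << 20, dtype=cupy.uint8)

# cuFile/GDS read: the data is DMA'd into device memory without a host bounce buffer.
with kvikio.CuFile("/mnt/nvmeof/shard-000.bin", "r") as f:
    n = f.read(buf)

print(f"read {n} bytes directly into device memory")
```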

NVMe-oF is the protocol that makes "shared NVMe" a meaningful concept. Without it, NVMe is locked to whichever server it is plugged into.

Updated 2026-05-10