NVMe over Fabrics
A GPU server with 8 H100s and 8 NVMe drives is a sensible starting point until you realize that some workloads need 100 TB of fast storage and others need 1 TB. Local NVMe is too coarse-grained for elastic provisioning. NVMe over Fabrics (NVMe-oF) decouples the drive from the server: the drive lives in a separate JBOF (just a bunch of flash) chassis, the GPU server reads it over RDMA, and the application sees the same NVMe queue interface it would for local SSD.
What it actually does
NVMe-oF is a small layer that wraps NVMe submission and completion queues in RDMA messages. The application calls pread() or uses cuFile, the kernel's NVMe fabrics driver builds a command capsule containing the submission queue entry, the NIC carries that capsule to the target as an RDMA send, the target's controller executes the I/O against the SSD, and the data itself moves via RDMA read or write directly between the drive and host memory. The completion capsule comes back the same way. From the application's perspective, this is indistinguishable from a local NVMe read; from the fabric's perspective, it is just another RDMA message.
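The capsule exchange can be sketched as a toy model. This is purely illustrative: the field names are simplified stand-ins for the real capsule layout defined by the NVMe-oF specification, and target_execute() is a hypothetical placeholder for the target side.

```python
from dataclasses import dataclass

# Illustrative model of an NVMe-oF command/response capsule exchange.
# Field names are simplified; real capsules follow the NVMe-oF spec layout.

@dataclass
class CommandCapsule:
    cid: int     # command identifier, echoed back in the completion
    opcode: int  # 0x02 = NVMe Read
    nsid: int    # namespace being addressed
    slba: int    # starting logical block address
    nlb: int     # number of logical blocks (zero-based per the spec)

@dataclass
class ResponseCapsule:
    cid: int
    status: int  # 0 = success

def target_execute(cmd: CommandCapsule) -> ResponseCapsule:
    """Hypothetical stand-in for the target: the controller performs the
    I/O, moves data with RDMA read/write, then returns a completion."""
    return ResponseCapsule(cid=cmd.cid, status=0)

# Host side: submit a 4 KiB read (8 blocks of 512 B) and match the
# completion back to the submission by command id.
cmd = CommandCapsule(cid=7, opcode=0x02, nsid=1, slba=0, nlb=7)
rsp = target_execute(cmd)
assert rsp.cid == cmd.cid and rsp.status == 0
```

The command-id matching is the important part: it is what lets many commands be in flight on one queue pair at once, exactly as with a local NVMe submission queue.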
The wire protocol comes in three flavors. NVMe-oF over RDMA is the dominant production form: it runs on InfiniBand or RoCE and uses the same RDMA verbs as GPUDirect RDMA. NVMe-oF over Fibre Channel is for storage networks that already speak FC. NVMe-oF over TCP runs on any Ethernet but pays for the kernel TCP stack in latency and CPU, so it is rarely used for AI training.
The latency story
Local NVMe inside the same PCIe domain runs roughly 10 microseconds for a 4K random read, dominated by the SSD's controller processing time. NVMe-oF adds the round-trip time of one RDMA exchange, which on a tuned IB or RoCE fabric is roughly 1-2 microseconds. So NVMe-oF latency is 11-12 microseconds, a 10-20% increase over local. Sequential bandwidth tops out at the fabric line rate times protocol efficiency, which on NDR (50 GB/s per port) gives roughly 45 GB/s sustained per connection.
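The arithmetic behind those numbers, worked through explicitly. The ~0.9 protocol efficiency is an assumption consistent with the 45 GB/s figure in the text, not a measured constant:

```python
# Worked numbers from the text: local 4K read latency plus one RDMA
# round trip, and sustained bandwidth on an NDR port.

local_read_us = 10.0        # local NVMe 4K random read
rdma_rtt_us = (1.0, 2.0)    # tuned IB/RoCE round-trip range

remote_us = [local_read_us + rtt for rtt in rdma_rtt_us]
overhead_pct = [100 * rtt / local_read_us for rtt in rdma_rtt_us]
print(remote_us)       # [11.0, 12.0] microseconds
print(overhead_pct)    # [10.0, 20.0] percent over local

ndr_gb_s = 50.0             # NDR: 400 Gb/s = 50 GB/s per port
protocol_efficiency = 0.9   # assumed RDMA/NVMe-oF framing efficiency
print(ndr_gb_s * protocol_efficiency)  # 45.0 GB/s sustained
```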
The latency increase is small enough that for most AI training workloads (which are bandwidth-bound, not latency-bound), NVMe-oF is indistinguishable from local. The bandwidth is what changes: if your storage fabric has more bandwidth per server than your local PCIe lanes provide, NVMe-oF can actually be faster than local for some workloads.
Why it changes how clusters are designed
The big win of NVMe-oF is provisioning flexibility. With local NVMe, the storage capacity per GPU server is fixed at build time (e.g., 8x 7.68 TB = 61 TB per server). With NVMe-oF, the same 61 TB lives in a JBOF chassis that can be carved up: one GPU server might mount 100 TB for a large dataset, another might mount 5 TB for a small one. The total fleet capacity stays the same; the per-server slice is elastic.
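The elastic-carving idea can be sketched in a few lines. The carve() helper and server names are hypothetical, not a real provisioning API; the pool size assumes eight servers' worth of the 61 TB configuration above consolidated into JBOFs:

```python
# Sketch of elastic provisioning: one shared JBOF pool carved into
# per-server namespaces of different sizes. carve() is illustrative.

POOL_TB = 8 * 61  # eight servers' worth of drives pooled: 488 TB

def carve(pool_tb, requests):
    """Allocate namespace sizes from the pool; refuse oversubscription."""
    if sum(requests.values()) > pool_tb:
        raise ValueError("pool oversubscribed")
    return dict(requests)

# One server mounts a big slice for a large dataset, another a small one.
mounts = carve(POOL_TB, {"gpu-01": 100, "gpu-02": 5, "gpu-03": 61})
print(mounts["gpu-01"])                 # 100 TB for the big dataset
print(POOL_TB - sum(mounts.values()))   # capacity still unallocated
```

The invariant is the point: fleet capacity is fixed, but the per-server slice is whatever the job needs, which local NVMe cannot express.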
This is what enables shared NVMe pools backing parallel filesystems. WekaFS, for example, builds its global namespace on top of NVMe-oF: every storage node exports its NVMe drives as NVMe-oF targets, and every client mounts them as a single pool. Lustre vs WekaFS covers the architectural differences in detail.
When NVMe-oF replaces local
The decision tree is roughly:
- For ephemeral scratch space (training intermediate buffers, cache for object-store reads): local NVMe is simpler and just as fast.
- For shared dataset storage that multiple jobs need: NVMe-oF as the backing layer for a parallel filesystem is the standard pattern.
- For checkpoints that need to outlive the GPU server: NVMe-oF with replication, or a parallel filesystem on top of it.
- For long-term archival: object store (S3 tier for training), not NVMe.
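The decision tree above reduces to a lookup. The workload keys and tier strings are shorthand for the bullets, not product or API names:

```python
# The storage decision tree as a table; labels are shorthand for the
# patterns described above.

def storage_tier(workload: str) -> str:
    return {
        "scratch":        "local NVMe",                          # ephemeral, just as fast
        "shared-dataset": "NVMe-oF under a parallel filesystem", # multi-job access
        "checkpoint":     "NVMe-oF with replication",            # outlives the server
        "archive":        "object store",                        # long-term, S3 tier
    }[workload]

assert storage_tier("scratch") == "local NVMe"
assert storage_tier("archive") == "object store"
```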
What this means in practice
- NVMe-oF over RDMA is the production form. Use IB or RoCE, not TCP, unless you have a specific reason to avoid RDMA.
- The fabric tier matters. NVMe-oF on a rail-optimized fat-tree will share the IB fabric with collective traffic; provision the bisection bandwidth accordingly.
- Pair NVMe-oF with GPUDirect Storage for full bypass: the NIC DMAs straight into HBM, and the JBOF NVMe controller DMAs straight onto the wire. No host memory anywhere on the path.
- Failure modes look different. Local NVMe failure takes down one GPU server; NVMe-oF target failure can affect every client mounting that target. Replication and pool-level redundancy matter.
- For provisioning: think about NVMe-oF as software-defined block storage. The same JBOF can support 10 different namespaces with different sizes, IOPS guarantees, and replication levels. Most production deployments use a thin orchestration layer (multipath handling, namespace management, or a vendor-specific provisioner) on top of the raw protocol.
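A minimal sketch of what such an orchestration layer has to check, assuming illustrative pool limits and namespace attributes (the Namespace class and validate() are hypothetical, not any vendor's API):

```python
from dataclasses import dataclass

# Sketch of the "many namespaces on one JBOF" idea: each namespace
# carries its own size, IOPS guarantee, and replication level.
# All names and limits here are illustrative.

@dataclass
class Namespace:
    name: str
    size_tb: float
    iops_guarantee: int
    replicas: int

def validate(namespaces, pool_tb, pool_iops):
    # Replicated namespaces consume capacity on every replica.
    used_tb = sum(ns.size_tb * ns.replicas for ns in namespaces)
    used_iops = sum(ns.iops_guarantee for ns in namespaces)
    return used_tb <= pool_tb and used_iops <= pool_iops

plan = [
    Namespace("dataset-a", size_tb=100, iops_guarantee=200_000, replicas=1),
    Namespace("ckpt",      size_tb=20,  iops_guarantee=50_000,  replicas=2),
]
print(validate(plan, pool_tb=488, pool_iops=1_000_000))  # True
```

Replication multiplying capacity consumption is the detail that surprises people: a 20 TB checkpoint namespace with two replicas takes 40 TB out of the pool.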
NVMe-oF is the protocol that makes "shared NVMe" a meaningful concept. Without it, NVMe is locked to whichever server it is plugged into.
Updated 2026-05-10