NVMe over Fabrics

NVMe-oF exposes remote NVMe drives over RDMA so that applications use the same NVMe queue interface they would for a local SSD. Latency overhead is roughly +1-2 us; bandwidth is fabric-limited.
Protocol: NVMe over RDMA (RoCE or InfiniBand) or TCP
Latency: local NVMe ~10 us; NVMe-oF ~11-12 us
Use: shared NVMe pools, ephemeral scratch

A GPU server with 8 H100s and 8 NVMe drives is a sensible starting point until you realize that some workloads need 100 TB of fast storage and others need 1 TB. Local NVMe is too coarse-grained for elastic provisioning. NVMe over Fabrics (NVMe-oF) decouples the drive from the server: the drive lives in a separate JBOF (just a bunch of flash) chassis, the GPU server reads it over RDMA, and the application sees the same NVMe queue interface it would for local SSD.

What it actually does

NVMe-oF is a thin layer that wraps NVMe submission and completion queues in RDMA messages. The application calls pread() or uses cuFile, the kernel's NVMe-oF host driver packages the submission queue entry as a command capsule, the NIC carries the capsule to the target as an RDMA send, the target's NIC hands it to the NVMe controller, and the SSD performs the I/O. Data moves by RDMA read or write initiated by the target, and the completion queue entry comes back over the same connection. From the application's perspective, this is indistinguishable from a local NVMe read; from the fabric's perspective, it is just more RDMA traffic.
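
To make "same interface" concrete, here is a minimal Python sketch of a 4 KiB direct-I/O read against an NVMe namespace. The device paths are placeholders, reading raw block devices needs root, and the call does not change when the namespace behind the path is mounted over NVMe-oF rather than local PCIe.

```python
import mmap
import os

def read_4k_direct(device: str, offset: int) -> bytes:
    # O_DIRECT bypasses the page cache, so the request reaches the NVMe driver
    # (and, for a remote namespace, the RDMA transport) as a real 4 KiB I/O.
    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, 4096)
    fd = os.open(device, os.O_RDONLY | os.O_DIRECT)
    try:
        os.preadv(fd, [buf], offset)
        return bytes(buf)
    finally:
        os.close(fd)
        buf.close()

# Identical call for a local PCIe namespace and one attached over NVMe-oF.
data_local = read_4k_direct("/dev/nvme0n1", 0)
data_remote = read_4k_direct("/dev/nvme1n1", 0)
```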

[Diagram: local NVMe (PCIe): GPU server to NVMe SSD over PCIe Gen5, ~10 us latency, ~14 GB/s per drive. NVMe-oF (RDMA): GPU server to NIC, over RDMA to the target NIC and NVMe SSD, ~11-12 us latency, fabric-limited bandwidth. Same NVMe queue interface at the application layer, different transport in between.]

NVMe-oF lets you build pools of remote SSD that look local to the application.

The wire protocol comes in three flavors. NVMe-oF over RDMA is the dominant production form: it runs on InfiniBand or RoCE and uses the same RDMA verbs as GPUDirect RDMA. NVMe-oF over Fibre Channel serves storage networks that already speak FC. NVMe-oF over TCP exists but trades RDMA offload for the kernel TCP stack, adds latency, and is rarely used for AI training.
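
Before any of this is visible to an application, the host has to attach to the target; with the standard nvme-cli tooling that is a discover followed by a connect. A hedged sketch, wrapped in Python for consistency with the other examples; the address, port, and subsystem NQN are placeholders, and it assumes nvme-cli and the nvme-rdma transport module are present.

```python
import subprocess

TARGET_ADDR = "192.168.100.8"                   # hypothetical JBOF target address
SUBSYS_NQN = "nqn.2024-01.com.example:jbof1"    # hypothetical subsystem NQN

# Ask the target which subsystems it exports.
subprocess.run(
    ["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", "4420"],
    check=True,
)

# Connect; the remote namespaces then show up as ordinary /dev/nvmeXnY devices.
subprocess.run(
    ["nvme", "connect", "-t", "rdma", "-n", SUBSYS_NQN, "-a", TARGET_ADDR, "-s", "4420"],
    check=True,
)
```

Swapping -t rdma for -t tcp selects the TCP transport; the command shape stays the same.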

The latency story

Local NVMe inside the same PCIe domain runs roughly 10 microseconds for a 4K random read, dominated by the SSD controller's processing time. NVMe-oF adds the round-trip time of one RDMA exchange, which on a tuned IB or RoCE fabric is roughly 1-2 microseconds. So NVMe-oF latency is roughly 11-12 microseconds, a 10-20% increase over local. Sequential bandwidth tops out at the fabric line rate scaled by RDMA protocol efficiency, which on NDR (400 Gb/s, about 50 GB/s per port) gives roughly 45 GB/s sustained per connection.
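
A back-of-envelope restatement of those numbers (the round-trip midpoint and the efficiency factor are assumptions, not measurements):

```python
# Latency: local controller time plus one RDMA round trip.
local_latency_us = 10.0          # 4K random read, local PCIe NVMe
rdma_round_trip_us = 1.5         # tuned IB/RoCE fabric, midpoint of 1-2 us
nvmeof_latency_us = local_latency_us + rdma_round_trip_us
overhead = rdma_round_trip_us / local_latency_us
print(f"NVMe-oF 4K read: ~{nvmeof_latency_us:.1f} us ({overhead:.0%} over local)")

# Bandwidth: line rate scaled by an assumed protocol efficiency.
ndr_line_rate_gbs = 50.0         # NDR InfiniBand: 400 Gb/s ~= 50 GB/s per port
rdma_efficiency = 0.90           # assumed header/protocol overhead factor
sustained_gbs = ndr_line_rate_gbs * rdma_efficiency
print(f"Sustained sequential bandwidth: ~{sustained_gbs:.0f} GB/s per connection")
```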

The latency increase is small enough that for most AI training workloads (which are bandwidth-bound, not latency-bound), NVMe-oF is indistinguishable from local. The bandwidth is what changes: if your storage fabric has more bandwidth per server than your local PCIe lanes provide, NVMe-oF can actually be faster than local for some workloads.

Why it changes how clusters are designed

The big win of NVMe-oF is provisioning flexibility. With local NVMe, the storage capacity per GPU server is fixed at build time (e.g., 8x 7.68 TB ≈ 61 TB per server). With NVMe-oF, the drives live in JBOF chassis whose pooled capacity can be carved up unevenly: one GPU server might mount 100 TB for a large dataset, another might mount 5 TB for a small one. The total fleet capacity stays the same; the per-server slice is elastic.
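
A toy model of that carving, using the capacities above; Jbof and carve are illustrative names, not a real provisioning API.

```python
from dataclasses import dataclass, field

@dataclass
class Jbof:
    """Illustrative capacity model for one NVMe-oF pool."""
    total_tb: float
    namespaces: dict[str, float] = field(default_factory=dict)

    def carve(self, host: str, size_tb: float) -> None:
        # Hand a slice of the pool to one host; reject over-allocation.
        used = sum(self.namespaces.values())
        if used + size_tb > self.total_tb:
            raise ValueError("JBOF capacity exhausted")
        self.namespaces[host] = size_tb

# A pool built from eight 61 TB chassis can hand out uneven slices:
pool = Jbof(total_tb=8 * 61)
pool.carve("gpu-server-01", 100)   # large dataset
pool.carve("gpu-server-02", 5)     # small dataset
print(pool.namespaces)
```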

This is what enables shared NVMe pools backing parallel filesystems. WekaFS, for example, builds its global namespace on top of NVMe-oF: every storage node exports its NVMe drives as NVMe-oF targets, and every client mounts them as a single pool. Lustre vs WekaFS covers the architectural differences in detail.

When NVMe-oF replaces local

The decision tree is roughly as follows (sketched in code after the list):

  • For ephemeral scratch space (training intermediate buffers, cache for object-store reads): local NVMe is simpler and just as fast.
  • For shared dataset storage that multiple jobs need: NVMe-oF as the backing layer for a parallel filesystem is the standard pattern.
  • For checkpoints that need to outlive the GPU server: NVMe-oF with replication, or a parallel filesystem on top of it.
  • For long-term archival: object store (S3 tier for training), not NVMe.
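
The same tree as a literal lookup, purely illustrative; the keys are shorthand for the bullet categories above.

```python
def storage_choice(workload: str) -> str:
    # Maps the workload categories above to the storage tier named in the text.
    return {
        "ephemeral-scratch": "local NVMe",
        "shared-dataset": "NVMe-oF behind a parallel filesystem",
        "checkpoints": "NVMe-oF with replication, or a parallel FS on top",
        "archival": "object store (S3 tier), not NVMe",
    }[workload]

print(storage_choice("shared-dataset"))
```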

What this means in practice

  • NVMe-oF over RDMA is the production form. Use IB or RoCE, not TCP, unless you have a specific reason to avoid RDMA.
  • The fabric tier matters. NVMe-oF on a rail-optimized fat-tree will share the IB fabric with collective traffic; provision the bisection bandwidth accordingly.
  • Pair NVMe-oF with GPUDirect Storage for full bypass: the NIC DMAs straight into HBM, and the JBOF NVMe controller DMAs straight onto the wire. No host memory anywhere on the path (see the sketch after this list).
  • Failure modes look different. Local NVMe failure takes down one GPU server; NVMe-oF target failure can affect every client mounting that target. Replication and pool-level redundancy matter.
  • For provisioning: think about NVMe-oF as software-defined block storage. The same JBOF can support 10 different namespaces with different sizes, IOPS guarantees, and replication levels. Most production deployments put a thin orchestration layer (namespace management, NVMe multipathing, a vendor-specific provisioner) on top of the raw protocol.
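
For the GPUDirect Storage bullet, a hedged sketch of what the application-level read can look like, assuming the kvikio Python bindings for cuFile, a CUDA GPU, and a GDS-supported filesystem mounted from the NVMe-oF pool; the path and buffer size are placeholders.

```python
import cupy
import kvikio

# Destination buffer lives in GPU memory (HBM), not host memory.
buf = cupy.empty(1 << 20, dtype=cupy.uint8)

# cuFile/GDS read: the data is DMA'd into device memory without a host bounce buffer.
with kvikio.CuFile("/mnt/nvmeof/shard-000.bin", "r") as f:
    n = f.read(buf)

print(f"read {n} bytes directly into device memory")
```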

NVMe-oF is the protocol that makes "shared NVMe" a meaningful concept. Without it, NVMe is locked to whichever server it is plugged into.

Updated 2026-05-10