
GPUDirect Storage

GPUDirect Storage (GDS) lets an NVMe drive DMA training tensors straight into GPU HBM via cuFile, bypassing the CPU bounce buffer. Roughly 2x throughput vs traditional read.
Path: NVMe -> PCIe -> HBM, no CPU mem hop
API: cuFile (NVIDIA Magnum IO)
Throughput: ~2x vs traditional read on NDR cluster

A training step needs roughly 100 KB to 100 MB of fresh data per GPU, depending on batch size and sequence length. At 1,000 steps per minute, that adds up to a couple of GB/s of sustained read per GPU at the upper end. The path that read takes through the system, before it lands in HBM, decides whether storage feeds the GPUs or starves them.

The path that does the work

GPUDirect Storage (GDS) is the storage analog of GPUDirect RDMA: a kernel feature plus driver glue that lets an NVMe device perform DMA transfers directly to and from a GPU's HBM. The application calls the cuFile API (part of NVIDIA's Magnum IO stack), which translates the read or write into a PCIe DMA descriptor that the NVMe controller executes. The bytes flow NVMe -> PCIe switch -> GPU HBM without ever touching CPU memory.
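
To make the shape of the API concrete, here is a minimal sketch of a GDS read, assuming the CUDA toolkit's cufile.h, libcufile, and the nvidia-fs module are installed; the file path and transfer size are placeholders, and error handling is trimmed to the essentials.

    // Minimal cuFile read: NVMe -> HBM, no host bounce buffer.
    #include <cufile.h>
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const size_t nbytes = 1 << 20;             // 1 MiB, arbitrary size

        cuFileDriverOpen();                         // bring up the cuFile driver

        // O_DIRECT keeps the kernel page cache out of the transfer.
        int fd = open("/mnt/nvme/shard-0000.bin", O_RDONLY | O_DIRECT);

        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);      // import the fd into cuFile

        void *devPtr = nullptr;
        cudaMalloc(&devPtr, nbytes);                // destination lives in HBM
        cuFileBufRegister(devPtr, nbytes, 0);       // register the GPU buffer for DMA

        // The NVMe controller DMAs straight into devPtr.
        ssize_t got = cuFileRead(handle, devPtr, nbytes,
                                 /*file_offset=*/0, /*devPtr_offset=*/0);
        printf("read %zd bytes directly into HBM\n", got);

        cuFileBufDeregister(devPtr);
        cudaFree(devPtr);
        cuFileHandleDeregister(handle);
        close(fd);
        cuFileDriverClose();
        return 0;
    }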

[Figure: without GDS, NVMe -> CPU mem -> HBM takes two PCIe hops at ~6 GB/s per GPU; with GDS (cuFile), NVMe -> HBM is a single direct DMA at ~12 GB/s per GPU. cuFile keeps the CPU and host memory off the data path entirely.]

Without GDS, the same read takes the long way around. The NVMe driver issues a DMA into pinned host memory (one PCIe hop, NVMe-to-CPU). The application then either uses cudaMemcpy to push the buffer into HBM (a second PCIe hop, CPU-to-GPU) or relies on Unified Memory's page fault path, which has its own overhead. Two PCIe hops instead of one, plus host memory bandwidth as a bottleneck.
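
For contrast, here is a sketch of the bounce-buffer version of the same read, under the same assumptions (placeholder path, trimmed error handling): one DMA into pinned host memory, then a second PCIe copy into HBM.

    // The non-GDS path: NVMe -> pinned host memory -> HBM (two PCIe hops).
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        const size_t nbytes = 1 << 20;

        int fd = open("/mnt/nvme/shard-0000.bin", O_RDONLY);

        void *hostBuf = nullptr;
        cudaMallocHost(&hostBuf, nbytes);           // pinned bounce buffer in host memory

        void *devPtr = nullptr;
        cudaMalloc(&devPtr, nbytes);

        // Hop 1: NVMe -> host memory via the normal kernel read path.
        ssize_t got = read(fd, hostBuf, nbytes);

        // Hop 2: host memory -> HBM over PCIe.
        cudaMemcpy(devPtr, hostBuf, got, cudaMemcpyHostToDevice);

        cudaFree(devPtr);
        cudaFreeHost(hostBuf);
        close(fd);
        return 0;
    }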

Why this matters at training scale

A modern PFS-backed training cluster can sustain roughly 100 GB/s of aggregate read into GPUs when GDS is enabled. Without GDS, the same cluster tops out at around 50 GB/s because host memory bandwidth and the CPU-to-GPU PCIe hop become the bottleneck. The 2x is real and measurable; NVIDIA publishes benchmarks at this ratio, and large-scale training runs have reported similar speedups.

The bandwidth math: a Sapphire Rapids server has roughly 300-400 GB/s of host memory bandwidth, which sounds like a lot until you realize that every byte of training data has to pass through it twice (into host memory from the NVMe, back out to the GPU). For 8 GPUs each pulling 12 GB/s of data, that is 192 GB/s of host memory bandwidth consumed just to move data, which competes with everything else the host wants to do (gradient sync staging, async checkpointing, framework metadata).

GDS bypasses all of that. The NVMe controller and the GPU communicate directly over the PCIe switch, leaving host memory free for everything else.

What it requires

GDS is finicky about setup, much like GPUDirect RDMA. The NVMe drive and the GPU must be on the same PCIe root complex (or behind a PCIe switch with peer-to-peer enabled). The kernel needs the IOMMU configured in passthrough mode (or with explicit allow rules for the GPU's BAR1 region). The cuFile driver and the nvidia-fs kernel module must be loaded. The filesystem driver must be GDS-aware (most modern parallel filesystems and recent XFS and ext4 are; older NFS clients are not). Finally, the application must use the cuFile API rather than POSIX read; PyTorch, TensorFlow, and NVIDIA DALI all support cuFile pass-through.

When any of these is misconfigured, cuFile silently falls back to the bounce path, and the only symptom is half the throughput. NVIDIA's gdscheck tool reports which of these prerequisites are met; running it during cluster acceptance is a 10-minute sanity check that catches most issues.
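
A lightweight programmatic probe can complement gdscheck, assuming the same cuFile headers as above: if cuFileDriverOpen fails, GDS cannot be in use. A successful open does not by itself prove the direct path is taken, so gdscheck and cufile.log remain the authoritative check.

    // Acceptance-time probe: can the cuFile driver even come up on this node?
    #include <cufile.h>
    #include <cstdio>

    int main() {
        CUfileError_t status = cuFileDriverOpen();
        if (status.err != CU_FILE_SUCCESS) {
            printf("cuFileDriverOpen failed (err=%d): expect silent bounce-buffer fallback\n",
                   (int)status.err);
            return 1;
        }
        printf("cuFile driver is up; confirm the data path with gdscheck and cufile.log\n");
        cuFileDriverClose();
        return 0;
    }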

What this means in practice

  • For any PFS-backed training cluster, GDS is the difference between feeding the GPUs at line rate and feeding them at half rate. It is the assumption that performance benchmarks bake in.
  • The cuFile API is what frameworks call to use it. PyTorch's torch.utils.data does not use cuFile directly, but the NVIDIA DALI (Data Loading Library) wrapper does, and most production training pipelines pipe through DALI for this reason.
  • For storage choice, GDS works on local NVMe, on NVMe-oF over RDMA, and on parallel filesystems (Lustre, WekaFS, GPFS, BeeGFS, DAOS) when paired with a GDS-aware client. See parallel filesystems for AI.
  • For inter-node storage paths, the analog is NVMe over Fabrics, which extends NVMe verbs over RDMA and pairs naturally with GDS.
  • For checkpoint writes, GDS works the other direction (HBM to NVMe) and is a meaningful win for sharded checkpoint writes when each rank writes its own GB of state concurrently; see the write sketch after this list.
  • Debug GDS with gdscheck, nvidia-smi -q | grep BAR1, and the cufile.log if your application enables it. cuFile emits a log line per file open indicating whether GDS or fallback path is in use.
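
As a sketch of that write direction, the fragment below writes one rank's shard from HBM to a local NVMe file with cuFileWrite; the path, buffer, and size are placeholders, and driver open plus error handling are assumed to happen elsewhere.

    // Hypothetical per-rank shard writer: HBM -> NVMe, no host staging copy.
    #include <cufile.h>
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t write_shard(const void *devPtr, size_t nbytes, const char *path) {
        int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);

        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);

        // DMA the checkpoint shard straight out of GPU memory to the drive.
        ssize_t written = cuFileWrite(handle, devPtr, nbytes,
                                      /*file_offset=*/0, /*devPtr_offset=*/0);

        cuFileHandleDeregister(handle);
        close(fd);
        return written;
    }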

GDS is one of those features where the cost of having it is roughly zero (a few hours of cluster setup) and the cost of not having it is a 2x throughput haircut you may not notice until your training run is already weeks behind.

Updated 2026-05-10