
Parallel Filesystems for AI Training

Parallel filesystems (Lustre, GPFS, WekaFS, BeeGFS, DAOS) stripe data across many servers so aggregate bandwidth scales linearly with server count. They become necessary above roughly 10 GB/s of sustained read per cluster.
  • Stripe: data spread across N OSS / object servers
  • Aggregate BW: linear in the number of OSS, ~ 10 GB/s each
  • Required: above ~ 10 GB/s sustained read per cluster

A single NVMe SSD pushes about 14 GB/s sequential read on a good day. A 1024-GPU cluster wants several hundred GB/s of sustained dataset read. The arithmetic does not work out for any local-storage solution. The fix is to spread the data across many servers and have many clients read concurrent slices. That is a parallel filesystem.

What striping buys you

A parallel filesystem (PFS) chunks each file into stripes and distributes the stripes across N storage servers (called OSS in Lustre, NSDs in GPFS, just "storage nodes" in WekaFS and DAOS). Each client mounting the filesystem can read different stripes from different servers in parallel, and the aggregate bandwidth scales linearly with N. If each storage server can sustain 10 GB/s, then 8 servers give you 80 GB/s aggregate, 16 servers give 160 GB/s, and 64 servers give 640 GB/s. The math is brutally simple, which is why every large training cluster in production runs some form of PFS.
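As a rough sketch of that math, the snippet below models round-robin striping and the aggregate-bandwidth ceiling. The 4 MiB stripe size, 8 servers, and 10 GB/s per server are illustrative assumptions, not a recommendation for any particular PFS.

```python
# Illustrative round-robin striping (Lustre-style default layout).
# Stripe size, server count, and per-server bandwidth are assumed values.
STRIPE_SIZE = 4 * 2**20   # 4 MiB stripes
N_OSS = 8                 # storage servers
PER_OSS_BW_GBPS = 10      # assumed sustained GB/s per server

def oss_for_offset(byte_offset: int,
                   stripe_size: int = STRIPE_SIZE,
                   n_oss: int = N_OSS) -> int:
    """Which storage server holds the stripe containing this file offset."""
    stripe_index = byte_offset // stripe_size
    return stripe_index % n_oss

# A client reading a 1 GiB file touches every OSS, so the read can be
# served by all of them in parallel.
touched = {oss_for_offset(off) for off in range(0, 2**30, STRIPE_SIZE)}
print(f"OSS used: {sorted(touched)}")                            # all 8 servers
print(f"aggregate BW ceiling: {N_OSS * PER_OSS_BW_GBPS} GB/s")   # 80 GB/s
```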

[Figure: stripes s0–s7 of a file distributed across OSS 0–3 and read in parallel by GPU 0–3. Aggregate read BW = N OSS × per-server BW; adding OSS gives linear scaling.]

The metadata side is where things get interesting. Lustre and GPFS use one or more dedicated metadata servers (MDS / MDS pool) that hold the directory tree and inode information. WekaFS, BeeGFS, and DAOS distribute metadata across the same nodes that hold data. Both approaches work; they have different scaling and failure characteristics. See Lustre vs WekaFS for the head-to-head.

When you need a PFS

The threshold is roughly 10 GB/s sustained read per cluster. Below that, a few NVMe drives in each GPU server (shared over NFS or BeeGFS-On-Demand) are enough. Above that, you need a real PFS. For a 1024-GPU H100 cluster running a typical dense LLM training mix, sustained read demand is roughly 100-200 GB/s; for MoE training with a high-IOPS shuffle, it can be 500 GB/s or more.
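The arithmetic behind that range is a one-liner. The per-GPU sample rate and sample size below are illustrative assumptions, not measurements; substitute your own profile.

```python
# Back-of-envelope sustained read demand; all inputs are illustrative.
n_gpus = 1024
samples_per_gpu_per_s = 40        # assumed per-GPU training throughput
bytes_per_sample = 4 * 2**20      # assumed 4 MiB read per sample

read_gbps = n_gpus * samples_per_gpu_per_s * bytes_per_sample / 1e9
print(f"sustained read demand ≈ {read_gbps:.0f} GB/s")   # ≈ 172 GB/s here
```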

The exact requirement depends on the workload. A vision transformer training on a 100M-image dataset is bandwidth-heavy. An LLM training on a tokenized text corpus is metadata-heavy (many small files, lots of opens). A chemistry foundation model on a custom format is somewhere in between. The PFS choice and tuning depends on which axis your workload stresses.

How a PFS feeds a training cluster

The standard topology is to put a PFS on its own set of storage nodes connected to the same fabric as the GPU compute. For an InfiniBand cluster, the storage servers connect via NDR or HDR ports to the same leaf switches as the compute (or to dedicated storage leaves, depending on the design). Reads come from the PFS to the GPU server's NIC, then through GPUDirect Storage into HBM. Writes (e.g., checkpoint sharding) flow the same way in reverse.
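From Python, one way to exercise that GPUDirect Storage path is kvikio, NVIDIA's wrapper around the cuFile API. A minimal sketch follows; the shard path and buffer size are placeholders, and if GDS is not configured on the node kvikio falls back to ordinary POSIX reads.

```python
# GPUDirect Storage read sketch using kvikio (cuFile wrapper).
# Path and sizes are hypothetical; this assumes a PFS mount at /pfs.
import cupy as cp
import kvikio

shard = "/pfs/dataset/shard-000123.bin"        # hypothetical file on the PFS mount
buf = cp.empty(256 * 2**20, dtype=cp.uint8)    # 256 MiB destination buffer in HBM

with kvikio.CuFile(shard, "r") as f:
    nbytes = f.read(buf)                       # PFS -> NIC -> GPU memory, no host staging
print(f"read {nbytes / 2**20:.0f} MiB directly into device memory")
```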

A common gotcha: the PFS shares the IB fabric with the collective traffic. If a checkpoint write happens during a training step, it competes with the gradient all-reduce for fabric bandwidth. Modern PFS implementations support QoS / rate limiting, and most production training jobs schedule checkpoints during quiet periods (after the optimizer step, before the next forward pass starts).
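One simple way to keep that write out of the collective-heavy window is to snapshot state right after the optimizer step and let a background thread do the slow PFS write. The sketch below assumes a generic state dict and a framework-provided save function; both are placeholders.

```python
# Sketch: snapshot training state after the optimizer step, then write the
# snapshot to the PFS from a background thread while the next step runs.
# `state` and `save_fn` are placeholders for whatever your framework provides.
import copy
import threading

def checkpoint_async(state: dict, save_fn, path: str) -> threading.Thread:
    snapshot = copy.deepcopy(state)     # fast relative to the PFS write itself
    writer = threading.Thread(target=save_fn, args=(snapshot, path), daemon=True)
    writer.start()
    return writer                       # join() it before taking the next checkpoint

# Usage inside a training loop (illustrative; torch.save is just one example save_fn):
#   optimizer.step()
#   if step % ckpt_interval == 0:
#       pending = checkpoint_async(model_state, torch.save, f"/pfs/ckpt/step{step}.pt")
```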

What stresses each PFS axis

  • Sequential read bandwidth: sum of per-server bandwidths, capped by the line rate of the storage fabric. Lustre, GPFS, and WekaFS can all saturate that cap given enough servers.
  • Random read IOPS: dominates when files are small (under 1 MB) and not cached. WekaFS's NVMe-everywhere design wins here; Lustre on HDD storage targets is much weaker.
  • Metadata operations: opens, stats, directory traversals. Distributed-metadata systems (WekaFS, DAOS) win for many small files; Lustre is improving with ZFS-backed MDS pools but is still limited by single-MDS hot spots without DNE (Distributed Namespace). The sketch after this list shows how sharply small files load this axis.
  • Checkpoint write throughput: sum of per-server write bandwidths. Sharded writes (checkpoint sharding) parallelize across all OSS.
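The toy estimator below makes the small-file contrast concrete: the same payload bandwidth delivered as tiny files versus 100 MiB packed shards lands on very different axes. File sizes, rates, and the 1 MiB I/O granularity are all illustrative assumptions.

```python
# Toy per-axis demand estimator; inputs and the 1 MiB I/O size are assumptions,
# not measurements of any particular PFS.
def axis_demand(files_per_s: float, avg_file_bytes: float, io_bytes: int = 2**20) -> dict:
    ios_per_file = max(1, avg_file_bytes // io_bytes)      # reads issued per file
    return {
        "seq_read_GB_per_s": round(files_per_s * avg_file_bytes / 1e9, 1),
        "read_IOPS": int(files_per_s * ios_per_file),
        "metadata_ops_per_s": int(files_per_s),            # ~1 open/stat per file
    }

# Same ~30 GB/s of payload, as 150 KiB files vs. 100 MiB packed shards:
print(axis_demand(files_per_s=200_000, avg_file_bytes=150 * 1024))
print(axis_demand(files_per_s=300, avg_file_bytes=100 * 2**20))
```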

What this means in practice

  • Default for any cluster over 256 GPUs: provision a PFS sized for your peak sustained read demand, typically 1-2x the per-rack compute fabric bandwidth.
  • Number of storage servers: aggregate bandwidth target divided by per-server sustained bandwidth. For 200 GB/s target and 10 GB/s per server, that is 20 storage servers. Round up for redundancy.
  • Pair with GPUDirect Storage for a full CPU-bypass read path into HBM; DALI's readers can use it, and the analogous network-side bypass for NCCL collectives is GPUDirect RDMA.
  • For metadata pressure (many small files, e.g. tokenized text datasets or image classification with one image per file): pre-pack into shards of roughly 100 MB each (TFRecord, MosaicML StreamingDataset, WebDataset) so the PFS sees a few thousand large files instead of a few million tiny ones; a packing sketch follows this list.
  • For topology-aware placement: storage nodes typically attach to dedicated storage leaves so they do not compete with compute traffic, but they still share the spine. Bisection bandwidth budgeting matters.
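One way to do that pre-packing is WebDataset's ShardWriter, which starts a new tar shard once a size cap is reached. A minimal sketch, with the source/destination paths and the ~100 MB cap as illustrative assumptions:

```python
# Pack many small image files into ~100 MB WebDataset tar shards so the PFS
# serves a few thousand large sequential files instead of millions of tiny ones.
# Paths and the size cap are placeholders.
import glob
import os
import webdataset as wds

SRC = "/pfs/raw/images"                   # hypothetical one-image-per-file tree
DST = "/pfs/packed/shard-%06d.tar"        # output shard pattern on the PFS

with wds.ShardWriter(DST, maxsize=100 * 10**6) as sink:   # roll over at ~100 MB
    for path in sorted(glob.glob(os.path.join(SRC, "*.jpg"))):
        key = os.path.splitext(os.path.basename(path))[0]
        with open(path, "rb") as f:
            sink.write({"__key__": key, "jpg": f.read()})
```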

A parallel filesystem is the layer that makes "throw more data at it" a feasible strategy. Without it, you are back to local NVMe and the bandwidth ceiling is per-server, not per-cluster.

Updated 2026-05-10