
Lustre vs WekaFS

Lustre is the open-source veteran with metadata server + OSS layers and POSIX semantics. WekaFS is the high-IOPS challenger with distributed metadata and an NVMe-only pool.
At a glance:

  • Lustre: MDS + OSS, POSIX, mature, open.
  • WekaFS: distributed metadata, NVMe pool, commercial.
  • Pick: Lustre for cost; WekaFS for IOPS-bound shuffles.

For a long time, "parallel filesystem for HPC" meant Lustre. Then GPFS, BeeGFS, and DAOS each carved out a share. WekaFS arrived later, designed specifically for the IOPS demands of training workloads with millions of small files. The two that show up most in production AI clusters today are Lustre and WekaFS, and they are different enough that picking the wrong one for your workload costs you 2-5x in real bandwidth.

Architectural starting points

Lustre is a layered design. There is a metadata server tier (MDS) holding the directory tree and inode metadata, and a separate object storage server tier (OSS) holding the file data. Clients consult the MDS to open a file and learn which OSS targets hold which stripes, then read directly from those OSS targets. The MDS and OSS are separate machines (or pools), separately scaled and independently failure-prone.
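The stripe layout the client learns from the MDS is mechanical enough to sketch. The following is a simplified model of Lustre-style round-robin striping, not the actual client code; the stripe size and OST list are illustrative values, not read from a real layout.

```python
# Simplified sketch of round-robin striping: given a file's stripe settings,
# find which OST object holds a given byte offset.

def locate_stripe(offset: int, stripe_size: int, ost_ids: list[int]) -> tuple[int, int]:
    """Return (ost_id, offset_within_object) for a byte offset in the file."""
    stripe_index = offset // stripe_size          # which stripe the byte falls in
    ost = ost_ids[stripe_index % len(ost_ids)]    # round-robin across the OSTs
    # Offset inside the object on that OST: each full round across the OSTs
    # contributes one stripe, plus the remainder within the current stripe.
    within = (stripe_index // len(ost_ids)) * stripe_size + offset % stripe_size
    return ost, within

# A file striped 1 MiB-wide across 4 OSTs: byte 5 MiB is in stripe 5, which
# round-robins onto the second OST, 1 MiB into that object.
print(locate_stripe(5 * 2**20, 2**20, [0, 1, 2, 3]))  # -> (1, 1048576)
```

Once the client has this mapping, every read and write bypasses the MDS entirely, which is why Lustre's data path scales with OST count while its metadata path does not.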

WekaFS has no separate metadata tier. Every storage node holds a slice of the metadata and a slice of the data. The metadata is distributed via a consistent hash; opens, lookups, and stat calls land on whichever node owns that piece of the namespace. Reads and writes go directly to the data shards, also distributed.
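A minimal sketch of hash-based metadata placement, in the spirit of WekaFS's distributed namespace but not its actual algorithm: each path hashes deterministically to one owning node, so opens and stats for different files spread across the cluster instead of funneling through one server. (A production consistent-hash ring also minimizes reshuffling when nodes join or leave, which this toy version omits.)

```python
import hashlib

def metadata_owner(path: str, nodes: list[str]) -> str:
    """Hash a path to the node that owns its metadata (illustrative only)."""
    digest = hashlib.sha256(path.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = ["node0", "node1", "node2", "node3"]
for p in ["/data/img_000001.jpg", "/data/img_000002.jpg", "/ckpt/step_1000.pt"]:
    print(p, "->", metadata_owner(p, nodes))
```

The point of the sketch: with millions of distinct paths, the hash spreads metadata load roughly uniformly, so no single node sees the open storm that a lone MDS would.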

[Diagram: Lustre's layered design (MDS for metadata, OSS 0-3 for data; POSIX, mature, open; MDS hot-spot risk on opens; HDD or NVMe OSTs) versus WekaFS's uniform nodes 0-3, each holding metadata + data (distributed metadata + NVMe pool; no single-point hot spot; high IOPS per cluster). Caption: Lustre for cost and openness; WekaFS for IOPS-bound shuffles and small-file metadata.]

The architecture difference shows up most clearly in metadata-heavy workloads. Image-classification training with one-file-per-image opens millions of files per epoch. Lustre with a single MDS bottlenecks on mdt_intent_open calls; the symptom is GPU utilization dropping to 30-40% during data loading. Lustre's DNE (Distributed Namespace) feature splits the namespace across multiple metadata targets (MDTs), which helps but does not eliminate hot-spotting. WekaFS distributes metadata by design and tolerates the small-file pattern much better.

Bandwidth and IOPS comparison

For sequential bandwidth on similarly-sized clusters:

  • Lustre on HDD-backed OSTs: roughly 2-5 GB/s per OST, scales linearly. 16 OSTs give 32-80 GB/s aggregate.
  • Lustre on NVMe-backed OSTs (newer deployments): roughly 8-15 GB/s per OST. 16 OSTs give 128-240 GB/s.
  • WekaFS NVMe pool: roughly 5-10 GB/s per node, scales linearly. 32 nodes give 160-320 GB/s.
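The aggregate figures above are just the per-target ranges multiplied out, which is worth making explicit because linear scaling is the assumption both designs sell. A trivial back-of-envelope helper, using the article's rough numbers:

```python
def aggregate(per_target_gbs: tuple[float, float], n: int) -> tuple[float, float]:
    """Scale a per-target (low, high) GB/s range by target count, assuming
    linear scaling -- real clusters fall short of this under contention."""
    lo, hi = per_target_gbs
    return lo * n, hi * n

print(aggregate((2, 5), 16))    # Lustre HDD, 16 OSTs   -> (32, 80) GB/s
print(aggregate((8, 15), 16))   # Lustre NVMe, 16 OSTs  -> (128, 240) GB/s
print(aggregate((5, 10), 32))   # WekaFS, 32 nodes      -> (160, 320) GB/s
```

The linearity assumption holds only while clients stripe evenly and the fabric has headroom; a hot OST or oversubscribed spine breaks it first.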

For random-read IOPS (4K-256K mixed):

  • Lustre on HDD: roughly 5K-20K IOPS per OST. Sustained random read is much weaker than sequential.
  • Lustre on NVMe: roughly 100K-500K IOPS per OST. A real number, but still topology-bound.
  • WekaFS: 1M+ IOPS per node, sustained. The whole point of the design.

For metadata operations (open + stat per second):

  • Lustre single MDS: 50K-200K ops/sec. Hot-spot risk.
  • Lustre with DNE: 200K-1M ops/sec across multiple MDS.
  • WekaFS: 1M+ ops/sec, distributed naturally.
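The metadata numbers translate directly into wall-clock stall time per epoch. A rough worked example under illustrative assumptions (10M-file dataset, two metadata ops per file for open + stat), using the ops/sec ranges above:

```python
def metadata_seconds(files_per_epoch: int, ops_per_file: int, md_ops_per_sec: float) -> float:
    """Seconds per epoch spent purely on metadata ops (illustrative estimate)."""
    return files_per_epoch * ops_per_file / md_ops_per_sec

files = 10_000_000  # hypothetical one-file-per-image dataset

print(metadata_seconds(files, 2, 100_000))    # single MDS mid-range: 200 s/epoch
print(metadata_seconds(files, 2, 1_000_000))  # DNE/WekaFS high end:   20 s/epoch
```

Two hundred seconds of serialized metadata work per epoch is exactly the GPU-idle window described earlier; an order of magnitude more metadata throughput shrinks it to noise.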

When each one is right

Lustre is right when:

  • You have a budget constraint and existing HPC operational experience.
  • Your workload is bandwidth-heavy with large files (pre-sharded TFRecord, WebDataset).
  • You can tolerate some metadata-side hot-spotting or run DNE.
  • You value open-source and vendor-independent operation.

WekaFS is right when:

  • Your workload is metadata-heavy (many small files, long directory trees).
  • You need consistent low latency on opens and stats.
  • IOPS-bound shuffle is the bottleneck (vision training with one-image-per-file).
  • The cost of WekaFS licensing is lower than the engineering cost of pre-packing your dataset into large shards.

Other PFS options worth knowing

  • GPFS / IBM Spectrum Scale: mature, similar architecture to Lustre with stronger client cache coherence. Common in IBM-led AI clusters.
  • BeeGFS: open-source, simpler than Lustre, good for small-to-medium scale.
  • DAOS: Intel-led, NVMe and storage-class-memory focused, with natively distributed metadata. Fast but newer; smaller production base.

Operational considerations

Lustre has 20+ years of production scars. The tooling (lfs, lctl, lustre_rsync) is well-known. Failures and recovery are well-documented. The on-call burden is real but the playbook is mature.

WekaFS is a younger ecosystem. The tooling is good but less universally known. Vendor support is the path of least resistance for incident response. Snapshots, replication, and tiering features are first-class.

Storage failures also interact with drain-and-replace at the GPU-server level. A PFS that holds the active dataset and active checkpoints needs replication, and needs recovery fast enough that training jobs do not stall waiting on storage. Both Lustre and WekaFS support replication (Lustre via FLR / file-level mirroring, WekaFS via distributed erasure coding); the operational discipline to actually use it is what differs.

What this means in practice

  • Pre-pack datasets into large shards (100 MB-1 GB each) when running on Lustre. WebDataset, TFRecord, and MosaicML StreamingDataset all do this. The packing investment pays back as 5-10x faster training.
  • For metadata-heavy workloads on Lustre, enable DNE if available. If your cluster vendor does not support DNE, accept that small-file workloads will hit a wall around a few hundred GPUs.
  • For WekaFS, the operational discipline is around capacity. WekaFS pools fill up faster than Lustre because everything is on NVMe; budget capacity planning carefully.
  • Both can use GPUDirect Storage (cuFile-aware clients exist for both). Verify that your cluster's PFS client is GDS-enabled before assuming throughput numbers.
  • The fabric matters. Both PFS designs run on RDMA (IB or RoCE), and both need bandwidth budget on the same fabric as the compute. See topology-aware placement and rail-optimized fat-tree.
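The pre-packing step from the first bullet can be sketched concretely. This is a minimal tar-shard packer in the spirit of the WebDataset layout, under assumed paths and a default 500 MB shard target; production tools (WebDataset's own writers, MosaicML's) handle keys, compression, and resumability that this sketch omits.

```python
import os
import tarfile

def pack_shards(src_dir: str, out_dir: str, shard_bytes: int = 500 * 2**20) -> None:
    """Pack loose files from src_dir into sequential tar shards in out_dir.

    Shard size is tracked by payload bytes only (ignores tar overhead).
    Sequential large writes are the access pattern Lustre OSTs are best at.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, current, tar = 0, 0, None
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        size = os.path.getsize(path)
        if tar is None or current + size > shard_bytes:
            if tar:
                tar.close()
            tar = tarfile.open(os.path.join(out_dir, f"shard-{shard_idx:06d}.tar"), "w")
            shard_idx, current = shard_idx + 1, 0
        tar.add(path, arcname=name)
        current += size
    if tar:
        tar.close()
```

The payoff is the trade the whole article is about: one open per 500 MB shard instead of one open per image, turning a metadata-bound workload into a bandwidth-bound one.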

The choice is rarely "which is faster". It is "which matches my workload's pain point". Bandwidth-heavy plus large files: Lustre. Metadata-heavy plus small files: WekaFS.


Updated 2026-05-10