Storage
Feed the GPUs without starving them. Storage is a memory hierarchy: HBM at the top, NVMe and parallel filesystems in the middle, object stores at the bottom. Reads stream up the hierarchy on every epoch; writes flow down on every checkpoint. The job is to match each data lifecycle to the cheapest tier that still keeps the GPUs fed.
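As a toy illustration of tier matching, here is a chooser that picks the cheapest tier still meeting a required sustained read rate. All bandwidth and cost figures are placeholder assumptions for the sketch, not benchmarks or vendor numbers:

```python
# Illustrative storage tiers; per-node bandwidth (GB/s) and relative
# cost per TB are placeholder assumptions, not measured figures.
TIERS = [
    # (name, read_bandwidth_gbps, relative_cost_per_tb)
    ("object_store",    2,    1),  # S3-class: cheap, slow per client
    ("local_nvme",     10,   15),  # per-node NVMe
    ("parallel_fs",    20,    8),  # Lustre/Weka-class striped reads
    ("hbm",          3000, 4000),  # on-package GPU memory
]

def cheapest_tier(required_gbps: float) -> str:
    """Cheapest tier whose bandwidth still covers the required read rate."""
    fast_enough = [t for t in TIERS if t[1] >= required_gbps]
    if not fast_enough:
        raise ValueError("no single tier is fast enough; stripe wider")
    return min(fast_enough, key=lambda t: t[2])[0]

print(cheapest_tier(1.5))  # object_store
print(cheapest_tier(15))   # parallel_fs
```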
Checkpoint Sharding
Sharded checkpointing has each rank write its own slice of model state in parallel. A 1 TB checkpoint at DP=64 becomes ~16 GB per rank, written concurrently, so the checkpoint stall shrinks nearly in proportion to the rank count until the filesystem saturates.
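A minimal per-rank sketch in PyTorch, assuming torch.distributed is already initialized. The round-robin ownership rule here is illustrative; real frameworks (FSDP, DeepSpeed) shard by actual parameter ownership:

```python
# Each rank writes only the tensors it owns, so writes proceed in parallel.
import os
import torch
import torch.distributed as dist

def save_sharded(model: torch.nn.Module, step: int, ckpt_dir: str) -> None:
    rank = dist.get_rank()
    world = dist.get_world_size()
    os.makedirs(ckpt_dir, exist_ok=True)
    # Illustrative sharding rule: rank r owns every world-th tensor.
    shard = {
        name: tensor
        for i, (name, tensor) in enumerate(model.state_dict().items())
        if i % world == rank
    }
    torch.save(shard, os.path.join(ckpt_dir, f"step{step}-rank{rank:05d}.pt"))
    dist.barrier()  # the checkpoint is complete only once every shard landed
```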
Dataset Shuffling at Scale
Streaming shuffle algorithms approximate a global random shuffle without holding the dataset in memory. Shard-and-buffer (shuffle the shard order globally, then buffer-shuffle within each shard) and window-shuffle keep IO predictable across epochs.
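A minimal window-shuffle sketch: hold only buffer_size examples in memory, emit a random one, and refill from the stream. Larger buffers approximate a global shuffle more closely at the cost of memory:

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def window_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    rng = random.Random(seed)
    buf: list[T] = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]  # swap-and-pop: O(1) removal
            yield buf.pop()
    rng.shuffle(buf)  # drain the tail at end of stream
    yield from buf

# Usage: shuffle a stream without ever materializing it.
print(list(window_shuffle(range(10), buffer_size=4, seed=42)))
```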
GPUDirect Storage
GPUDirect Storage (GDS) lets an NVMe drive DMA training tensors straight into GPU HBM via the cuFile API, bypassing the CPU bounce buffer. Expect roughly 2x the throughput of a conventional read staged through host memory.
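A sketch using kvikio, one Python binding to cuFile (the file path and buffer size are illustrative). When GDS is unavailable, kvikio falls back to a POSIX read path:

```python
# Read a shard from NVMe directly into a GPU buffer via cuFile.
import cupy
import kvikio

buf = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)  # 256 MB HBM buffer
f = kvikio.CuFile("/data/shard-00000.bin", "r")        # illustrative path
nbytes = f.read(buf)  # DMA into device memory, no host bounce copy
f.close()
print(f"read {nbytes} bytes directly into HBM")
```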
Lustre vs WekaFS
Lustre is the open-source veteran, built from metadata servers (MDS) and object storage servers (OSS) with POSIX semantics. WekaFS is the high-IOPS challenger, with fully distributed metadata and an NVMe-only pool.
NVMe over Fabrics
NVMe-oF exposes remote NVMe drives over RDMA, so applications issue the same NVMe command set they would to a local SSD. The fabric adds roughly 1-2 µs of latency; bandwidth is fabric-limited.
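To put the 1-2 µs in context, a back-of-envelope check; the 80 µs flash read latency below is an illustrative assumption for NAND media, not a spec:

```python
# The fabric hop is small relative to the flash media latency itself.
local_read_us = 80.0  # assumed NAND read latency (illustrative)
fabric_hop_us = 2.0   # worst end of the +1-2 us range
print(f"{fabric_hop_us / local_read_us:.1%} added latency")  # 2.5%
```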
Parallel Filesystems for AI Training
Parallel filesystems (Lustre, GPFS, WekaFS, BeeGFS, DAOS) stripe data across many servers so aggregate bandwidth scales linearly with server count. They become necessary above ~10 GB/s of sustained read, beyond what a single file server can deliver.
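A back-of-envelope sizing sketch for that linear scaling; the 3 GB/s per-server read figure is an assumption for illustration:

```python
# How many storage servers to stripe across for a target read rate.
import math

def servers_needed(target_gbps: float, per_server_gbps: float = 3.0) -> int:
    return math.ceil(target_gbps / per_server_gbps)

print(servers_needed(10))  # 4 servers at an assumed 3 GB/s each
print(servers_needed(48))  # 16 servers
```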
S3 Tier for Training
Object stores hold petabytes cheaply but deliver only ~1-5 GB/s per client connection. A warm NVMe cache fronts S3 for active epochs; cold tiers stay in S3.
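A minimal sketch of the warm-cache pattern, assuming boto3; the /nvme/cache mount point and bucket layout are illustrative:

```python
# Serve active-epoch shards from NVMe; pull from S3 only on a cache miss.
import os
import boto3

s3 = boto3.client("s3")
CACHE_ROOT = "/nvme/cache"  # assumed local NVMe mount

def fetch(bucket: str, key: str) -> str:
    """Return a local path, downloading from S3 only if not cached."""
    local = os.path.join(CACHE_ROOT, bucket, key)
    if not os.path.exists(local):
        os.makedirs(os.path.dirname(local), exist_ok=True)
        s3.download_file(bucket, key, local)  # cold path: S3-limited
    return local  # warm path: NVMe-speed reads on later epochs
```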