Storage
Feed the GPUs without starving them. Storage is a memory hierarchy: HBM at the top, NVMe and parallel filesystems in the middle, object stores at the bottom. Reads stream up the hierarchy on every epoch; writes flow down on every checkpoint. The job is to match each data lifecycle to the cheapest tier that still keeps the GPUs fed.
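As a toy illustration of tier matching, here is a chooser that picks the cheapest tier still meeting a required sustained read rate. All bandwidth and cost figures are placeholder assumptions for the sketch, not benchmarks or vendor numbers:

```python
# Illustrative storage tiers; per-node bandwidth (GB/s) and relative
# cost per TB are placeholder assumptions, not measured figures.
TIERS = [
    # (name, read_bandwidth_gbps, relative_cost_per_tb)
    ("object_store",    2,    1),  # S3-class: cheap, slow per client
    ("local_nvme",     10,   15),  # per-node NVMe
    ("parallel_fs",    20,    8),  # Lustre/Weka-class striped reads
    ("hbm",          3000, 4000),  # on-package GPU memory
]

def cheapest_tier(required_gbps: float) -> str:
    """Cheapest tier whose bandwidth still covers the required read rate."""
    fast_enough = [t for t in TIERS if t[1] >= required_gbps]
    if not fast_enough:
        raise ValueError("no single tier is fast enough; stripe wider")
    return min(fast_enough, key=lambda t: t[2])[0]

print(cheapest_tier(1.5))  # object_store
print(cheapest_tier(15))   # parallel_fs
```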
Checkpoint Sharding
Sharded checkpointing has each rank write its own slice of model state in parallel. A 1 TB checkpoint at DP=64 becomes ~16 GB per rank, written concurrently, so the checkpoint stall shrinks nearly in proportion to the rank count until the filesystem saturates.
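A minimal per-rank sketch in PyTorch, assuming torch.distributed is already initialized. The round-robin ownership rule here is illustrative; real frameworks (FSDP, DeepSpeed) shard by actual parameter ownership:

```python
# Each rank writes only the tensors it owns, so writes proceed in parallel.
import os
import torch
import torch.distributed as dist

def save_sharded(model: torch.nn.Module, step: int, ckpt_dir: str) -> None:
    rank = dist.get_rank()
    world = dist.get_world_size()
    os.makedirs(ckpt_dir, exist_ok=True)
    # Illustrative sharding rule: rank r owns every world-th tensor.
    shard = {
        name: tensor
        for i, (name, tensor) in enumerate(model.state_dict().items())
        if i % world == rank
    }
    torch.save(shard, os.path.join(ckpt_dir, f"step{step}-rank{rank:05d}.pt"))
    dist.barrier()  # the checkpoint is complete only once every shard landed
```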
Dataset Shuffling at Scale
Streaming shuffle algorithms approximate a global random shuffle without holding the dataset in memory. Shard-and-buffer (shuffle the shard order globally, then buffer-shuffle within each shard) and window-shuffle keep IO predictable across epochs.
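A minimal window-shuffle sketch: hold only buffer_size examples in memory, emit a random one, and refill from the stream. Larger buffers approximate a global shuffle more closely at the cost of memory:

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def window_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    rng = random.Random(seed)
    buf: list[T] = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]  # swap-and-pop: O(1) removal
            yield buf.pop()
    rng.shuffle(buf)  # drain the tail at end of stream
    yield from buf

# Usage: shuffle a stream without ever materializing it.
print(list(window_shuffle(range(10), buffer_size=4, seed=42)))
```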
GPUDirect Storage
GPUDirect Storage (GDS) lets an NVMe drive DMA training tensors straight into GPU HBM via the cuFile API, bypassing the CPU bounce buffer. Expect roughly 2x the throughput of a conventional read staged through host memory.
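A sketch using kvikio, one Python binding to cuFile (the file path and buffer size are illustrative). When GDS is unavailable, kvikio falls back to a POSIX read path:

```python
# Read a shard from NVMe directly into a GPU buffer via cuFile.
import cupy
import kvikio

buf = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)  # 256 MB HBM buffer
f = kvikio.CuFile("/data/shard-00000.bin", "r")        # illustrative path
nbytes = f.read(buf)  # DMA into device memory, no host bounce copy
f.close()
print(f"read {nbytes} bytes directly into HBM")
```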
Lustre vs WekaFS
Lustre is the open-source veteran, built from metadata servers (MDS) and object storage servers (OSS) with POSIX semantics. WekaFS is the high-IOPS challenger, with fully distributed metadata and an NVMe-only pool.
NVMe over Fabrics
NVMe-oF exposes remote NVMe drives over RDMA, so applications issue the same NVMe command set they would to a local SSD. The fabric adds roughly 1-2 µs of latency; bandwidth is fabric-limited.
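To put the 1-2 µs in context, a back-of-envelope check; the 80 µs flash read latency below is an illustrative assumption for NAND media, not a spec:

```python
# The fabric hop is small relative to the flash media latency itself.
local_read_us = 80.0  # assumed NAND read latency (illustrative)
fabric_hop_us = 2.0   # worst end of the +1-2 us range
print(f"{fabric_hop_us / local_read_us:.1%} added latency")  # 2.5%
```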
Parallel Filesystems for AI Training
Parallel filesystems (Lustre, GPFS, WekaFS, BeeGFS, DAOS) stripe data across many servers so aggregate bandwidth scales linearly with server count. They become necessary above ~10 GB/s of sustained read, beyond what a single file server can deliver.
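A back-of-envelope sizing sketch for that linear scaling; the 3 GB/s per-server read figure is an assumption for illustration:

```python
# How many storage servers to stripe across for a target read rate.
import math

def servers_needed(target_gbps: float, per_server_gbps: float = 3.0) -> int:
    return math.ceil(target_gbps / per_server_gbps)

print(servers_needed(10))  # 4 servers at an assumed 3 GB/s each
print(servers_needed(48))  # 16 servers
```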
S3 Tier for Training
Object stores hold petabytes cheaply but deliver only ~1-5 GB/s per client connection. A warm NVMe cache fronts S3 for active epochs; cold tiers stay in S3.
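A minimal sketch of the warm-cache pattern, assuming boto3; the /nvme/cache mount point and bucket layout are illustrative:

```python
# Serve active-epoch shards from NVMe; pull from S3 only on a cache miss.
import os
import boto3

s3 = boto3.client("s3")
CACHE_ROOT = "/nvme/cache"  # assumed local NVMe mount

def fetch(bucket: str, key: str) -> str:
    """Return a local path, downloading from S3 only if not cached."""
    local = os.path.join(CACHE_ROOT, bucket, key)
    if not os.path.exists(local):
        os.makedirs(os.path.dirname(local), exist_ok=True)
        s3.download_file(bucket, key, local)  # cold path: S3-limited
    return local  # warm path: NVMe-speed reads on later epochs
```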