S3 Tier for Training

Object stores hold petabytes cheaply but ship at ~ 1-5 GB/s per connection. A warm NVMe cache fronts S3 for active epochs; cold tiers stay in S3.
  • Cost: ~ $20/TB-month S3 vs ~ $200/TB-month NVMe
  • Throughput: ~ 1-5 GB/s per connection sustained
  • Pattern: S3 cold, NVMe cache warm, prefetch staging

A 100 PB dataset is small at the cosmic scale of training runs and prohibitively expensive at the practical scale of NVMe SSDs. The numbers force a decision. NVMe at $200 per TB per month makes 100 PB cost $20M/month just for the storage, before electricity. S3 at $20 per TB per month makes the same dataset cost $2M/month. The 10x cost gap is exactly why every large training pipeline ends with object store at the bottom.

What S3 ships

The headline number for S3 (and its peers GCS, Azure Blob, OSS) is per-connection sustained throughput. A single connection from one EC2 instance to one S3 bucket sustains roughly 1-5 GB/s, depending on the instance type and S3's current rate-limiting state. With multiple parallel connections (the standard pattern), aggregate throughput can climb to tens or hundreds of GB/s, but the per-prefix request-rate limits eventually kick in (S3 throttles at roughly 3,500 PUT/s and 5,500 GET/s per prefix unless keys are sharded across prefixes).
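To make that concrete, here is a minimal sketch of the parallel-connection pattern, assuming boto3 and placeholder bucket/key names; production readers usually let the CRT-based transfer manager or a tool like s5cmd handle this rather than hand-rolling ranges.

```python
# Sketch: aggregate S3 throughput by issuing ranged GETs in parallel.
# Bucket and key are placeholders; part size and worker count are tunables.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "training-shards", "epoch-000/shard-000123.tar"   # placeholders
PART = 64 * 1024 * 1024                                         # 64 MiB ranges

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(off, min(off + PART, size) - 1) for off in range(0, size, PART)]

def fetch(rng):
    start, end = rng
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

# Each worker holds its own HTTP connection, so throughput scales until the
# per-prefix request-rate limits (or the NIC) become the bottleneck.
with ThreadPoolExecutor(max_workers=16) as pool:
    parts = dict(pool.map(fetch, ranges))

blob = b"".join(parts[start] for start, _ in ranges)
```

The point of the pattern is that aggregate throughput comes from many requests in flight, not from one fat connection, which is why the prefix-level request limits matter at all.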

[Figure: the storage tier pyramid. HBM at ~4.8 TB/s (expensive); NVMe cache at ~50 GB/s, $200/TB-month; S3/object at ~5 GB/s, $20/TB-month. Each tier is ~10x slower and ~10x cheaper than the one above. S3 holds the cold dataset; the NVMe cache fronts active epochs; HBM holds the active batch.]

The bandwidth gap between S3 and NVMe is roughly 10x. The bandwidth gap between NVMe and HBM is another 10x to 100x. So a three-tier hierarchy maps cleanly: S3 holds the cold dataset, NVMe holds the active epoch's worth, HBM holds the active batch. Each tier handles roughly the same wall-clock window of data flow, just at vastly different bandwidths and costs.

The cache-front pattern

The standard production pattern is "S3 plus warm cache". The dataset lives in S3 in its full form (raw images, tokenized text shards, video chunks, whatever). When a training run starts, a prefetcher pulls the next epoch's data from S3 into a warm cache (NVMe-backed parallel filesystem like WekaFS or Lustre, or a FUSE layer like Mountpoint-S3). The training process reads from the warm cache at 50+ GB/s. After the epoch, that cache slot can be freed.
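A minimal sketch of that staging loop, assuming boto3, a placeholder shard layout, and a hypothetical train_on_shard() consumer; real deployments use a PFS or Mountpoint-S3 rather than a hand-rolled prefetcher.

```python
# Sketch: double-buffered prefetch from S3 into a local NVMe cache directory.
# Bucket, cache path, and shard layout are placeholders.
import os
import threading
from queue import Queue

import boto3

BUCKET = "training-shards"                                        # placeholder bucket
CACHE_DIR = "/nvme/cache"                                         # warm NVMe cache mount
shards = [f"epoch-000/shard-{i:06d}.tar" for i in range(1000)]    # placeholder layout

s3 = boto3.client("s3")
ready = Queue(maxsize=8)    # bounds how far the prefetcher runs ahead of training

def prefetch():
    for key in shards:
        local = os.path.join(CACHE_DIR, os.path.basename(key))
        if not os.path.exists(local):                 # reuse anything already staged
            s3.download_file(BUCKET, key, local)      # S3 -> NVMe at object-store speed
        ready.put(local)                              # blocks while the window is full
    ready.put(None)                                   # sentinel: nothing left to stage

threading.Thread(target=prefetch, daemon=True).start()

while (path := ready.get()) is not None:
    train_on_shard(path)    # hypothetical consumer; reads at NVMe speed, not S3 speed
    os.remove(path)         # free the cache slot once the shard is consumed
```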

This pattern works because training reads the dataset roughly sequentially (per epoch). The cache only needs to hold the active window of the current epoch, not the whole dataset. For a 100 PB dataset read once over a 30-day run, that window is 100 PB / 30 ≈ 3.3 PB of warm data at any moment. That fits in a moderately sized PFS: 3.3 PB at $200/TB-month is about $660K/month, plus $2M/month for the 100 PB cold archive at $20/TB-month, roughly $2.7M/month total. Compare to $20M/month for an all-NVMe approach: the tiered layout is roughly 7-8x cheaper.
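Written out as arithmetic, with the assumed $/TB-month prices from this section:

```python
# Back-of-envelope monthly storage cost: tiered layout vs. all-NVMe.
DATASET_TB        = 100_000      # 100 PB in TB
RUN_DAYS          = 30           # one pass over the data in 30 days
S3_PER_TB_MONTH   = 20           # USD
NVME_PER_TB_MONTH = 200          # USD

cache_tb  = DATASET_TB / RUN_DAYS              # ~3,333 TB (~3.3 PB) warm at once
warm_cost = cache_tb * NVME_PER_TB_MONTH       # ~$0.67M/month for the NVMe cache
cold_cost = DATASET_TB * S3_PER_TB_MONTH       # $2.0M/month for the S3 archive
all_nvme  = DATASET_TB * NVME_PER_TB_MONTH     # $20M/month for everything on NVMe

print(f"tiered: ${warm_cost + cold_cost:,.0f}/mo   all-NVMe: ${all_nvme:,.0f}/mo   "
      f"ratio: {all_nvme / (warm_cost + cold_cost):.1f}x")
```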

Read patterns that hit and miss

The cache-front pattern works best when:

  • The dataset is read in epochs (most large training runs).
  • The order within an epoch is shuffleable but not random over the entire dataset (see dataset shuffling at scale).
  • Pre-shuffled "chunked" formats (WebDataset, MosaicML StreamingDataset, TFRecord) are used so S3 reads are sequential per chunk.
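The chunked-format point is worth making concrete. A minimal reading sketch, assuming the WebDataset library and a placeholder shard URL pattern; its pipe: convention streams each shard through a shell command (here the AWS CLI), so each shard is one large sequential GET rather than many small requests.

```python
# Sketch: stream pre-shuffled WebDataset shards straight from S3.
# Bucket, shard range, and field names are placeholders.
import webdataset as wds

urls = "pipe:aws s3 cp s3://training-shards/epoch-000/shard-{000000..000999}.tar -"

dataset = (
    wds.WebDataset(urls)
    .shuffle(1000)              # local shuffle within a small in-memory buffer
    .decode("pil")              # decode image entries with PIL
    .to_tuple("jpg", "cls")     # (image, label) pairs, keyed by file extension
)

for image, label in dataset:
    ...  # feed the training loop
```

The shuffle buffer only provides local randomness; global randomness comes from the pre-shuffled shard order, which is what keeps the S3 reads sequential.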

It works poorly when:

  • The dataset is randomly accessed across the entire 100 PB on every step. The cache miss rate is too high; you are effectively reading from S3 at 5 GB/s.
  • Files are tiny (under 1 MB each) and the per-request overhead of S3 dominates. Pre-pack into shards (see the packing sketch after this list).
  • The training is iteration-bound rather than throughput-bound (small batches, small models). The cache adds latency that you are not amortizing across enough work.
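A minimal packing sketch using only the standard library, with a placeholder source directory and a ~1 GB shard target; WebDataset's writer or TFRecord tooling does the same job with more features.

```python
# Sketch: pack many tiny sample files into ~1 GB tar shards before uploading.
# Source/destination paths and the shard-size target are placeholders.
import tarfile
from pathlib import Path

SRC = Path("/data/raw_samples")          # millions of small files
DST = Path("/data/shards")
TARGET = 1_000_000_000                   # ~1 GB per shard

DST.mkdir(parents=True, exist_ok=True)
shard_idx, shard_bytes, shard = 0, 0, None

for sample in sorted(SRC.rglob("*")):
    if not sample.is_file():
        continue
    if shard is None or shard_bytes >= TARGET:       # roll over to a new shard
        if shard is not None:
            shard.close()
        shard = tarfile.open(DST / f"shard-{shard_idx:06d}.tar", "w")
        shard_idx, shard_bytes = shard_idx + 1, 0
    shard.add(sample, arcname=str(sample.relative_to(SRC)))
    shard_bytes += sample.stat().st_size

if shard is not None:
    shard.close()
# Upload the resulting shards with the CLI or SDK; each shard then becomes
# one large sequential GET at training time instead of thousands of tiny ones.
```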

Mountpoint-S3, FUSE layers, and FS-fronting

A growing pattern is to expose S3 as a POSIX filesystem via a FUSE layer (Mountpoint-S3 from AWS, the open-source s3fs, custom GCS/OSS mounts). The application sees a normal directory tree; the FUSE layer handles the S3 calls underneath, with local caching. This avoids application-level changes when migrating from PFS to S3.

The catch is that FUSE layers add latency (a kernel bounce plus a user-space FUSE handler per call) and have inconsistent semantics for file locks, mtimes, and consistency guarantees. They work for read-heavy training pipelines but should not be used for production checkpoint writes; for those, use the S3 SDK directly or a checkpoint-aware library (see checkpoint sharding tools).
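A hedged sketch of the SDK-direct route for checkpoints, assuming boto3 and PyTorch, with placeholder paths, bucket, and key: write to the local PFS/NVMe first, then push with a multipart-capable upload, bypassing any FUSE mount.

```python
# Sketch: local checkpoint write, then SDK upload with multipart tuning.
# Paths, bucket, key, and the `model` object are placeholders from the
# surrounding training script.
import boto3
import torch
from boto3.s3.transfer import TransferConfig

CKPT_PATH = "/pfs/checkpoints/step_120000.pt"
BUCKET, KEY = "training-checkpoints", "run-42/step_120000.pt"

torch.save(model.state_dict(), CKPT_PATH)      # fast local write on PFS/NVMe

s3 = boto3.client("s3")
cfg = TransferConfig(multipart_chunksize=256 * 1024 * 1024,   # 256 MiB parts
                     max_concurrency=16)                       # parallel part uploads
s3.upload_file(CKPT_PATH, BUCKET, KEY, Config=cfg)
```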

Egress costs and data gravity

Cloud object stores have one cost most architects forget: egress. Reading 100 PB out of AWS S3 to a non-AWS GPU cluster costs roughly $0.09/GB, or $9M for the full read. Reading the same 100 PB out of intra-region S3 to in-region EC2 is free. This is why the "training data lives in the same cloud as the GPUs" rule exists; cross-cloud or cross-region training is dominated by egress fees.

For private GPU clusters (NeoCloud, on-prem), the analog is the bandwidth between the GPU cluster's private network and whichever S3-compatible object store you run (MinIO, Ceph, vendor-specific). The bandwidth is usually plenty (10-100 GB/s of dedicated DC interconnect), but the protocol and rate limits still apply.
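The client side barely changes for a private store; a minimal sketch pointing boto3 at a hypothetical in-DC MinIO endpoint, with placeholder credentials:

```python
# Sketch: point boto3 at a private S3-compatible object store (MinIO, Ceph RGW).
# Endpoint and credentials are placeholders; the rest of the pipeline is unchanged.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.dc.internal:9000",   # hypothetical in-DC endpoint
    aws_access_key_id="TRAINING_PIPELINE",
    aws_secret_access_key="...",
)
s3.download_file("training-shards", "epoch-000/shard-000123.tar",
                 "/nvme/cache/shard-000123.tar")
```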

What this means in practice

  • Use S3 (or equivalent object store) for the cold dataset tier. Anything you do not read in the next 24 hours belongs there.
  • Front S3 with a warm cache sized for one epoch (or a few epochs, if shuffle radius requires it). NVMe-backed PFS or a dedicated cache pool both work.
  • Pre-pack tiny files into 100 MB-1 GB shards before uploading to S3. WebDataset, TFRecord, MosaicML StreamingDataset all do this. The reduction in S3 request count alone is worth the packing investment.
  • For dataset shuffling at scale: the shuffle radius interacts with the cache size. If you need a global shuffle, you cannot cache only one epoch.
  • For checkpoints, S3 is fine for cold storage of completed runs. Active checkpoints stay on PFS; the rotation policy moves them to S3 after a few days.
  • Egress costs are real; budget them. Cross-region or cross-cloud training is rarely a cost win even if the GPU rates look better.

S3 is the cheapest tier in the hierarchy. It is also the slowest. The architecture trick is to use it for what it is good at (cheap durable cold storage), not for what it is bad at (feeding GPUs at line rate).

Updated 2026-05-10