FSDP vs ZeRO Sharding
Plain data parallelism replicates the entire model on every GPU and reduces gradients across them. That works until the model's training state stops fitting on one GPU, which on an 80 GB H100 is somewhere around 5B parameters in mixed precision (BF16 weights + FP32 master weights + Adam optimizer state comes to roughly 16 bytes per parameter, and 80 GB / 16 bytes per parameter = 5B). Above roughly 10B parameters, the Adam state alone (8 bytes per parameter) exceeds HBM by itself. ZeRO and FSDP shard that state across the data-parallel ranks so only 1/N of it lives on each GPU.
The three ZeRO stages
DeepSpeed's ZeRO ("Zero Redundancy Optimizer") defines three progressively more aggressive sharding levels.
ZeRO-1 shards the optimizer state. Adam keeps two extra full-precision tensors the size of the parameters on every GPU (momentum and variance), which is twice the model size in FP32. ZeRO-1 splits these across DP ranks, leaving each GPU with only 1/N of the optimizer state. The communication cost is one extra all-gather per step: each rank applies the optimizer update only to its own parameter shard, then all-gathers the updated parameters from the other ranks. For a 70B model with DP=64, optimizer state drops from 2 * 70B * 4 bytes = 560 GB per GPU to 8.75 GB per GPU. This single change frees enough memory that you can train a model that did not fit at all before.
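A minimal sketch of what flipping ZeRO-1 on looks like in native PyTorch, using `torch.distributed.optim.ZeroRedundancyOptimizer`; the model and hyperparameters here are placeholders, launched under torchrun:

```python
# ZeRO-1 in a plain DDP script: swap the optimizer class, nothing else changes.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = DDP(nn.Linear(1024, 1024).cuda())  # stand-in for a real model

# Instead of torch.optim.Adam(model.parameters(), lr=1e-4):
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,  # each rank stores 1/N of the Adam state
    lr=1e-4,
)
```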
ZeRO-2 also shards the gradients. After backward, the gradients are reduce-scattered (each rank ends up with the gradients for its slice of parameters) instead of all-reduced. Memory savings extend to the gradient tensors. Communication cost is the same as ZeRO-1 plus the reduce-scatter (which replaces, rather than adds to, the all-reduce of gradients).
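At the collective level, ZeRO-2's gradient step looks roughly like this sketch; the shapes and the AVG reduction are illustrative, and it assumes a torchrun launch:

```python
# The reduce-scatter at the heart of ZeRO-2: each rank ends up with the
# reduced gradients for only its own 1/N parameter slice.
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
world = dist.get_world_size()

shard_numel = 1024
flat_grads = torch.randn(world * shard_numel, device="cuda")  # this rank's full gradient
my_shard = torch.empty(shard_numel, device="cuda")            # 1/N lands here

# Replaces plain DP's dist.all_reduce(flat_grads): same reduction, 1/N the output.
dist.reduce_scatter_tensor(my_shard, flat_grads, op=dist.ReduceOp.AVG)
```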
ZeRO-3 shards the parameters themselves. Each rank holds only 1/N of the model weights at rest. Before each layer's forward pass, an all-gather brings the full layer's weights to every rank; after the layer's backward, a reduce-scatter pushes the gradients back to their owning shards and frees the gathered copies. PyTorch FSDP (Fully Sharded Data Parallel) is the PyTorch-native implementation of ZeRO-3 with the same memory and communication characteristics.
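A minimal FSDP wrap, assuming a toy `Block` class standing in for a real transformer layer; each wrapped Block becomes one gather/scatter unit:

```python
# Wrapping a toy transformer in PyTorch FSDP (ZeRO-3 semantics).
import functools
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class Block(nn.Module):  # stand-in for a transformer layer
    def __init__(self, d: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(x)

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = FSDP(
    nn.Sequential(*[Block() for _ in range(8)]).cuda(),
    # Each Block becomes one all-gather / reduce-scatter unit.
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={Block}
    ),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
```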
What FSDP costs
ZeRO-3 / FSDP runs an all-gather in the forward pass, a second all-gather in the backward pass (the gathered weights are freed after forward and must be re-materialized), and a reduce-scatter of gradients, per layer. That is 3 collectives per layer instead of plain DP's single all-reduce per training step (across all layers). The total bytes moved are roughly 1.5x what plain DP moves: on a ring, an all-reduce is bandwidth-equivalent to an all-gather plus a reduce-scatter, so two all-gathers plus a reduce-scatter cost 1.5 all-reduces. The wins are entirely in memory.
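Under the standard ring-collective cost model (each all-gather or reduce-scatter moves (N-1)/N of the tensor bytes per rank; an all-reduce moves twice that), the 1.5x works out as in this sketch:

```python
# Back-of-envelope bytes moved per rank per step, ring cost model.
def bytes_per_rank(tensor_bytes: float, n: int) -> dict:
    ring = (n - 1) / n
    return {
        "plain_dp": 2 * ring * tensor_bytes,  # one all-reduce of all gradients
        "fsdp": 3 * ring * tensor_bytes,      # AG (fwd) + AG (bwd) + RS (grads)
    }

# 70B parameters in BF16 (2 bytes each), DP=64: FSDP moves ~1.5x plain DP.
print(bytes_per_rank(70e9 * 2, 64))
```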
The collectives in FSDP are small per call (one layer's parameters, typically 100 MB to 1 GB depending on model). NCCL routes them as ring or tree per the latency-bandwidth crossover. On NVLink-resident DP groups, they run at NVLink rate; on IB-resident DP groups, they run at IB rate. FSDP across IB is feasible but slower than the same model with TP inside the node and DP across nodes.
The savings on transient copies are also real but usually overlooked. With FSDP you never hold a full copy of the parameters at once: each layer's gathered weights exist only for the duration of that layer's forward and backward, then are freed. Combined with activation checkpointing, which recomputes intermediate activations instead of storing them, the per-GPU footprint beyond the shards drops to roughly one layer's worth of gathered weights and live activations.
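One way to pair the two in PyTorch is the apply_activation_checkpointing helper (note it currently lives under a private module path); this sketch reuses the `model` and `Block` from the FSDP example above:

```python
# Recompute each Block's activations in backward instead of storing them.
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

apply_activation_checkpointing(
    model,  # the FSDP-wrapped model from the previous sketch
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda m: isinstance(m, Block),  # checkpoint at Block granularity
)
```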
When FSDP fights with TP
FSDP and tensor parallelism want to shard the same tensors along potentially overlapping axes. The standard pattern is to use TP inside one NVLink domain (TP=8 inside HGX, TP=72 inside NVL72) and FSDP across DP ranks outside that domain. The TP all-reduces stay on NVLink; the FSDP all-gather and reduce-scatter cross IB at the DP boundary.
PyTorch's FSDP and DeepSpeed's ZeRO-3 both support this composition, often described as 2D parallelism. (FSDP's own "hybrid sharding" mode is a different thing: sharding within a subgroup and replicating across subgroups.) The framework builds two communicators: a TP communicator at the intra-node level and an FSDP communicator at the inter-node level. The collectives on each communicator stay on the appropriate fabric tier.
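In recent PyTorch the two communicators fall out of one DeviceMesh; a hedged sketch, assuming 8 GPUs per node and the `model` from the FSDP example above:

```python
# Two-tier composition: TP inside the node, FSDP across nodes.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

world = dist.get_world_size()
mesh = init_device_mesh("cuda", (world // 8, 8), mesh_dim_names=("dp", "tp"))

# 1) Apply tensor parallelism over mesh["tp"] first, e.g. with
#    torch.distributed.tensor.parallel.parallelize_module (not shown here).
# 2) Then shard with FSDP over mesh["dp"], so its all-gather and
#    reduce-scatter only cross the inter-node fabric.
model = FSDP(model, device_mesh=mesh["dp"])
```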
Memory math worth memorizing
For a model with P parameters in mixed-precision Adam training:
- Plain DP per GPU: P * 16 bytes (4 weight + 4 grad + 4 momentum + 4 variance, roughly).
- ZeRO-1 per GPU: P * (4 + 4 + (4 + 4) / N) bytes.
- ZeRO-2 per GPU: P * (4 + (4 + 4 + 4) / N) bytes.
- ZeRO-3 (FSDP) per GPU: P * 16 / N bytes.
For a 70B model with DP=64, ZeRO-3 puts each GPU at 70B * 16 / 64 = 17.5 GB, which fits comfortably on an H100. Plain DP would have demanded 1.12 TB per GPU, which is impossible.
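The same arithmetic as a throwaway calculator, reproducing those numbers:

```python
# Per-GPU model-state memory (GB) for the four sharding levels above.
def per_gpu_gb(p: float, n: int) -> dict:
    w, g, m, v = 4, 4, 4, 4  # FP32 bytes/param: weight, grad, momentum, variance
    return {
        "plain_dp": p * (w + g + m + v) / 1e9,
        "zero1": p * (w + g + (m + v) / n) / 1e9,
        "zero2": p * (w + (g + m + v) / n) / 1e9,
        "zero3": p * (w + g + m + v) / n / 1e9,
    }

print(per_gpu_gb(70e9, 64))
# {'plain_dp': 1120.0, 'zero1': 568.75, 'zero2': 293.125, 'zero3': 17.5}
```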
What this means in practice
- Use FSDP / ZeRO-3 by default for training runs that do not fit a model on one GPU. PyTorch FSDP is production-ready; DeepSpeed ZeRO-3 has more knobs and is more complex but supports more configurations.
- ZeRO-1 is a low-risk addition to any data-parallel run. It costs almost nothing in communication and saves 2x model size in optimizer state. Most training scripts can flip it on with no other changes.
- Combine FSDP with tensor parallelism for very large models. TP for intra-node, FSDP for inter-node. This is the dominant configuration for models in the Llama 70B and 405B class.
- Activation checkpointing pairs naturally with FSDP. Without it, activation memory dominates the per-GPU footprint and FSDP's savings evaporate.
- For framework choice: PyTorch FSDP is the path of least resistance if you are already on PyTorch native. DeepSpeed gives you more options (ZeRO-Infinity offload to CPU/NVMe, partition-by-tensor-element control), at the cost of a heavier configuration surface.
- For collective sizing: FSDP's all-gather and reduce-scatter calls are smaller than plain DP's all-reduce, so granularity matters more. In PyTorch FSDP the unit of communication is the wrapped module's flattened parameter, set by the auto-wrap policy; DeepSpeed exposes explicit allgather_bucket_size / reduce_bucket_size knobs. The defaults are usually correct; see the config sketch after this list.
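For reference, a hedged sketch of a DeepSpeed ZeRO-3 config; the key names are real DeepSpeed options, but the values are illustrative rather than tuned:

```python
# Minimal DeepSpeed ZeRO-3 config, passed to deepspeed.initialize(config=...).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 5e8,           # elements per gradient reduce-scatter
        "stage3_prefetch_bucket_size": 5e8,  # elements prefetched per all-gather
        # "offload_optimizer": {"device": "cpu"},  # ZeRO-Infinity-style offload
    },
}
```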
FSDP is the parallelism axis that lets you grow the model without growing each GPU's HBM. Every other strategy assumes you have already paid this bill.