
Sequence Parallelism

Sequence parallelism splits a single sequence's tokens across GPUs, so activation memory drops by a factor of the SP degree. Ring attention rotates K/V chunks around the SP ring.
  • Shards: token dimension of activations
  • Memory: activation memory / SP degree
  • Ring attention: K/V chunks circulate; SP-1 hops

When the context window grew from 2K to 8K to 32K to 128K to 1M tokens, activation memory grew linearly with it. A 70B model at 128K context with TP=8 still spends ~200 GB on activations per forward pass. Tensor parallelism sharded the weights but left the activations around the TP regions (LayerNorm, dropout, residual adds) replicated on every GPU in the TP group. Sequence parallelism shards those activations along the token axis so the activation memory shrinks too.

What gets sharded

Sequence parallelism splits a single batch's token sequence across GPUs. With SP=4 and a 16K context, GPU 0 holds tokens 0-3999, GPU 1 holds tokens 4000-7999, and so on. Activations for the LayerNorm, the residual adds, and the dropout are sharded along the sequence axis, since these are pointwise operations that touch each token independently.
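
A toy illustration of the split (shapes are hypothetical; a real implementation shards inside the model's forward pass rather than up front):

```python
# Sketch: sharding a 16K-token sequence across SP=4 ranks (hypothetical shapes).
import torch

seq_len, hidden, sp = 16_384, 8_192, 4
x = torch.randn(seq_len, hidden)

# Each rank keeps one contiguous 4,096-token slice of the activations.
shards = torch.chunk(x, sp, dim=0)            # rank r holds shards[r]

# Pointwise ops (LayerNorm, dropout, residual adds) need no communication:
# they touch each token independently, so they run on the local shard.
norm = torch.nn.LayerNorm(hidden)
local_out = norm(shards[0])                   # what rank 0 would compute
```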

The hard part is the attention block, which by definition wants every token to see every other token. The two practical approaches are Megatron's "tensor parallel with sequence parallel" (Megatron-SP) and the more general "ring attention" (also called context parallelism or CP).

[Figure: ring attention with SP=4. Each GPU holds one quarter of the tokens and passes its K/V chunk to its neighbor; SP-1 = 3 hops compute global attention, and activation memory falls to 1/4 of the no-SP baseline.]

Megatron-SP: shard between TP boundaries

Megatron-SP is the simpler variant: it shards activations along the sequence axis only outside of the TP-collective regions. The matmuls inside Attention (QKV projection, output projection) and MLP run sharded along the hidden dimension via TP, and the activations between them stay sharded along the sequence dimension via SP. The transitions become an all-gather on the way into each TP region and a reduce-scatter on the way out (instead of a single all-reduce); the pair moves the same bytes as the all-reduce it replaces but ends with sequence-sharded activations instead of replicated ones.
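
A rough sketch of one such transition, assuming torch.distributed collectives and pre-sharded weights (names like w_col and w_row are illustrative, not Megatron-LM's API):

```python
# Minimal Megatron-SP transition around one TP region (e.g. an MLP block).
import torch
import torch.distributed as dist

def tp_region(x_seq_sharded, w_col, w_row, tp_group):
    """x_seq_sharded: [local_tokens, hidden] -- sequence-sharded activations."""
    world = dist.get_world_size(tp_group)

    # Entering the TP region: all-gather along the sequence axis so every
    # rank sees the full sequence for its hidden-dim shard of the matmul.
    full = torch.empty(x_seq_sharded.shape[0] * world, x_seq_sharded.shape[1],
                       device=x_seq_sharded.device, dtype=x_seq_sharded.dtype)
    dist.all_gather_into_tensor(full, x_seq_sharded, group=tp_group)

    # TP matmuls: column-parallel then row-parallel, weights already sharded
    # along the hidden dimension exactly as in plain tensor parallelism.
    partial = (full @ w_col) @ w_row          # rank-local partial sum

    # Leaving the TP region: reduce-scatter instead of all-reduce, so the
    # summed result comes back sequence-sharded. all-gather + reduce-scatter
    # moves the same bytes as the all-reduce it replaces.
    out = torch.empty_like(x_seq_sharded)
    dist.reduce_scatter_tensor(out, partial, group=tp_group)
    return out
```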

The Megatron-SP win is that SP rides on the same process group as TP, so TP=8 SP=8 is achievable inside one HGX H100, and activation memory drops by another factor of 8 on top of the TP weight savings. This is the configuration that makes 32K to 128K context feasible at 70B scale on dense models.

Ring attention: full attention across SP=N

Ring attention is more general but also more expensive in communication. With SP=N, each GPU holds 1/N of the sequence's K and V tensors. To compute attention, GPU 0 needs to attend to all N chunks of K/V. Ring attention has each GPU compute attention against its local K/V first, then pass the K/V chunks to the next GPU in the ring, compute attention against that chunk, pass it on, and so on. After N-1 hops, every GPU has computed attention against all K/V chunks and the sequence-sharded result is correct.
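
A minimal sketch of that loop, assuming a single attention head, no causal mask, and torch.distributed point-to-point ops; production kernels fuse this with FlashAttention and use a log-sum-exp rescaling for numerical stability:

```python
# Ring-attention sketch: SP ranks rotate K/V chunks and accumulate the
# softmax numerator/denominator chunk by chunk.
import torch
import torch.distributed as dist

def ring_attention(q, k, v):
    """q, k, v: [local_tokens, head_dim] -- this rank's sequence shard."""
    rank, world = dist.get_rank(), dist.get_world_size()
    scale = q.shape[-1] ** -0.5

    # Running numerator / denominator of softmax(QK^T)V. (Unstabilized exp
    # for brevity; real kernels track a running max.)
    num = torch.zeros_like(q)
    den = torch.zeros(q.shape[0], 1, device=q.device, dtype=q.dtype)

    k_chunk, v_chunk = k.contiguous(), v.contiguous()
    for hop in range(world):
        # Attend to whichever K/V chunk is currently resident on this rank.
        scores = torch.exp(q @ k_chunk.T * scale)      # [local_q, chunk_kv]
        num += scores @ v_chunk
        den += scores.sum(dim=-1, keepdim=True)

        if hop < world - 1:
            # Pass the current K/V chunk to the next rank in the ring and
            # receive the previous rank's chunk: SP-1 hops in total.
            k_next, v_next = torch.empty_like(k_chunk), torch.empty_like(v_chunk)
            ops = [dist.P2POp(dist.isend, k_chunk, (rank + 1) % world),
                   dist.P2POp(dist.isend, v_chunk, (rank + 1) % world),
                   dist.P2POp(dist.irecv, k_next, (rank - 1) % world),
                   dist.P2POp(dist.irecv, v_next, (rank - 1) % world)]
            for req in dist.batch_isend_irecv(ops):
                req.wait()
            k_chunk, v_chunk = k_next, v_next

    return num / den   # sequence-sharded attention output for this rank
```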

The communication cost is (N-1) * size(K/V chunk) bytes per attention layer, sent peer-to-peer around the ring. For long sequences this is a real cost, but it scales linearly in N and the messages are point-to-point rather than collective, so they tolerate IB latency better than all-reduce.
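
A back-of-envelope estimate of that cost, assuming Llama-3-70B-like GQA dimensions (8 KV heads, head dim 128) in bf16 at 128K tokens with SP=8:

```python
# Ring-attention traffic per layer (assumed dims, not measured numbers).
sp, seq_len = 8, 131_072
kv_heads, head_dim, bytes_per = 8, 128, 2

local_tokens = seq_len // sp
kv_chunk = 2 * local_tokens * kv_heads * head_dim * bytes_per   # K and V
per_layer = (sp - 1) * kv_chunk                                  # SP-1 hops
print(f"{kv_chunk / 2**20:.0f} MiB per hop, {per_layer / 2**20:.0f} MiB per layer")
# 64 MiB per hop, 448 MiB per layer -- point-to-point, overlappable with compute
```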

Ring attention is what enables million-token contexts at production scale. Llama 3's long-context release uses a CP variant; Anthropic's published context-extension recipes use ring attention internally. For very long contexts (1M+ tokens), ring attention is effectively the only path that fits memory.

What SP costs

The activation memory savings are the main return. With SP=N:

  • LayerNorm/dropout/residual activations: scale by 1/N.
  • Attention K/V cache: scales by 1/N (back-of-envelope numbers after this list).
  • Attention output activations: scale by 1/N.
  • Megatron-SP: replaces each TP all-reduce with a reduce-scatter/all-gather pair, same byte count.
  • Ring attention: adds N-1 K/V hops per attention layer.
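
A back-of-envelope sketch of the K/V line, under the same assumed Llama-3-70B-like dimensions (80 layers, 8 KV heads, head dim 128, bf16):

```python
# K/V activation memory per GPU under SP=N (assumed dims, not measured).
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
seq_len = 131_072   # 128K context

def kv_bytes_per_gpu(sp):
    # K and V, all layers, sequence-sharded across the SP group
    return 2 * layers * (seq_len // sp) * kv_heads * head_dim * bytes_per

for sp in (1, 4, 8):
    print(f"SP={sp}: {kv_bytes_per_gpu(sp) / 2**30:.1f} GiB of K/V per GPU")
# SP=1: 40.0 GiB, SP=4: 10.0 GiB, SP=8: 5.0 GiB
```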

The communication overhead in Megatron-SP is roughly zero relative to TP-only. In ring attention, the K/V hops show up in the wall-clock when the sequence is short, because there is too little local attention compute per chunk to hide the transfers; for long sequences, the hops are amortized against the attention compute, which grows quadratically with sequence length, and become a small fraction.

What this means in practice

  • For dense models at 32K to 128K context: pair tensor parallelism with Megatron-SP. TP=8 SP=8 inside one HGX H100 keeps everything on NVLink.
  • For very long context (over 256K tokens): use ring attention or context parallelism. Megatron-LM 24.05 and PyTorch's distributed.tensor.parallel support both. The ring extends across the SP/CP group, which can span nodes if needed.
  • The communication cost of ring attention is forgiving on bandwidth but unforgiving on latency. If the ring spans nodes, the per-hop latency adds to every attention layer. Keep the ring inside one NVLink domain when possible.
  • SP composes with 3D parallelism. A common pattern for Llama 3 405B-scale runs is TP=8 SP=8 PP=4 DP=N, which works for million-token contexts on a B200/NVL72 build.
  • Activation checkpointing still helps. SP reduces the activation memory by SP degree; activation checkpointing reduces it by another factor by recomputing instead of storing. The two are independent and both worth using for very long sequences.
  • For context-window-heavy inference (long-context serving), the same SP/CP ideas apply. The compute pattern is different (decode is per-token, not per-sequence), but the memory pressure of the K/V cache is the same and ring-style sharding extends naturally.

Sequence parallelism is the parallelism axis you reach for when context length, not parameter count, is what stops fitting. It has been the backbone of essentially every long-context training run since 2024.


Updated 2026-05-10