
Expert Parallelism

Expert parallelism routes each token to its top-K experts via all-to-all dispatch. Two all-to-alls per MoE layer make EP bisection-bandwidth-bound.
  • Shards: expert pool, MoE FFN only
  • Comm: 2 all-to-alls per MoE layer (dispatch + combine)
  • Bottleneck: bisection bandwidth of the fabric

A mixture-of-experts (MoE) model replaces the dense MLP block with a pool of N experts plus a router that picks the top-K experts for each token. The model carries roughly N times more MLP parameters than a dense equivalent but runs about the same FLOPs per token, since only K of the N experts are active. Expert parallelism is the placement strategy that puts different experts on different GPUs.
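As a rough illustration, the router is just a linear layer over the hidden state followed by a softmax and a top-K selection. The sketch below is a minimal PyTorch version; the sizes (hidden_dim, num_experts, top_k) are illustrative, not taken from any particular model.

```python
# Minimal top-K router sketch (PyTorch). Sizes are illustrative.
import torch
import torch.nn.functional as F

hidden_dim, num_experts, top_k = 4096, 8, 2
router = torch.nn.Linear(hidden_dim, num_experts, bias=False)

tokens = torch.randn(16, hidden_dim)               # [num_tokens, hidden_dim]
logits = router(tokens)                            # [num_tokens, num_experts]
probs = F.softmax(logits, dim=-1)
topk_weight, topk_idx = probs.topk(top_k, dim=-1)  # each token's K experts
topk_weight = topk_weight / topk_weight.sum(-1, keepdim=True)  # renormalize over K

# topk_idx decides which expert GPU each token is dispatched to;
# topk_weight is used in the combine step to mix the K expert outputs.
```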

What gets sharded and how the data moves

Expert parallelism (EP) shards the expert pool. With 8 experts and EP=8, each GPU owns one expert's weights. The router on each GPU computes routing weights for its tokens, then needs to send each token to the GPUs that own the chosen experts.

That "send each token to its expert" is an all-to-all collective. Each GPU sends a different subset of its tokens to each other GPU, and receives a different subset of the global token pool that is destined for its local expert. The expert FFN computes on those received tokens. Then a second all-to-all sends the FFN outputs back to the GPUs that originated them, where they are combined with the other-expert outputs (weighted by the router) to form the final layer output.

[Figure: tokens from GPUs 0-3 flow through an all-to-all dispatch to experts 0-3 (one FFN per GPU), then an all-to-all combine returns the results to the originating GPUs. Two all-to-alls per MoE layer; bisection-bandwidth-bound. Mixtral-8x7B, GPT-4 MoE, and DeepSeek V3 all run this pattern.]

That is two all-to-alls per MoE layer. For a model with M MoE layers, that is 2M all-to-alls per forward pass, and roughly the same again in the backward pass. On Mixtral-8x7B (32 layers, all MoE), that is 64 all-to-alls per forward pass. On DeepSeek V3 (61 layers, mostly MoE), it is north of 100.

Why EP is bisection-bandwidth-bound

Each all-to-all moves roughly tokens_per_step * top_k * hidden_dim * bytes_per_element of data across the EP group. For a 70B-class MoE at 4K context and DP=64, this is on the order of 100s of MB per all-to-all per layer. Multiplied by two all-to-alls per layer and tens of layers per step, the total all-to-all traffic per step rivals the gradient all-reduce traffic of the dense parts.
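The arithmetic behind that estimate is simple to reproduce. The numbers below are one illustrative back-of-the-envelope setting (micro-batch, context length, hidden size, and dtype are assumptions, not measurements):

```python
# Back-of-the-envelope all-to-all traffic per MoE layer (illustrative numbers).
tokens_per_gpu = 4 * 4096      # micro-batch of 4 sequences at 4K context (assumed)
hidden_dim     = 8192          # 70B-class hidden size (assumed)
bytes_per_elem = 2             # bf16 activations
top_k          = 2             # each token is sent to K experts

bytes_per_all_to_all = tokens_per_gpu * top_k * hidden_dim * bytes_per_elem
print(f"{bytes_per_all_to_all / 1e6:.0f} MB per all-to-all per GPU")
# ~537 MB here; double it for dispatch + combine, then multiply by the number
# of MoE layers for this rank's per-step total.
```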

The all-to-all pattern is what bounds EP: every GPU sends to every other GPU in the EP group, which means the traffic crosses the worst-case cut of the fabric. On a rail-optimized fat-tree, the bisection bandwidth is what determines how fast the all-to-all completes. For a 1024-GPU H100 cluster on NDR with no oversubscription (1:1), that is roughly 12.8 TB/s, which gets divided across all the concurrent all-to-alls.

If the EP group spans many nodes, the all-to-all crosses IB and runs at IB rates. If the EP group fits inside one NVL72, the all-to-all stays on NVLink and runs ~ 10x faster. The placement decision matters more for EP than for any other parallelism axis.

Top-K routing and capacity factors

Mixtral-style MoE uses top-2 routing: each token is sent to its 2 highest-scoring experts out of 8. DeepSeek V3 goes wider, routing each token to 8 of its 256 routed experts. The router's choice introduces load imbalance: some experts get more tokens than others. To bound this, frameworks set a "capacity factor" (typically 1.0 to 1.25) that caps the maximum tokens per expert at (tokens / num_experts) * capacity_factor. Tokens that exceed an expert's capacity are dropped (their MoE output replaced by zero) or routed to a fallback.

The capacity factor matters for the all-to-all traffic. Higher capacity = more padding (some experts get fewer than capacity tokens, but the all-to-all moves the full capacity per expert) = more bytes on the wire. Lower capacity = more drops = potentially lower model quality. Practical training at scale tunes this between 1.0 and 1.25.
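A minimal sketch of how the capacity cap turns into dropped tokens is below. It uses top-1 routing and random expert assignments purely for brevity; real implementations also pad each expert's buffer up to the capacity, which is where the extra bytes on the wire come from.

```python
# Expert capacity and token dropping, simplified (illustrative sizes).
import torch

num_tokens, num_experts, capacity_factor = 8192, 8, 1.25
capacity = int(num_tokens / num_experts * capacity_factor)  # max tokens per expert

expert_idx = torch.randint(num_experts, (num_tokens,))      # top-1 routing for brevity

# Position of each token within its expert's buffer, in arrival order.
one_hot = torch.nn.functional.one_hot(expert_idx, num_experts)
position_in_expert = one_hot.cumsum(0).gather(1, expert_idx.unsqueeze(1)).squeeze(1) - 1

kept = position_in_expert < capacity  # tokens beyond the cap are dropped
print(f"capacity={capacity}, dropped={int((~kept).sum())} of {num_tokens}")
```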

How EP combines with TP and DP

EP shards the expert pool. Inside one expert, you might still want tensor parallelism if the expert FFN is too large for one GPU. So TP * EP applies on the MoE block. Outside the MoE block (attention, LayerNorm), TP applies but EP does not. The framework manages two separate communicators: a TP communicator for intra-expert sharding and an EP communicator for cross-expert all-to-all.

DP applies on top of EP. Each DP replica runs its own EP group. The standard configuration for a large MoE is world_size = TP * EP * DP, where TP=8 covers the dense parts inside the node, EP=8-64 covers the experts (typically across nodes within a pod), and DP fills the rest. See 3D parallelism for the combined picture.
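In torch.distributed terms, those separate communicators are just separate process groups. The sketch below shows one way to carve world_size = TP * EP * DP into TP and EP groups; the rank layout (TP fastest-varying, then EP, then DP) is an illustrative assumption, and real frameworks such as Megatron-LM have their own group-building logic.

```python
# Sketch: building TP and EP process groups for world_size = TP * EP * DP.
# Assumes dist.init_process_group() has already run on every rank, and that
# every rank executes these new_group() calls in the same order.
import torch.distributed as dist

TP, EP, DP = 8, 8, 4
world_size = TP * EP * DP

tp_groups, ep_groups = [], []
for dp in range(DP):
    for ep in range(EP):
        # Ranks that shard one expert's FFN with tensor parallelism.
        ranks = [dp * EP * TP + ep * TP + t for t in range(TP)]
        tp_groups.append(dist.new_group(ranks))
    for t in range(TP):
        # Ranks that exchange tokens in the EP all-to-all (one per expert shard).
        ranks = [dp * EP * TP + ep * TP + t for ep in range(EP)]
        ep_groups.append(dist.new_group(ranks))

# Each rank later selects the TP group and EP group it belongs to; DP gradient
# all-reduce groups are built the same way on the remaining axis.
```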

What this means in practice

  • EP is the right tool when your model is a mixture-of-experts. It is not a dense-model alternative to TP or PP.
  • The dominant cost of EP is bisection bandwidth, not local compute. Place EP groups within one fabric pod (rail-optimized fat-tree leaf or one NVL72 rack) to keep the all-to-alls fast.
  • For very large MoE (256+ experts), EP will span the cluster. The latency of all-to-all across many nodes becomes a real fraction of the step time. NCCL's all-to-all is bisection-bound; framework-level overlap of all-to-all with compute helps but does not eliminate the cost.
  • Capacity factor and top-K are model design choices, but they directly affect the all-to-all bytes. Profile both. DeepSpeed-MoE and Megatron-LM both expose these as training config.
  • Combine with sequence parallelism for long-context MoE. Each SP rank holds a slice of the sequence, and the all-to-all operates on the local slice.
  • For inference: most MoE inference uses static expert placement (expert i lives on GPU i) and a single all-to-all per token batch. Throughput-tuned inference uses much higher EP degrees than training.
  • Debug EP all-to-all performance with NCCL profiling: each all-to-all has a profile entry. If the all-to-all is much slower than the underlying bisection bandwidth predicts, the cause is usually fragmented placement (EP group spanning multiple fabric pods) or capacity factor inflation.

EP is the parallelism axis that exists because MoE exists. Its cost is the all-to-all. Its placement rule is the bisection.

Updated 2026-05-10