Tensor Parallelism
A transformer is mostly matmuls. Two big ones per layer (Attention's QKV projection and output projection) and two more in the MLP. When the model gets too large to fit one matmul on one GPU, the answer is to split the matmul itself across GPUs. That is tensor parallelism.
What gets sharded
Take a single linear layer Y = X · W, where X is shape [B, H] and W is shape [H, H']. Tensor parallelism splits W either column-wise (each GPU holds a slice W[:, k:k+H'/N]) or row-wise (each GPU holds W[k:k+H/N, :]). Megatron-LM's canonical pattern is "column-parallel then row-parallel": the first matmul of the MLP shards columns (each GPU computes its own slice of the output columns with no communication), the second shards rows (each GPU produces a partial sum of the full output, and a single all-reduce at the end combines them).
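A minimal NumPy sketch of the two splits, simulating the TP group with array slices (the shapes are toy values, not anything from a real model):

```python
import numpy as np

B, H, H_out, N = 4, 8, 16, 2          # batch, hidden, output dim, TP degree
X = np.random.randn(B, H)
W = np.random.randn(H, H_out)
Y = X @ W                              # unsharded reference

# Column-parallel: each "GPU" holds a slice of W's columns and produces its
# own slice of Y's columns with no communication. The full Y is the concat.
Y_col = np.concatenate(
    [X @ W[:, k * H_out // N:(k + 1) * H_out // N] for k in range(N)], axis=1
)

# Row-parallel: each "GPU" holds a slice of W's rows and sees only the
# matching slice of X. Each produces a partial sum of the full Y; the
# partials must be summed (an all-reduce in the real thing).
Y_row = sum(
    X[:, k * H // N:(k + 1) * H // N] @ W[k * H // N:(k + 1) * H // N, :]
    for k in range(N)
)

assert np.allclose(Y, Y_col) and np.allclose(Y, Y_row)
```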
The bookkeeping at the boundaries is what TP costs you. After every column-parallel matmul, the next operation that uses the full output (a non-linearity, a bias add, the next matmul if it expects the full hidden dimension) needs the partials concatenated or summed. Megatron's trick is to fuse column-parallel + row-parallel back-to-back so the only collective is one all-reduce per pair of matmuls. The MLP block ends up with one all-reduce per forward pass, and the backward pass mirrors that with one all-reduce of the gradients.
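The fused pattern is easy to check the same way: because the non-linearity is elementwise, it can be applied to each GPU's local slice, and nothing has to cross GPUs until the final sum, which stands in for the all-reduce. A sketch under the same toy assumptions:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

B, H, F, N = 4, 8, 32, 2               # batch, hidden, FFN dim, TP degree
X = np.random.randn(B, H)
W1 = np.random.randn(H, F)             # column-parallel
W2 = np.random.randn(F, H)             # row-parallel

reference = gelu(X @ W1) @ W2

partials = []
for k in range(N):
    W1_k = W1[:, k * F // N:(k + 1) * F // N]   # this GPU's columns of W1
    W2_k = W2[k * F // N:(k + 1) * F // N, :]   # the matching rows of W2
    partials.append(gelu(X @ W1_k) @ W2_k)      # fully local, no collective yet
Y = sum(partials)                                # the one all-reduce per MLP block

assert np.allclose(reference, Y)
```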
Why TP is NVLink-only
The all-reduce per block is small in absolute terms (one matmul output of one transformer layer, typically a few MB on H100), but it happens in every layer's MLP and attention blocks, in the forward pass and again in the backward pass. For a 32-layer model, that is 64 MLP all-reduces per training step per TP group, plus the same number again from the attention blocks. The latency of every one of those collectives adds directly to the step time.
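The count is plain arithmetic (same 32-layer example):

```python
layers = 32
blocks_per_layer = 2   # one all-reduce each for the attention block and the MLP block
passes = 2             # forward + backward
print(layers * blocks_per_layer * passes)   # 128 all-reduces per step, per TP group
```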
NVLink runs these at ~360 GB/s effective per-direction bandwidth on H100 with sub-microsecond latency. InfiniBand NDR runs the same all-reduce at ~50 GB/s per port and ~1-2 microseconds of per-message latency floor. The gap on small TP messages is dominated by the latency floor, not the bandwidth, and the resulting slowdown is roughly 50x for typical TP collective sizes. This is the placement rule: TP groups must fit inside one NVLink domain. See topology-aware placement.
The practical TP degree is bounded by the size of the NVLink domain. On HGX H100, that is 8 GPUs (TP=8 max). On NVL72, the limit goes up to 72, though most workloads still pick TP=8 or TP=16 for memory and addressing reasons.
When TP earns its keep
TP earns its place when the model's individual matmuls are too big to fit on one GPU's HBM. A 70B parameter model in BF16 has roughly 140 GB of weights, which does not fit in one H100's 80 GB. With TP=8, each GPU holds 17.5 GB of weights, which leaves room for activations and optimizer state. For larger models (175B, 405B, 671B), TP becomes a hard requirement, not a tuning choice.
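The arithmetic that fixes the TP degree is worth writing down; a quick sketch counting only the raw BF16 weights (activations, optimizer state, and the small replicated tensors like norms are ignored):

```python
def weight_gb_per_gpu(params_billion, tp, bytes_per_param=2):
    """Per-GPU weight footprint in GB for a dense model sharded TP ways."""
    return params_billion * 1e9 * bytes_per_param / tp / 1e9

for tp in (1, 2, 4, 8):
    print(f"TP={tp}: {weight_gb_per_gpu(70, tp):.1f} GB of weights per GPU")
# TP=1: 140.0   TP=2: 70.0   TP=4: 35.0   TP=8: 17.5
# Only TP=8 leaves meaningful headroom in 80 GB of HBM for activations
# and optimizer state.
```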
Beyond memory, TP also reduces the per-matmul wall-clock. A [B, H] x [H, H'] matmul that takes T seconds on one GPU takes roughly T/N on a TP=N group, modulo the all-reduce overhead. The all-reduce is small relative to the matmul itself when H and H' are large, so TP scales nearly linearly inside the NVLink domain.
Where TP fights with other strategies
TP combines naturally with sequence parallelism, which shards the activation tensors along the sequence dimension to reduce activation memory. TP and SP together is what Megatron calls "tensor parallel with sequence parallel" and is the standard configuration for large dense models.
TP combines with pipeline parallelism along an orthogonal axis: TP groups stay inside a node, PP groups span nodes within a pod. This is the TP × PP layer of 3D parallelism.
TP does not combine cleanly with FSDP/ZeRO-3 because both want to shard the same tensors. The standard pattern is TP for intra-node sharding and FSDP for cross-node sharding, where FSDP's all-gather happens at the TP group boundary.
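A hedged sketch of that composition using PyTorch's device-mesh API; the mesh shape and dimension names below are my own choices for a hypothetical 4-node × 8-GPU job, and the script assumes a matching torchrun launch:

```python
from torch.distributed.device_mesh import init_device_mesh

# 4 nodes x 8 GPUs: "tp" stays inside each node's NVLink domain,
# "dp" spans nodes and is what FSDP/ZeRO shards over.
mesh = init_device_mesh("cuda", (4, 8), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh["tp"]   # collectives on this dimension ride NVLink
dp_mesh = mesh["dp"]   # collectives on this dimension cross the node fabric

# tp_mesh is handed to the tensor-parallel APIs (see the sketch under
# "What this means in practice"); dp_mesh is handed to FSDP, so its
# all-gathers never cut across a TP group.
```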
What this means in practice
- Set TP degree based on the size of your NVLink domain. TP=8 on HGX H100, TP=16 or higher on NVL72 only when the model demands it.
- Verify the TP group lives inside one NVLink domain via `nvidia-smi topo -m` and NCCL's startup log. If NCCL routes TP collectives over IB (NET/IB transport), the placement is broken and the step time will be ~50x slower.
- Megatron-LM and DeepSpeed both implement the column-then-row pattern, and in those frameworks the column-vs-row choice is made for you; you just choose the degree. PyTorch's native `torch.distributed.tensor.parallel` implements the same pattern, with the plan spelled out per layer (see the sketch after this list).
- For activation memory pressure (long sequences, large batch), TP alone is not enough. Add sequence parallelism on top.
- The hidden dimension `H` must be divisible by the TP degree. Models are usually designed with `H` a multiple of 128, so TP=8 or TP=16 always divides cleanly. Custom architectures with awkward dimensions break TP.
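A hedged sketch of the PyTorch-native version on a toy two-layer MLP. The module and its sizes are made up; `parallelize_module`, `ColwiseParallel`, and `RowwiseParallel` are the actual entry points in `torch.distributed.tensor.parallel`, and the script assumes a single-node, 8-GPU torchrun launch:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class ToyMLP(nn.Module):
    def __init__(self, hidden=4096, ffn=16384):
        super().__init__()
        self.up = nn.Linear(hidden, ffn)    # column-parallel
        self.act = nn.GELU()
        self.down = nn.Linear(ffn, hidden)  # row-parallel

    def forward(self, x):
        return self.down(self.act(self.up(x)))

tp_mesh = init_device_mesh("cuda", (8,))   # TP=8 inside one NVLink domain
mlp = parallelize_module(
    ToyMLP(),
    tp_mesh,
    {"up": ColwiseParallel(), "down": RowwiseParallel()},  # column-then-row
)
```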
TP is the first parallelism axis you reach for when the model stops fitting. Everything else builds on top of it.