
Topology-Aware Placement

Place TP groups inside one NVLink domain, DP groups across IB. Wrong placement (TP across IB) costs ~50x. Slurm topology.conf and k8s topology hints make it automatic.
  • Match: comm pattern to physical hierarchy
  • Wrong-tier penalty: TP across IB ~50x slower
  • Tools: Slurm topology.conf, k8s topology zone labels

A model defines a parallelism strategy. The strategy implies a communication pattern: tensor parallel groups talk a lot, often, in small chunks; data parallel groups talk less often but with bigger payloads; expert parallel groups talk in bursts to specific peers. The cluster has a physical hierarchy: NVLink inside the node or rack, IB between racks. Topology-aware placement is the act of matching one to the other.

The wrong placement, in numbers

Take a TP=4 group on H100. Inside one HGX node, this group runs on NVLink at ~360 GB/s effective per-direction bandwidth, so each TP all-reduce of a hidden-state slice takes a few microseconds. Now place the same TP=4 group across two HGX nodes (ranks 0 and 1 on node 0, ranks 2 and 3 on node 1). Every TP all-reduce now crosses IB at the slowest link: roughly 50 GB/s per IB port. The raw bandwidth ratio is 360 / 50 = 7.2x, and the per-message latency floor on IB (~1 µs versus ~0.1 µs on NVLink) pushes the cost of small messages up further still. For typical TP message sizes (a few MB), the end-to-end slowdown is ~50x compared to the NVLink-resident case. For a TP-heavy training step, this is the difference between a model that trains in 30 days and one that trains in 4 years.
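A rough way to reason about the gap is an alpha-beta cost model (per-message latency plus size over bandwidth). The sketch below uses the illustrative numbers above; it only captures the latency and bandwidth terms, so treat it as a lower bound on the cross-node penalty rather than a reproduction of the end-to-end ~50x figure, which also includes protocol overhead and contention effects.

```python
# Alpha-beta estimate of one ring all-reduce on a 4-rank TP group.
# All constants are illustrative assumptions, not measurements.

def allreduce_time(msg_bytes, ranks, latency_s, bw_bytes_per_s):
    """Ring all-reduce: 2*(ranks-1) steps, each moving msg_bytes/ranks."""
    steps = 2 * (ranks - 1)
    per_step = latency_s + (msg_bytes / ranks) / bw_bytes_per_s
    return steps * per_step

MSG = 4 * 2**20                                          # 4 MiB slice (assumed)
NVLINK = dict(latency_s=0.1e-6, bw_bytes_per_s=360e9)    # ~0.1 us, ~360 GB/s
IB     = dict(latency_s=1.0e-6, bw_bytes_per_s=50e9)     # ~1 us,   ~50 GB/s

t_nvl = allreduce_time(MSG, 4, **NVLINK)
t_ib  = allreduce_time(MSG, 4, **IB)
# This simple model recovers roughly the bandwidth ratio; the measured
# end-to-end penalty in the wrong placement is larger.
print(f"NVLink: {t_nvl*1e6:.1f} us   IB: {t_ib*1e6:.1f} us   ratio: {t_ib/t_nvl:.1f}x")
```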

[Diagram: "right" placement keeps each TP=4 group on NVLink inside node 0 and node 1, with DP over IB (1.0x); "wrong" placement splits the TP group across nodes so TP runs over IB, ~50x slower. Match comm pattern to physical hierarchy: tight collectives on NVLink, loose ones on IB.]

The placement rule

Different parallelism strategies have different bandwidth needs. The hierarchy that matches H100 / B200 hardware is:

  • TP (tensor parallel): very small messages, very frequent, latency-sensitive. Place inside one NVLink domain (HGX node, NVL72 rack). Bandwidth ceiling: NVLink5 1.8 TB/s per GPU.
  • EP (expert parallel): all-to-all bursts to specific peers, medium message size. Place across nodes if the EP group spans them, but try to keep within one rail-optimized fat-tree leaf to limit cross-cut bandwidth.
  • PP (pipeline parallel): point-to-point sends, medium frequency, medium size. Tolerates IB; can cross node and rack boundaries with a modest cost. Place across nodes within the same fabric pod.
  • DP (data parallel): large all-reduces, lower frequency. Tolerates IB well, especially in rail-optimized form. Place across nodes; the per-rail isolation makes DP all-reduce run at full per-rail speed.

Most production training jobs combine these: TP=4 or TP=8 inside the node (or up to TP=72 inside an NVL72), PP=2-8 across nodes within a pod, DP across all remaining GPUs.
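One way to see how this combination is usually wired up: global ranks are ordered so that TP is the innermost (fastest-varying) dimension, which keeps each TP group on consecutive local ranks of one node. A minimal sketch, with hypothetical parallelism sizes:

```python
# Map a global rank to (dp, pp, tp) coordinates with TP innermost
# (fastest-varying). Sizes are hypothetical; the point is the ordering:
# consecutive ranks share a TP group, so launching 8 ranks per node
# keeps every TP group inside one NVLink domain.

TP, PP, DP = 8, 2, 4        # assumed parallelism degrees -> 64 GPUs
GPUS_PER_NODE = 8           # one HGX node

def coords(rank):
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

for rank in range(TP * PP * DP):
    tp = coords(rank)[2]
    first_in_group = rank - tp
    # every member of this rank's TP group lands on the same node
    assert rank // GPUS_PER_NODE == first_in_group // GPUS_PER_NODE
```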

How tools encode this

Slurm: the topology.conf file describes the cluster as a tree of switches and nodes. Slurm's topology plugin uses this to allocate nodes that minimize network distance for the job. The --switches constraint in sbatch tells Slurm the job needs all its nodes within a given number of switch hops; --switches=1 keeps a multi-node job under one leaf. For tight TP, though, the constraint that matters is the node boundary itself: a TP=8 job on H100 should get all 8 GPUs from one node (e.g., --ntasks-per-node=8 with --gres=gpu:8), not one GPU from each of 8 nodes.
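As an illustration of translating a parallelism strategy into a Slurm request, here is a hedged sketch. The sbatch flags shown (--nodes, --ntasks-per-node, --gres, --switches) are standard options, but the values and the assumption of 8 GPUs per node are placeholders for your cluster:

```python
# Sketch: derive a Slurm request that keeps TP inside a node and fills
# whole nodes. Flag names are standard sbatch options; values are assumed.

def sbatch_directives(tp, pp, dp, gpus_per_node=8, max_leaf_switches=1):
    world = tp * pp * dp
    assert tp <= gpus_per_node, "TP must fit inside one NVLink domain"
    assert world % gpus_per_node == 0, "fill whole nodes to avoid fragmentation"
    nodes = world // gpus_per_node
    return [
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",   # one rank per GPU
        f"#SBATCH --gres=gpu:{gpus_per_node}",          # the whole node's GPUs
        f"#SBATCH --switches={max_leaf_switches}",      # nodes under one leaf
    ]

print("\n".join(sbatch_directives(tp=8, pp=2, dp=4)))
```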

Kubernetes: the standard topology labels (topology.kubernetes.io/zone, topology.kubernetes.io/region, plus custom labels like nvidia.com/nvlink-domain) drive the scheduler's placement. NVIDIA's GPU operator and the NCCL team publish patterns for k8s pod affinity rules that keep TP groups co-located. For DP, anti-affinity rules spread replicas across racks for fault isolation.
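A sketch of the co-location stanza, written as a Python dict mirroring the pod-spec YAML. The nvidia.com/nvlink-domain topology key is the custom node label mentioned above and must actually exist on your nodes; the app and tp-group selector labels are hypothetical:

```python
# Pod affinity sketch: schedule all pods of one TP group into the same
# NVLink domain by matching on a shared label within one topology domain.
# Selector labels and the topology key are illustrative assumptions.

tp_group_affinity = {
    "podAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {
                    "matchLabels": {"app": "trainer", "tp-group": "0"},
                },
                "topologyKey": "nvidia.com/nvlink-domain",
            }
        ]
    }
}

# For DP replicas, the inverse idea applies: podAntiAffinity keyed on
# topology.kubernetes.io/zone spreads them across racks for fault isolation.
```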

NCCL: even with correct placement, NCCL's communicator topology must match the physical hierarchy. The library auto-detects the hardware topology (the same information nvidia-smi topo -m displays), but for non-trivial deployments you may need to set NCCL_TOPO_FILE or NCCL_SOCKET_IFNAME to point it at the right topology description and network interfaces. A misconfigured NCCL communicator will route TP traffic over IB even when NVLink is available.
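A minimal sketch of pinning NCCL to the intended interfaces and making it log its transport choices at startup. The environment variables are real NCCL knobs, but the interface name and topology file path below are placeholders for your cluster:

```python
import os

# Set NCCL knobs before the communicator is created.
# "ib0" and the topo file path are placeholder values, not real ones.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")
os.environ.setdefault("NCCL_TOPO_FILE", "/etc/nccl/topo.xml")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # startup log shows the transport per rank pair

# ...then initialize the process group as usual, e.g.:
# torch.distributed.init_process_group(backend="nccl")
```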

What goes wrong in practice

  • The single most common operational mistake is fragmented allocation: a job requests 16 GPUs and gets 2 from each of 8 nodes instead of 8 from each of 2 nodes. This hammers IB even for TP-sized groups. Fix with --ntasks-per-node=8 (Slurm) or by requesting all 8 GPUs in a single pod per node plus pod affinity (k8s); a fail-fast check is sketched after this list.
  • NUMA mismatch inside a node: GPUs attached to different NUMA domains add host-memory bounce overhead whenever traffic has to be staged through the CPU. NCCL's auto-detection usually handles this, but always check that nvidia-smi topo -m reports NV-connected links (NV12, NV18, or similar, depending on generation) between all GPUs in a TP group.
  • MIG slices (MIG partitioning) bypass NVLink. Two MIG slices on the same physical GPU communicate over PCIe, which is much slower than NVLink. If you are using MIG for ML, do not put TP groups on MIG slices.
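A defensive check at job start catches the fragmentation problem before it costs a day of wall-clock: gather each rank's hostname and assert that every TP group lives on one host. A minimal sketch, assuming torch.distributed is already initialized and TP is the innermost dimension of the rank order:

```python
import socket
import torch.distributed as dist

TP = 8  # assumed tensor-parallel degree, innermost in the rank ordering

def assert_tp_colocated():
    """Fail fast if any TP group spans more than one host."""
    world = dist.get_world_size()
    hosts = [None] * world
    dist.all_gather_object(hosts, socket.gethostname())
    for start in range(0, world, TP):
        group_hosts = set(hosts[start:start + TP])
        assert len(group_hosts) == 1, (
            f"TP group starting at rank {start} spans hosts {group_hosts}: "
            "this placement will route TP all-reduces over IB"
        )
```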

What this means in practice

  • Before launching a job, sketch the parallelism strategy and ask: "for each communication pattern, which physical tier does it need?" If the answer is NVLink, the placement must keep the group inside one NVLink domain. If it is IB, place it across nodes inside one rail-optimized pod.
  • Use Slurm's --switches=N and --ntasks-per-node=8 (or k8s pod affinity) to enforce the placement, not to advise it. The scheduler will respect hard constraints; it ignores soft hints for large jobs.
  • Verify with nvidia-smi topo -m and NCCL's startup log. NCCL prints which transport (P2P, NVLink, NET/IB) it is using for each rank pair. If TP rank pairs show NET/IB, the placement is wrong.
  • The wrong-tier penalty compounds across the training step. A 50x slowdown on TP all-reduce, repeated thousands of times per step, dominates the wall-clock even if everything else is fast.
  • For bisection bandwidth-bound workloads (MoE, all-to-all-heavy), placement also affects which side of the cut your traffic lives on. Keep all-to-all groups inside one fabric pod.

Topology-aware placement is the operator decision that makes the whole interconnect chapter pay off. The wires are good; the placement decides whether your workload uses them.

Updated 2026-05-10