Topology-Aware Placement
Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
What it is
Topology-aware placement is a scheduling optimization that places multi-GPU workloads on GPUs sharing the fastest interconnect paths. It considers the full hardware topology hierarchy: NVLink domains (GPUs connected via NVSwitch, 900 GB/s per GPU on H100), NUMA affinity (GPUs and NICs attached to the same CPU socket and PCIe root complex), and inter-node network switch locality. Slurm's topology plugin and Kubernetes' Topology Manager (topology-manager-policy set to best-effort or restricted) both support this, drawing on hwloc data and nvidia-smi topo -m output.
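To make the hierarchy concrete, the sketch below queries NVML (via the pynvml bindings, assuming a recent nvidia-ml-py that exposes the topology calls) for the common-ancestor level of every GPU pair on one host. This is the PCIe/NUMA portion of the hierarchy that nvidia-smi topo -m summarizes as PIX/PXB/PHB/NODE/SYS; NVLink adjacency is reported separately. The labels and grouping logic are illustrative, not how Slurm's plugin or the Kubernetes Topology Manager actually implement placement.

    # Sketch: classify the PCIe/NUMA path between every GPU pair with NVML,
    # the same hierarchy nvidia-smi topo -m summarizes as PIX/PXB/PHB/NODE/SYS.
    # Assumes nvidia-ml-py (pynvml) and a driver exposing the topology APIs.
    import pynvml

    # Map NVML common-ancestor levels to readable labels (illustrative).
    LEVELS = {
        pynvml.NVML_TOPOLOGY_INTERNAL: "same board",
        pynvml.NVML_TOPOLOGY_SINGLE: "single PCIe switch",
        pynvml.NVML_TOPOLOGY_MULTIPLE: "multiple PCIe switches",
        pynvml.NVML_TOPOLOGY_HOSTBRIDGE: "same host bridge",
        pynvml.NVML_TOPOLOGY_NODE: "same NUMA node",
        pynvml.NVML_TOPOLOGY_SYSTEM: "cross-socket path (slowest)",
    }

    def pairwise_topology():
        """Return {(i, j): label} describing the path between each GPU pair."""
        pynvml.nvmlInit()
        try:
            count = pynvml.nvmlDeviceGetCount()
            handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
            pairs = {}
            for i in range(count):
                for j in range(i + 1, count):
                    level = pynvml.nvmlDeviceGetTopologyCommonAncestor(
                        handles[i], handles[j])
                    pairs[(i, j)] = LEVELS.get(level, f"level {level}")
            return pairs
        finally:
            pynvml.nvmlShutdown()

    if __name__ == "__main__":
        for (i, j), label in sorted(pairwise_topology().items()):
            print(f"GPU{i} <-> GPU{j}: {label}")

A scheduler consuming this data would favor GPU sets whose worst pairwise level never falls to the cross-socket case, then apply the same idea one tier up for NIC and leaf-switch locality.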
Why it matters
Placing an 8-GPU training job on GPUs that span two NVLink domains, rather than filling a single DGX H100, forces NCCL AllReduce to traverse PCIe instead of NVLink for half of its communication, reducing collective throughput by 30-40%. NUMA misaffinity adds 200-500 ns of latency per RDMA operation. A 32-GPU job scattered across switch tiers instead of packed under one leaf switch incurs a 25% throughput penalty, a silent cost that shows up as slow training rather than an error.
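To see where figures like these come from, the back-of-envelope sketch below applies the standard ring-AllReduce cost model, roughly 2 * (N - 1) / N * S / B, with the slowest link setting B. The bandwidth numbers and the 7B-parameter gradient size are illustrative assumptions, not measurements; real NCCL runs split traffic over multiple rings and trees, so observed penalties are smaller than this single-ring worst case.

    # Back-of-envelope: a ring AllReduce is bottlenecked by its slowest link.
    # time ~ 2 * (N - 1) / N * S / B, where S is bytes per GPU and B is the
    # bandwidth of the slowest link in the ring. Numbers below are assumptions.

    def ring_allreduce_seconds(num_gpus: int, bytes_per_gpu: float, gbps: float) -> float:
        """Estimated ring-AllReduce time given the slowest link's bandwidth (GB/s)."""
        return 2 * (num_gpus - 1) / num_gpus * bytes_per_gpu / (gbps * 1e9)

    GRAD_BYTES = 2 * 7e9    # e.g. gradients for a 7B-parameter model in fp16
    NVLINK_GBPS = 400       # assumed effective NVLink bus bandwidth per GPU
    PCIE_GBPS = 50          # assumed effective PCIe Gen5 x16 bandwidth

    t_fast = ring_allreduce_seconds(8, GRAD_BYTES, NVLINK_GBPS)
    t_slow = ring_allreduce_seconds(8, GRAD_BYTES, PCIE_GBPS)
    print(f"AllReduce, NVLink-only ring : {t_fast * 1e3:6.1f} ms")
    print(f"AllReduce, one PCIe hop     : {t_slow * 1e3:6.1f} ms "
          f"({t_slow / t_fast:.0f}x slower on this ring)")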
How to monitor
Verify placement quality with nvidia-smi topo -m, confirming GPU-NIC NUMA affinity and NVLink domain membership, and correlate NCCL collective throughput with the placement the scheduler actually assigned. Factryze maintains a continuously updated topology model that accounts for degraded NVLinks and failed NVSwitch ASICs, feeding adjusted topology weights into scheduling decisions so placement optimizes for actual current connectivity rather than the designed blueprint.
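As a minimal sketch of the raw signals involved, the snippet below uses pynvml to count inactive NVLink lanes per GPU and reads each GPU's NUMA node from sysfs. The function names come from nvidia-ml-py; the sysfs handling and the simple "degraded" flag are assumptions for illustration, not Factryze's actual topology model.

    # Sketch: per-GPU NVLink health and NUMA affinity, the raw signals a
    # topology model can use to down-weight degraded paths when scheduling.
    # Assumes nvidia-ml-py (pynvml) on a Linux host with sysfs PCI entries.
    import os
    import pynvml

    def numa_node_of(bus_id: str) -> int:
        """Read the NUMA node of a PCI device from sysfs (-1 if unknown)."""
        # NVML reports an 8-digit PCI domain; sysfs uses 4 digits.
        domain, rest = bus_id.lower().split(":", 1)
        path = f"/sys/bus/pci/devices/{domain[-4:]}:{rest}/numa_node"
        try:
            with open(path) as f:
                return int(f.read().strip())
        except OSError:
            return -1

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            pci = pynvml.nvmlDeviceGetPciInfo(handle)
            bus_id = pci.busId.decode() if isinstance(pci.busId, bytes) else pci.busId

            active = inactive = 0
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
                except pynvml.NVMLError:
                    continue  # link not present on this GPU/SKU
                if state == pynvml.NVML_FEATURE_ENABLED:
                    active += 1
                else:
                    inactive += 1

            print(f"GPU{i}  NUMA node {numa_node_of(bus_id)}  "
                  f"NVLinks active={active} inactive={inactive}"
                  + ("  <-- degraded" if inactive else ""))
    finally:
        pynvml.nvmlShutdown()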
Related terms
NVSwitch: NVIDIA's NVLink switch enabling all-to-all GPU communication.
Network fabric: The physical interconnect topology connecting all nodes in a cluster.
Gang scheduling: Atomic co-scheduling of all of a job's GPUs for distributed training requiring a synchronized start.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.