NVSwitch
NVIDIA's NVLink switch enabling all-to-all GPU communication.
What it is
NVSwitch is NVIDIA's dedicated NVLink switching chip that connects all GPUs within a node into a fully non-blocking, all-to-all topology. In DGX H100 systems, four third-generation NVSwitch chips give every GPU pair full bidirectional NVLink bandwidth without multi-hop routing. Third-generation NVSwitch also supports in-network compute operations such as multicast and SHARP reductions.
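A quick sanity check on what "full bandwidth for every GPU pair" means in practice, using the published NVLink 4 figures for H100 (link count and per-link rate here are assumptions drawn from NVIDIA's spec sheets, not from this glossary):

```python
# Illustrative arithmetic for an H100 GPU's NVLink fabric attachment.
# Assumed figures: 18 NVLink 4 links per GPU, 50 GB/s bidirectional per link.
LINKS_PER_GPU = 18
GBPS_PER_LINK_BIDIR = 50

per_gpu_nvlink_bw = LINKS_PER_GPU * GBPS_PER_LINK_BIDIR
print(per_gpu_nvlink_bw)  # 900 GB/s of aggregate bidirectional bandwidth
```

The NVSwitch fabric is sized so that all eight GPUs can drive this full 900 GB/s simultaneously, which is what "non-blocking" means here.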
Why it matters
A failed or degraded NVSwitch ASIC reduces inter-GPU bandwidth for the entire node, not just a single link: every GPU pair whose traffic routes through that chip is affected. In-network compute features such as SHARP reductions are disabled when an NVSwitch fails, increasing AllReduce latency for every training job on the node. NVSwitch failure is often silent until NCCL throughput degrades measurably.
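To see why losing in-network reduction hurts, compare the per-GPU wire traffic of a ring AllReduce with a switch-side (SHARP-style) reduction. This is the textbook estimate, not a measurement of any particular system:

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: int, n_gpus: int) -> float:
    # A ring AllReduce sends 2 * (n - 1) / n of the payload from each GPU
    # (reduce-scatter pass plus all-gather pass).
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

def in_network_allreduce_bytes_per_gpu(payload_bytes: int) -> float:
    # With the reduction done inside the switch, each GPU sends its payload
    # once and receives the reduced result once (send-side traffic only,
    # to match the ring estimate above).
    return float(payload_bytes)

payload = 1 << 30  # a 1 GiB gradient bucket
print(ring_allreduce_bytes_per_gpu(payload, 8))   # 1.75x the payload
print(in_network_allreduce_bytes_per_gpu(payload))  # 1x the payload
```

Falling back from in-network reduction to a ring thus nearly doubles per-GPU traffic at node scale, on top of whatever raw bandwidth the failed switch took away.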
How to monitor
Monitor NVSwitch health via the NVIDIA Fabric Manager logs and the nvswitch-audit utility shipped alongside Fabric Manager. Correlate per-link NVLink bandwidth drops (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL) across multiple GPUs on the same node to distinguish a switch-level fault from a cable-level one: a bad cable degrades a single GPU's links, while a bad switch degrades many GPUs at once. Factryze incorporates NVSwitch health data into its topology model and adjusts scheduling and bandwidth expectations accordingly.
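The correlation step can be sketched as a small classifier over per-GPU bandwidth samples you have already collected (for example, DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL readings). The function name, input shape, and thresholds below are illustrative assumptions, not a real DCGM or Factryze API:

```python
# Hypothetical sketch: distinguish a switch-level from a cable-level NVLink
# fault by counting how many GPUs on one node fall below an expected
# bandwidth. The 0.8 degraded-ratio threshold is an assumption.

def classify_nvlink_fault(gbps_by_gpu: dict[str, float],
                          expected_gbps: float,
                          degraded_ratio: float = 0.8) -> str:
    """Return 'switch-level', 'cable-level', or 'healthy'."""
    degraded = [gpu for gpu, bw in gbps_by_gpu.items()
                if bw < degraded_ratio * expected_gbps]
    if not degraded:
        return "healthy"
    # Many degraded GPUs on one node point at a shared NVSwitch ASIC;
    # a single degraded GPU points at one link or cable.
    return "switch-level" if len(degraded) > 1 else "cable-level"

readings = {"gpu0": 410.0, "gpu1": 405.0, "gpu2": 880.0, "gpu3": 390.0}
print(classify_nvlink_fault(readings, expected_gbps=900.0))  # switch-level
```

A production version would also consult the fabric topology to check that the degraded GPUs actually share a switch, rather than inferring it from counts alone.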
Related terms
NVLink: NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
Topology-aware scheduling: Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
Firmware updates: Updating GPU InfoROM, VBIOS, and NVSwitch firmware during scheduled maintenance windows.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.