NVSwitch
NVIDIA's NVLink switch enabling all-to-all GPU communication.
What it is
NVSwitch is NVIDIA's dedicated NVLink switching chip that connects all GPUs within a node into a fully non-blocking, all-to-all topology. In DGX H100 systems, four third-generation NVSwitch chips give every GPU pair full bidirectional NVLink bandwidth without multi-hop routing. Third-generation NVSwitch also supports in-network compute operations such as multicast and SHARP reductions.
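A quick sanity check on what "full bandwidth for every GPU pair" means in practice, using the published NVLink 4 figures for H100 (link count and per-link rate here are assumptions drawn from NVIDIA's spec sheets, not from this glossary):

```python
# Illustrative arithmetic for an H100 GPU's NVLink fabric attachment.
# Assumed figures: 18 NVLink 4 links per GPU, 50 GB/s bidirectional per link.
LINKS_PER_GPU = 18
GBPS_PER_LINK_BIDIR = 50

per_gpu_nvlink_bw = LINKS_PER_GPU * GBPS_PER_LINK_BIDIR
print(per_gpu_nvlink_bw)  # 900 GB/s of aggregate bidirectional bandwidth
```

The NVSwitch fabric is sized so that all eight GPUs can drive this full 900 GB/s simultaneously, which is what "non-blocking" means here.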
Why it matters
A failed or degraded NVSwitch ASIC reduces inter-GPU bandwidth for the entire node, not just a single link: every GPU pair whose traffic routes through that chip is affected. In-network compute features such as SHARP reductions are disabled when an NVSwitch fails, increasing AllReduce latency for every training job on the node. NVSwitch failure is often silent until NCCL throughput degrades measurably.
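To see why losing in-network reduction hurts, compare the per-GPU wire traffic of a ring AllReduce with a switch-side (SHARP-style) reduction. This is the textbook estimate, not a measurement of any particular system:

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: int, n_gpus: int) -> float:
    # A ring AllReduce sends 2 * (n - 1) / n of the payload from each GPU
    # (reduce-scatter pass plus all-gather pass).
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

def in_network_allreduce_bytes_per_gpu(payload_bytes: int) -> float:
    # With the reduction done inside the switch, each GPU sends its payload
    # once and receives the reduced result once (send-side traffic only,
    # to match the ring estimate above).
    return float(payload_bytes)

payload = 1 << 30  # a 1 GiB gradient bucket
print(ring_allreduce_bytes_per_gpu(payload, 8))   # 1.75x the payload
print(in_network_allreduce_bytes_per_gpu(payload))  # 1x the payload
```

Falling back from in-network reduction to a ring thus nearly doubles per-GPU traffic at node scale, on top of whatever raw bandwidth the failed switch took away.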
How to monitor
Monitor NVSwitch health via the NVIDIA Fabric Manager logs and the nvswitch-audit utility shipped alongside Fabric Manager. Correlate per-link NVLink bandwidth drops (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL) across multiple GPUs on the same node to distinguish a switch-level fault from a cable-level one: a bad cable degrades a single GPU's links, while a bad switch degrades many GPUs at once. Factryze incorporates NVSwitch health data into its topology model and adjusts scheduling and bandwidth expectations accordingly.
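The correlation step can be sketched as a small classifier over per-GPU bandwidth samples you have already collected (for example, DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL readings). The function name, input shape, and thresholds below are illustrative assumptions, not a real DCGM or Factryze API:

```python
# Hypothetical sketch: distinguish a switch-level from a cable-level NVLink
# fault by counting how many GPUs on one node fall below an expected
# bandwidth. The 0.8 degraded-ratio threshold is an assumption.

def classify_nvlink_fault(gbps_by_gpu: dict[str, float],
                          expected_gbps: float,
                          degraded_ratio: float = 0.8) -> str:
    """Return 'switch-level', 'cable-level', or 'healthy'."""
    degraded = [gpu for gpu, bw in gbps_by_gpu.items()
                if bw < degraded_ratio * expected_gbps]
    if not degraded:
        return "healthy"
    # Many degraded GPUs on one node point at a shared NVSwitch ASIC;
    # a single degraded GPU points at one link or cable.
    return "switch-level" if len(degraded) > 1 else "cable-level"

readings = {"gpu0": 410.0, "gpu1": 405.0, "gpu2": 880.0, "gpu3": 390.0}
print(classify_nvlink_fault(readings, expected_gbps=900.0))  # switch-level
```

A production version would also consult the fabric topology to check that the degraded GPUs actually share a switch, rather than inferring it from counts alone.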
Related terms
NVLink: NVIDIA's high-bandwidth interconnect for GPU-to-GPU communication.
Topology-aware scheduling: Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
Firmware updates: Updating GPU InfoROM, VBIOS, and NVSwitch firmware during scheduled maintenance windows.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.