Networking

InfiniBand

High-bandwidth, low-latency network fabric for GPU clusters.

What it is

InfiniBand is a high-performance networking technology widely used in GPU clusters for inter-node communication, offering up to 400 Gb/s per port (NDR) with sub-microsecond latency and native RDMA support. Its lossless, credit-based flow control and hardware offload of collective operations via SHARP make it the preferred interconnect for large-scale distributed training where AllReduce latency directly impacts training throughput.
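To see why port bandwidth dominates distributed training, consider the standard bandwidth-only model of ring AllReduce: each rank moves roughly 2(N-1)/N times the payload over its link. The sketch below is a back-of-the-envelope estimate, not a measurement; the function name and the example job size are hypothetical, and it ignores latency, protocol overhead, and NCCL's actual algorithm selection.

```python
def ring_allreduce_seconds(payload_bytes: int, num_ranks: int, link_gbps: float) -> float:
    """Bandwidth-term estimate for ring AllReduce; ignores latency and overheads."""
    link_bytes_per_s = link_gbps * 1e9 / 8          # Gb/s -> bytes/s
    wire_bytes = 2 * (num_ranks - 1) / num_ranks * payload_bytes
    return wire_bytes / link_bytes_per_s

# Hypothetical job: 1 GiB of gradients across 8 ranks on an NDR (400 Gb/s) port.
t = ring_allreduce_seconds(2**30, 8, 400)
print(f"{t * 1e3:.1f} ms")  # per-iteration communication floor, ~tens of ms
```

Halving effective link bandwidth (e.g., a port renegotiating down after errors) doubles this communication floor for every iteration, which is why physical-layer health matters so much at scale.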

Why it matters

InfiniBand link quality directly determines inter-node collective throughput. Symbol errors, link-downed events, and growing port error counters signal physical-layer degradation that hurts AllReduce performance across all jobs sharing the affected switch port. Unlike NVLink errors, which are confined to a single node, a single degraded IB port can affect every node attached to the same switch.

How to monitor

Query InfiniBand port error counters via perfquery or the ibdiagnet utility, watching SymbolErrorCounter, LinkDownedCounter, and PortRcvErrors. Correlate IB link events with NCCL timeout logs to identify whether inter-node or intra-node bandwidth is the bottleneck. Factryze ingests IB fabric telemetry alongside DCGM metrics to provide unified cross-stack visibility.
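As a sketch of what that monitoring looks like in practice, the snippet below parses perfquery-style counter output and flags the error counters worth alerting on. The sample text is illustrative (its values are made up, though the line format mirrors real perfquery output), and the function and variable names are assumptions, not part of any standard tooling.

```python
import re

# Illustrative sample modeled on `perfquery` output; values are hypothetical.
SAMPLE = """\
# Port counters: Lid 42 port 1
PortSelect:......................1
SymbolErrorCounter:..............12
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............1
PortRcvErrors:...................3
PortXmitDiscards:................0
"""

# Counters whose growth signals physical-layer degradation.
WATCHED = {"SymbolErrorCounter", "LinkDownedCounter", "PortRcvErrors"}

def parse_perfquery(text: str) -> dict:
    """Return {counter_name: value} from perfquery-style 'Name:....value' lines."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\.*(\d+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def degraded_counters(counters: dict) -> dict:
    """Return only the watched counters that are nonzero."""
    return {k: v for k, v in counters.items() if k in WATCHED and v > 0}

alerts = degraded_counters(parse_perfquery(SAMPLE))
print(alerts)  # nonzero watched counters for this port
```

In a real deployment you would diff successive snapshots rather than alert on absolute values, since some counters accumulate from boot and only their growth rate indicates an active problem.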

Figure: InfiniBand fat-tree network topology.
