InfiniBand
High-bandwidth, low-latency network fabric for GPU clusters.
What it is
InfiniBand is a high-performance networking technology widely used in GPU clusters for inter-node communication, offering up to 400 Gb/s per port (NDR) with sub-microsecond latency and native RDMA support. Its lossless, credit-based flow control and hardware offload of collective operations via SHARP make it the preferred interconnect for large-scale distributed training where AllReduce latency directly impacts training throughput.
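To see why per-port bandwidth bounds AllReduce time, consider a back-of-envelope sketch. The assumptions here are illustrative and not from this page: ring algorithm, 8 nodes, a 1 GB gradient bucket, and the inter-node port as the sole bottleneck.

```python
# Rough estimate of ring AllReduce time on one inter-node link.
def ring_allreduce_seconds(payload_bytes: float, nodes: int, link_gbps: float) -> float:
    """Ring AllReduce sends roughly 2*(N-1)/N of the payload over each link."""
    link_bytes_per_s = link_gbps * 1e9 / 8                # Gb/s -> bytes/s
    traffic_bytes = 2 * (nodes - 1) / nodes * payload_bytes
    return traffic_bytes / link_bytes_per_s

# Healthy NDR port (400 Gb/s) vs. the same port degraded to an effective 100 Gb/s:
print(f"{ring_allreduce_seconds(1e9, 8, 400) * 1e3:.0f} ms")   # ~35 ms
print(f"{ring_allreduce_seconds(1e9, 8, 100) * 1e3:.0f} ms")   # ~140 ms
```

Every AllReduce in a data-parallel step pays this cost, so a port running at a fraction of its rated bandwidth slows the whole training loop, not just one transfer.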
Why it matters
InfiniBand link quality directly determines inter-node collective throughput. Symbol errors, link-downed events, and growing port error counters signal physical-layer degradation that drags down AllReduce performance for every job sharing the affected switch port. Unlike an NVLink error, which is confined to a single node, one degraded IB port can slow multiple nodes attached to the same switch.
How to monitor
Query InfiniBand port error counters via perfquery or the ibdiagnet utility, watching SymbolErrorCounter, LinkDownedCounter, and PortRcvErrors. Correlate IB link events with NCCL timeout logs to identify whether inter-node or intra-node bandwidth is the bottleneck. Factryze ingests IB fabric telemetry alongside DCGM metrics to provide unified cross-stack visibility.
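A minimal polling sketch along these lines is shown below. It assumes perfquery (from infiniband-diags) is on PATH and prints counters as "Name:......value" lines; the watched counter set, polling interval, and any-growth threshold are illustrative choices, not tuned recommendations.

```python
# Poll perfquery periodically and flag counters that are growing.
import re
import subprocess
import time

WATCHED = {"SymbolErrorCounter", "LinkDownedCounter", "PortRcvErrors"}
COUNTER_RE = re.compile(r"^(\w+):\.*(\d+)\s*$")

def read_counters() -> dict[str, int]:
    # perfquery with no arguments reads the local port; pass a LID/port to query others.
    out = subprocess.run(["perfquery"], capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m and m.group(1) in WATCHED:
            counters[m.group(1)] = int(m.group(2))
    return counters

def watch(interval_s: int = 60) -> None:
    prev = read_counters()
    while True:
        time.sleep(interval_s)
        cur = read_counters()
        for name, value in cur.items():
            delta = value - prev.get(name, 0)
            if delta > 0:
                print(f"{name} grew by {delta} in the last {interval_s}s -- inspect the link")
        prev = cur

if __name__ == "__main__":
    watch()
```

Any sustained growth in these counters during training is worth correlating with NCCL timeout timestamps before blaming the model or the dataloader.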
Related terms
RDMA: Direct GPU memory access across the network, bypassing CPU copies.
SHARP: In-network compute for accelerating collective operations.
Network fabric: The physical interconnect topology connecting all nodes in a cluster.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.