InfiniBand
High-bandwidth, low-latency network fabric for GPU clusters.
What it is
InfiniBand is a high-performance networking technology widely used in GPU clusters for inter-node communication, offering up to 400 Gb/s per port (NDR) with sub-microsecond latency and native RDMA support. Its lossless, credit-based flow control and hardware offload of collective operations via SHARP make it the preferred interconnect for large-scale distributed training where AllReduce latency directly impacts training throughput.
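To see why per-port bandwidth bounds AllReduce time, consider a back-of-envelope sketch. The assumptions here are illustrative and not from this page: ring algorithm, 8 nodes, a 1 GB gradient bucket, and the inter-node port as the sole bottleneck.

```python
# Rough estimate of ring AllReduce time on one inter-node link.
def ring_allreduce_seconds(payload_bytes: float, nodes: int, link_gbps: float) -> float:
    """Ring AllReduce sends roughly 2*(N-1)/N of the payload over each link."""
    link_bytes_per_s = link_gbps * 1e9 / 8                # Gb/s -> bytes/s
    traffic_bytes = 2 * (nodes - 1) / nodes * payload_bytes
    return traffic_bytes / link_bytes_per_s

# Healthy NDR port (400 Gb/s) vs. the same port degraded to an effective 100 Gb/s:
print(f"{ring_allreduce_seconds(1e9, 8, 400) * 1e3:.0f} ms")   # ~35 ms
print(f"{ring_allreduce_seconds(1e9, 8, 100) * 1e3:.0f} ms")   # ~140 ms
```

Every AllReduce in a data-parallel step pays this cost, so a port running at a fraction of its rated bandwidth slows the whole training loop, not just one transfer.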
Why it matters
InfiniBand link quality directly determines inter-node collective throughput. Symbol errors, link-downed events, and growing port error counters signal physical-layer degradation that drags down AllReduce performance for every job sharing the affected switch port. Unlike an NVLink error, which is confined to a single node, one degraded IB port can slow multiple nodes attached to the same switch.
How to monitor
Query InfiniBand port error counters via perfquery or the ibdiagnet utility, watching SymbolErrorCounter, LinkDownedCounter, and PortRcvErrors. Correlate IB link events with NCCL timeout logs to identify whether inter-node or intra-node bandwidth is the bottleneck. Factryze ingests IB fabric telemetry alongside DCGM metrics to provide unified cross-stack visibility.
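A minimal polling sketch along these lines is shown below. It assumes perfquery (from infiniband-diags) is on PATH and prints counters as "Name:......value" lines; the watched counter set, polling interval, and any-growth threshold are illustrative choices, not tuned recommendations.

```python
# Poll perfquery periodically and flag counters that are growing.
import re
import subprocess
import time

WATCHED = {"SymbolErrorCounter", "LinkDownedCounter", "PortRcvErrors"}
COUNTER_RE = re.compile(r"^(\w+):\.*(\d+)\s*$")

def read_counters() -> dict[str, int]:
    # perfquery with no arguments reads the local port; pass a LID/port to query others.
    out = subprocess.run(["perfquery"], capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m and m.group(1) in WATCHED:
            counters[m.group(1)] = int(m.group(2))
    return counters

def watch(interval_s: int = 60) -> None:
    prev = read_counters()
    while True:
        time.sleep(interval_s)
        cur = read_counters()
        for name, value in cur.items():
            delta = value - prev.get(name, 0)
            if delta > 0:
                print(f"{name} grew by {delta} in the last {interval_s}s -- inspect the link")
        prev = cur

if __name__ == "__main__":
    watch()
```

Any sustained growth in these counters during training is worth correlating with NCCL timeout timestamps before blaming the model or the dataloader.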
Related terms
RDMA: Direct GPU memory access across the network, bypassing CPU copies.
SHARP: In-network compute for accelerating collective operations.
Network fabric: The physical interconnect topology connecting all nodes in a cluster.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.