Network Fabric
The physical interconnect topology connecting all nodes in a cluster.
What it is
Network fabric refers to the complete physical network topology connecting compute nodes in a GPU cluster, including switches, cables, optics, and their arrangement -- fat-tree, dragonfly, or rail-optimized. The fabric design determines bisection bandwidth, latency characteristics, and the blast radius of a switch or link failure.
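To make the bisection-bandwidth point concrete, the back-of-the-envelope sketch below estimates the min-cut of a two-tier leaf-spine fat-tree. The port counts and 400 Gb/s link speed are hypothetical, chosen only for illustration:

```python
# Minimal sketch: bisection bandwidth of a two-tier leaf-spine fat-tree.
# Port counts and link speeds are hypothetical.

def bisection_bandwidth_gbps(num_leaves: int, num_spines: int,
                             uplink_gbps: float) -> float:
    """Each leaf uplinks once to every spine; bisecting the fabric cuts
    half of those leaf-to-spine links, so the min-cut is
    (num_leaves * num_spines / 2) * uplink_gbps."""
    return num_leaves * num_spines / 2 * uplink_gbps

# Example: 16 leaves, 4 spines, 400 Gb/s uplinks.
healthy = bisection_bandwidth_gbps(16, 4, 400)
one_spine_down = bisection_bandwidth_gbps(16, 3, 400)
print(f"healthy: {healthy:.0f} Gb/s")
print(f"one spine down: {one_spine_down:.0f} Gb/s "
      f"({one_spine_down / healthy:.0%} of full)")
```

Losing one of four spines leaves 75% of bisection bandwidth; the fewer the spines, the larger the blast radius of a single failure.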
Why it matters
A single failed spine switch in a fat-tree fabric removes a proportional share of bisection bandwidth -- 50% in a two-spine design, 1/N with N spines -- for every node pair whose routes traversed it, degrading every training job in the cluster simultaneously. Optic degradation is gradual and silent, appearing as intermittent CRC errors that accumulate into NCCL timeouts. Fabric health is a cluster-wide concern -- one bad cable can affect hundreds of GPUs.
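Because that degradation is gradual, a rate-of-change check on cumulative CRC/symbol error counters can catch a failing optic before jobs start timing out. A minimal sketch, where the sample history and alert threshold are assumptions rather than values from any real fabric:

```python
# Minimal sketch: flag a link whose CRC/symbol error *rate* is rising.
# The counter samples and the alert threshold below are hypothetical.

def error_rate_alerts(samples: list[tuple[float, int]],
                      max_errors_per_hour: float = 10.0) -> list[str]:
    """samples: (unix_timestamp, cumulative_error_count) pairs, oldest
    first. Returns alerts for intervals exceeding the threshold rate."""
    alerts = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        hours = (t1 - t0) / 3600
        if hours <= 0:
            continue
        rate = (c1 - c0) / hours
        if rate > max_errors_per_hour:
            alerts.append(f"{rate:.1f} errors/hour between t={t0} and t={t1}")
    return alerts

# A silently degrading optic: errors trickle in, then accelerate.
history = [(0, 0), (3600, 2), (7200, 5), (10800, 40), (14400, 120)]
for alert in error_rate_alerts(history):
    print(alert)
```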
How to monitor
Track per-port link error rates and optic power levels via InfiniBand subnet manager telemetry and switch SNMP/OpenConfig counters. Monitor switch buffer utilization for congestion hotspots. Factryze ingests fabric telemetry as a first-class data source alongside DCGM, correlating switch-level events with per-GPU performance degradation to identify fabric root causes.
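On Linux hosts, the HCA-side counters are readable without vendor tooling: the kernel exposes per-port error counters under /sys/class/infiniband. The sketch below polls a few of them; optic power levels are not in sysfs and need vendor tools (for example ethtool -m or mlxlink), so they are omitted here:

```python
# Minimal sketch: poll InfiniBand per-port error counters from sysfs.
# Counter names follow the standard kernel interface; the device and
# port layout on your hosts may differ.
from pathlib import Path

WATCHED = ("symbol_error", "link_error_recovery",
           "port_rcv_errors", "link_downed")

def read_port_counters(base: str = "/sys/class/infiniband") -> dict[str, int]:
    """Return {'device:port:counter': value} for all watched counters."""
    readings = {}
    for dev in Path(base).glob("*"):
        for port in (dev / "ports").glob("*"):
            for name in WATCHED:
                f = port / "counters" / name
                if f.exists():
                    readings[f"{dev.name}:port{port.name}:{name}"] = \
                        int(f.read_text().strip())
    return readings

if __name__ == "__main__":
    for key, value in sorted(read_port_counters().items()):
        if value > 0:  # nonzero error counters deserve a look
            print(f"{key} = {value}")
```

Counters are cumulative since the last reset, so poll twice and diff the readings to turn them into rates.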
Related terms
InfiniBand -- High-bandwidth, low-latency network fabric for GPU clusters.
Adaptive Routing -- Dynamic path selection in network switches to avoid congestion.
Topology-Aware Scheduling -- Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.