Networking

Network Fabric

The physical interconnect topology connecting all nodes in a cluster.

What it is

Network fabric refers to the complete physical network topology connecting compute nodes in a GPU cluster: the switches, cables, and optics, and their arrangement into a design such as fat-tree, dragonfly, or rail-optimized. The fabric design determines bisection bandwidth, latency characteristics, and the blast radius of a switch or link failure.

Why it matters

A single failed spine switch in a fat-tree fabric removes a proportional share of bisection bandwidth (half, in a two-spine design), degrading every training job in the cluster simultaneously as traffic rebalances onto the surviving spines. Optic degradation is gradual and silent, appearing as intermittent CRC errors that accumulate into NCCL timeouts. Fabric health is therefore a cluster-wide concern: one bad cable can affect hundreds of GPUs.
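The bandwidth arithmetic behind a spine failure can be sketched in a few lines. This is a minimal model (with made-up port counts and link speeds), assuming a two-tier leaf-spine fabric where every leaf has one uplink to every spine and ECMP spreads inter-leaf traffic evenly:

```python
# Hypothetical example: bisection bandwidth before and after losing one
# spine in a two-tier leaf-spine fabric. Numbers are illustrative only.

def bisection_bandwidth_gbps(n_spines: int, n_leaves: int, uplink_gbps: float) -> float:
    """Aggregate capacity crossing the worst-case cut that splits the
    leaves in half; each leaf on one side reaches the other side only
    through its n_spines uplinks."""
    return n_spines * (n_leaves // 2) * uplink_gbps

healthy = bisection_bandwidth_gbps(n_spines=2, n_leaves=8, uplink_gbps=400)
degraded = bisection_bandwidth_gbps(n_spines=1, n_leaves=8, uplink_gbps=400)
loss_pct = 100 * (1 - degraded / healthy)
print(f"healthy: {healthy:.0f} Gbps, one spine down: {degraded:.0f} Gbps "
      f"({loss_pct:.0f}% loss)")
```

With more spines the per-failure loss shrinks to roughly 1/n_spines, which is why larger spine counts trade cost for smaller blast radius.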

How to monitor

Track per-port link error rates and optic power levels via InfiniBand subnet manager telemetry and switch SNMP/OpenConfig counters. Monitor switch buffer utilization for congestion hotspots. Factryze ingests fabric telemetry as a first-class data source alongside DCGM, correlating switch-level events with per-GPU performance degradation to identify fabric root causes.
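As one concrete approach, a watchdog can poll per-port error counters and flag ports whose deltas are climbing. The sketch below assumes a Linux host where the InfiniBand HCA exposes counters under `/sys/class/infiniband/<device>/ports/<port>/counters/`; the counter names mirror the standard sysfs entries, but the polling cadence and threshold are made-up examples:

```python
# Illustrative sketch: flag InfiniBand ports whose link error counters
# are rising between two polls. The sysfs path is the standard Linux
# location; the threshold of 10 errors per interval is arbitrary.
from pathlib import Path

COUNTER_DIR = "/sys/class/infiniband/{dev}/ports/{port}/counters"
WATCHED = ("symbol_error", "port_rcv_errors", "link_downed")

def read_counters(dev: str, port: int) -> dict[str, int]:
    """Read the watched cumulative counters for one port."""
    base = Path(COUNTER_DIR.format(dev=dev, port=port))
    return {name: int((base / name).read_text()) for name in WATCHED}

def rising_errors(prev: dict[str, int], curr: dict[str, int],
                  threshold: int = 10) -> list[str]:
    """Return counters whose delta since the previous poll exceeds threshold."""
    return [name for name in WATCHED
            if curr.get(name, 0) - prev.get(name, 0) > threshold]

# Two synthetic samples one polling interval apart; symbol_error jumped by 80,
# the signature of a degrading optic or cable.
prev = {"symbol_error": 100, "port_rcv_errors": 3, "link_downed": 0}
curr = {"symbol_error": 180, "port_rcv_errors": 4, "link_downed": 0}
print(rising_errors(prev, curr))
```

In practice the same delta logic applies to counters pulled from the subnet manager or switch SNMP tables; sysfs is just the simplest host-local source.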

[Diagram: Network Fabric - Leaf-Spine Architecture]

Monitor this automatically

Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
