Adaptive Routing
Dynamic path selection in network switches to avoid congestion.
What it is
Adaptive routing is a network switch feature that selects each packet's path dynamically, based on real-time port congestion and load, instead of following a fixed routing table. In fat-tree or dragonfly GPU cluster topologies, adaptive routing distributes traffic across all available uplinks and can improve the aggregate bandwidth achieved across the bisection by 15-30%.
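The difference between a fixed path choice and a load-aware one can be sketched in a few lines. This is a toy model, not how any particular switch ASIC works: the hash-based `static_uplink` stands in for a fixed routing decision, the queue depths are simple byte counters, and all function names are hypothetical.

```python
import zlib

def static_uplink(flow_id: str, num_uplinks: int) -> int:
    """Fixed choice: hash the flow ID and ignore current load (assumed baseline)."""
    return zlib.crc32(flow_id.encode()) % num_uplinks

def adaptive_uplink(queue_depths: list) -> int:
    """Load-aware choice: pick the uplink with the shallowest egress queue."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

def simulate(flows, num_uplinks: int, adaptive: bool) -> list:
    """Place each (flow_id, size) on an uplink and return the per-uplink load."""
    queue_depths = [0] * num_uplinks
    for flow_id, size in flows:
        if adaptive:
            port = adaptive_uplink(queue_depths)
        else:
            port = static_uplink(flow_id, num_uplinks)
        queue_depths[port] += size
    return queue_depths

# Eight equal-sized flows over four uplinks (illustrative numbers only).
flows = [(f"gpu{i}->gpu{i + 8}", 10) for i in range(8)]
static_load = simulate(flows, 4, adaptive=False)
adaptive_load = simulate(flows, 4, adaptive=True)
print("static:  ", static_load)
print("adaptive:", adaptive_load)
```

In this model the adaptive placement always ends up evenly spread, while the hashed placement may leave some uplinks idle and others overloaded, depending on how the flow IDs happen to hash.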
Why it matters
Adaptive routing can cause packet reordering, because packets of the same flow may take paths with different latencies. Reordering interacts poorly with RoCE, whose RDMA transport typically expects in-order delivery, and it may interfere with SHARP tree configurations. Poorly tuned thresholds can also make traffic oscillate between paths, which increases effective latency instead of reducing it. In large-scale training, misconfigured adaptive routing shows up as jittery AllReduce completion times rather than as a consistent drop in throughput.
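A small worked example may help show where the reordering comes from. The sketch below assumes a single flow whose packets are sent one time unit apart over two paths with different one-way latencies; the latencies, the per-packet path choices, and the function names are made up for illustration.

```python
def arrival_order(send_times, path_latency, path_choice):
    """Return packet sequence numbers in the order they reach the receiver."""
    arrivals = [
        (send_times[seq] + path_latency[path_choice[seq]], seq)
        for seq in range(len(send_times))
    ]
    arrivals.sort()
    return [seq for _, seq in arrivals]

def count_out_of_order(order):
    """Count packets that arrive after a packet with a higher sequence number."""
    highest_seen = -1
    late = 0
    for seq in order:
        if seq < highest_seen:
            late += 1
        else:
            highest_seen = seq
    return late

send_times = [0.0, 1.0, 2.0, 3.0]   # one packet per time unit
path_latency = [5.0, 2.5]           # path 0 is slower than path 1

pinned = arrival_order(send_times, path_latency, [0, 0, 0, 0])
sprayed = arrival_order(send_times, path_latency, [0, 1, 0, 1])

print(pinned, count_out_of_order(pinned))    # [0, 1, 2, 3] 0
print(sprayed, count_out_of_order(sprayed))  # [1, 0, 3, 2] 2
```

Pinning the flow to one path preserves order regardless of that path's latency. Alternating between paths reorders packets whenever the latency gap exceeds the spacing between sends.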
How to monitor
Monitor per-port utilization on spine and leaf switches to confirm that traffic is distributing across uplinks as expected. Watch the variance of NCCL collective completion times: high jitter combined with a consistent average bandwidth often indicates path oscillation caused by aggressive adaptive routing thresholds. Factryze correlates collective timing jitter with fabric telemetry to identify adaptive routing configuration issues.
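The two checks described above can be approximated with simple statistics. The following is a minimal sketch, assuming you already collect per-collective completion times, windowed average bandwidth, and per-uplink utilization from your own tooling; the function names and both thresholds are hypothetical and would need tuning for a real cluster.

```python
import statistics

def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean (0.0 for a zero mean)."""
    mean = statistics.fmean(samples)
    return statistics.pstdev(samples) / mean if mean else 0.0

def uplink_imbalance(utilization):
    """Ratio of the busiest uplink to the mean; 1.0 means a perfectly even spread."""
    mean = statistics.fmean(utilization)
    return max(utilization) / mean if mean else 0.0

def looks_like_path_oscillation(completion_ms, windowed_bw_gbps,
                                jitter_threshold=0.15, bw_threshold=0.05):
    """Flag high completion-time jitter alongside steady average bandwidth.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    jittery = coefficient_of_variation(completion_ms) > jitter_threshold
    steady_bw = coefficient_of_variation(windowed_bw_gbps) < bw_threshold
    return jittery and steady_bw

# Illustrative samples only.
completion_ms = [10, 14, 9, 15, 10, 16]      # per-AllReduce completion times
windowed_bw_gbps = [180, 182, 179, 181]      # average bandwidth per window
uplink_util = [0.5, 0.5, 0.5, 0.5]           # fraction of line rate per uplink

print(looks_like_path_oscillation(completion_ms, windowed_bw_gbps))
print(uplink_imbalance(uplink_util))
```

An imbalance ratio well above 1.0 suggests traffic is not spreading across uplinks, while a positive oscillation flag with balanced uplinks points toward threshold tuning as the next thing to inspect.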
Related terms
The physical interconnect topology connecting all nodes in a cluster.
In-network compute for accelerating collective operations.
Lost network packets indicating congestion or hardware errors.