Networking

Adaptive Routing

Dynamic path selection in network switches to avoid congestion.

What it is

Adaptive routing is a network switch feature that selects each packet's output path dynamically, based on real-time port congestion and load, rather than following a fixed routing table entry. In fat-tree or dragonfly GPU cluster topologies, adaptive routing distributes traffic across all available uplinks and can improve effective aggregate bisection bandwidth by roughly 15-30%, depending on the traffic pattern.
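
The difference between static and adaptive selection can be illustrated with a toy model. The sketch below is not how any particular switch implements the feature: the function names are hypothetical, queues never drain, and "pick the shallowest queue" stands in for whatever congestion signal real hardware uses.

```python
import zlib


def static_port(flow_id: str, num_uplinks: int) -> int:
    """Static selection: hash the flow identifier and ignore current load."""
    return zlib.crc32(flow_id.encode()) % num_uplinks


def adaptive_port(queue_depths: list[int]) -> int:
    """Adaptive selection (toy): pick the uplink with the shallowest queue."""
    return min(range(len(queue_depths)), key=lambda p: queue_depths[p])


def simulate(flows, num_uplinks: int, adaptive: bool) -> list[int]:
    """Return packets queued per uplink after enqueueing every flow.

    Assumption: queues never drain, so the result is simply the number of
    packets each uplink was asked to carry.
    """
    depths = [0] * num_uplinks
    for flow_id, packets in flows:
        for _ in range(packets):
            if adaptive:
                port = adaptive_port(depths)
            else:
                port = static_port(flow_id, num_uplinks)
            depths[port] += 1
    return depths


# Four equally sized flows (hypothetical names) sharing four uplinks.
flows = [("flow-a", 100), ("flow-b", 100), ("flow-c", 100), ("flow-d", 100)]
print("static  :", simulate(flows, 4, adaptive=False))
print("adaptive:", simulate(flows, 4, adaptive=True))
```

In this model the adaptive run always ends with an even split, while the static run depends on how the flow hashes happen to land: any two flows that hash to the same uplink leave another uplink idle.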

Why it matters

Adaptive routing can cause packet reordering, which interacts poorly with RoCE deployments that rely on in-order delivery for RDMA, and it may interfere with SHARP tree configurations. Improperly tuned thresholds can make traffic oscillate between paths, which increases effective latency rather than reducing it. In large-scale training, misconfigured adaptive routing shows up as jittery AllReduce completion times rather than as a consistent throughput degradation.
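
To see why switching paths per packet can reorder a flow, consider a minimal sketch in which packets are sprayed round-robin across paths with different latencies. Round-robin spraying, the latency values, and the function names are all illustrative assumptions, not a model of a specific switch.

```python
def arrival_order(num_packets: int, path_latency_us: list[float],
                  spacing_us: float = 1.0) -> list[int]:
    """Send packets round-robin over the paths; return sequence numbers
    in the order they arrive at the receiver."""
    arrivals = []
    for seq in range(num_packets):
        path = seq % len(path_latency_us)
        arrive_at = seq * spacing_us + path_latency_us[path]
        arrivals.append((arrive_at, seq))
    arrivals.sort()  # earliest arrival first; ties broken by sequence number
    return [seq for _, seq in arrivals]


def count_reordered(order: list[int]) -> int:
    """Count packets that arrive after a higher sequence number already has."""
    highest_seen = -1
    reordered = 0
    for seq in order:
        if seq < highest_seen:
            reordered += 1
        else:
            highest_seen = seq
    return reordered


same = arrival_order(8, [5.0, 5.0])     # both paths equally fast
skewed = arrival_order(8, [5.0, 8.0])   # second path 3 us slower
print(same, count_reordered(same))
print(skewed, count_reordered(skewed))
```

With equal path latencies the packets arrive in order. Once one path is slower than the packet spacing allows for, packets sent later on the fast path overtake earlier ones on the slow path, and a receiver that expects in-order delivery has to cope with the gap.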

How to monitor

Monitor per-port utilization on spine and leaf switches to confirm that traffic is distributed across uplinks as expected. Watch the variance of NCCL collective completion times: high jitter with a consistent average bandwidth often indicates path oscillation caused by aggressive adaptive routing thresholds. Factryze correlates collective timing jitter with fabric telemetry to identify adaptive routing configuration issues.
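
Both checks reduce to simple statistics once the samples are collected. The following is a minimal sketch, assuming completion times and per-uplink utilization have already been gathered by some other means; the function names and the 10% and 5% thresholds are illustrative assumptions, not values from NCCL or any switch vendor.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Population standard deviation divided by the mean (unitless jitter)."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


def uplink_imbalance(utilization: list[float]) -> float:
    """Busiest uplink relative to the mean; 1.0 means a perfectly even spread."""
    return max(utilization) / statistics.fmean(utilization)


def jitter_without_slowdown(times_ms: list[float], baseline_mean_ms: float,
                            cv_threshold: float = 0.10,
                            mean_tolerance: float = 0.05) -> bool:
    """Flag the pattern described above: high variance, unchanged average.

    Both thresholds are assumed defaults chosen for illustration only.
    """
    mean = statistics.fmean(times_ms)
    mean_is_stable = abs(mean - baseline_mean_ms) <= mean_tolerance * baseline_mean_ms
    return mean_is_stable and coefficient_of_variation(times_ms) > cv_threshold


steady = [10.0, 10.1, 9.9, 10.0]    # AllReduce completion times in ms
jittery = [7.0, 13.0, 7.5, 12.5]    # same mean, much wider spread
print(jitter_without_slowdown(steady, baseline_mean_ms=10.0))
print(jitter_without_slowdown(jittery, baseline_mean_ms=10.0))
print(uplink_imbalance([0.9, 0.1, 0.1, 0.1]))
```

A flagged run is only a hint to look at the fabric: the same jitter signature can have other causes, so it is worth checking it against the uplink imbalance figure and switch telemetry before changing thresholds.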

Figure: Adaptive Routing - Per-Packet Congestion Avoidance
