RoCE (RDMA over Converged Ethernet)
RDMA networking over Ethernet for GPU cluster communication.
What it is
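RoCE (RDMA over Converged Ethernet) is a network protocol that enables Remote Direct Memory Access over standard Ethernet infrastructure, providing low-latency, high-throughput GPU-to-GPU communication without InfiniBand hardware. RoCE v2 carries RDMA traffic over UDP/IP (destination port 4791), which makes it routable across Layer 3 networks. Because RDMA performance degrades sharply under packet loss, deployments typically pair RoCE v2 with Priority Flow Control (PFC), ECN-based congestion control such as DCQCN, or both, to approximate lossless behavior.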
Why it matters
RoCE is more sensitive to network congestion than InfiniBand because Ethernet is not inherently lossless, whereas InfiniBand uses credit-based link-level flow control to avoid dropping packets. Misconfigured PFC settings can trigger PFC storms that deadlock portions of the fabric, stalling every NCCL collective on the affected nodes at once. Even transient packet drops under bursty AllReduce traffic can cause NCCL timeouts that are easily mistaken for GPU failures.
How to monitor
Track PFC pause frame counts, ECN-marked packet rates, and dropped packet counts on both NICs and switch ports. Correlate packet drop events with NCCL timeout logs by timestamp. Factryze correlates NIC-level RoCE counters with NCCL debug logs to distinguish network congestion from GPU hardware failures.
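The monitoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not Factryze's implementation: it assumes counters arrive as "name: value" text (the shape `ethtool -S` prints), and the counter names in the sample snapshots are illustrative only, since real names vary by NIC vendor and driver. The function names, the keyword list, and the 30-second correlation window are all hypothetical choices.

```python
# Substrings used to pick out congestion-related counters.
# Assumption: real counter names differ per NIC vendor and driver.
WATCHED_KEYWORDS = ("pause", "ecn", "discard", "drop")


def parse_counters(text):
    """Parse 'name: value' lines into a dict, skipping non-numeric lines."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def counter_deltas(before, after):
    """Return the increase of each watched counter between two snapshots."""
    deltas = {}
    for name, new_value in after.items():
        if name in before and any(k in name for k in WATCHED_KEYWORDS):
            deltas[name] = new_value - before[name]
    return deltas


def congestion_signals(deltas, threshold=0):
    """Keep only counters that grew by more than the threshold."""
    return {name: d for name, d in deltas.items() if d > threshold}


def timeouts_near_network_events(event_times, timeout_times, window_s=30.0):
    """Return NCCL timeout timestamps that follow a network event
    (pause burst or drop) by at most window_s seconds."""
    return [
        t for t in timeout_times
        if any(0 <= t - e <= window_s for e in event_times)
    ]


# Two hypothetical snapshots of the same NIC, taken some seconds apart.
SNAPSHOT_T0 = """NIC statistics:
    rx_prio3_pause: 100
    tx_prio3_pause: 40
    rx_discards_phy: 0
    rx_packets: 1000
"""
SNAPSHOT_T1 = """NIC statistics:
    rx_prio3_pause: 5100
    tx_prio3_pause: 40
    rx_discards_phy: 12
    rx_packets: 9000
"""

if __name__ == "__main__":
    deltas = counter_deltas(parse_counters(SNAPSHOT_T0),
                            parse_counters(SNAPSHOT_T1))
    print(congestion_signals(deltas))
    # Timestamps are seconds since an arbitrary epoch.
    print(timeouts_near_network_events([100.0], [110.0, 500.0]))
```

A timeout that falls inside the window is a candidate for a network cause; one with no nearby pause or drop activity points back toward the GPU or host. In practice the threshold and window would need tuning per fabric, and switch port counters would be compared alongside the NIC counters.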