Networking

RoCE (RDMA over Converged Ethernet)

RDMA networking over Ethernet for GPU cluster communication.

What it is

RoCE (RDMA over Converged Ethernet) is a network protocol that enables Remote Direct Memory Access over standard Ethernet infrastructure, providing low-latency, high-throughput GPU-to-GPU communication without InfiniBand hardware. RoCE v2 encapsulates RDMA traffic in UDP/IP, which makes it routable across Layer 3 networks. Because Ethernet can drop packets under congestion, RoCE v2 deployments typically rely on Priority Flow Control (PFC), ECN-based congestion control, or both to approximate lossless behavior.

Why it matters

RoCE is more sensitive to network congestion than InfiniBand because Ethernet is not inherently lossless. Misconfigured PFC settings can trigger PFC storms that deadlock portions of the fabric, stalling all NCCL collectives across affected nodes simultaneously. Even transient packet drops under bursty AllReduce traffic can cause NCCL timeouts that appear as GPU failures.

How to monitor

Track PFC pause frame counts, ECN-marked packet rates, and dropped packet counts on both NIC and switch ports. Correlate packet drop events with NCCL timeout logs by timestamp. Factryze correlates NIC-level RoCE counters with NCCL debug logs to distinguish network congestion from GPU hardware failures.
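As a rough illustration of the counter tracking described above, the sketch below compares two snapshots of NIC statistics and totals the increase per signal type. It assumes the snapshots are text in the `name: value` form that `ethtool -S <iface>` prints. The counter names in `WATCH` and the sample values are hypothetical placeholders, since real names vary by vendor and driver; check what your NIC actually exposes before relying on them.

```python
import re


def parse_counters(text):
    """Parse 'name: value' lines into a dict; lines without a numeric value are skipped."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([\w.\-]+):\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def congestion_signals(before_text, after_text, watch):
    """Sum the per-category counter increase between two snapshots.

    `watch` maps a category label to the counter names that belong to it.
    Counters missing from either snapshot are ignored, and a counter that
    went backwards (for example after a driver reset) contributes zero.
    """
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    signals = {}
    for category, names in watch.items():
        total = 0
        for name in names:
            if name in before and name in after:
                total += max(0, after[name] - before[name])
        signals[category] = total
    return signals


# Hypothetical counter names, for illustration only.
WATCH = {
    "pfc_pause": ["rx_prio3_pause", "tx_prio3_pause"],
    "ecn_marked": ["ecn_marked_packets"],
    "drops": ["rx_discards_phy"],
}

# Made-up sample snapshots taken some interval apart.
SAMPLE_T0 = """NIC statistics:
     rx_prio3_pause: 100
     tx_prio3_pause: 40
     ecn_marked_packets: 7
     rx_discards_phy: 0
"""

SAMPLE_T1 = """NIC statistics:
     rx_prio3_pause: 350
     tx_prio3_pause: 40
     ecn_marked_packets: 12
     rx_discards_phy: 12
"""

if __name__ == "__main__":
    for category, delta in congestion_signals(SAMPLE_T0, SAMPLE_T1, WATCH).items():
        print(f"{category}: +{delta}")
```

In practice the two snapshots would be captured a fixed interval apart, and a nonzero `drops` delta that lines up with an NCCL timeout in the logs would point toward the network rather than the GPU.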

Figure: RoCE v2 - RDMA over Converged Ethernet
