RoCE vs InfiniBand
The choice between InfiniBand and RoCE is rarely about line rate: a ConnectX-7 running NDR InfiniBand and one running 400G Ethernet both deliver roughly 50 GB/s per port. The choice is about who owns the lossless property of the network and what they have to do to keep it that way.
Same verbs, different lossless story
Both protocols expose RDMA verbs at the top of the stack. An application calls ibv_post_send to enqueue a work request, the NIC consumes it and pushes the bytes onto the wire, and on the remote end the NIC writes them directly into the destination buffer (including GPU HBM, with GPUDirect RDMA). The verb layer is identical. The difference is what happens between the NIC and the wire.
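To make the "same verbs" point concrete, here is a minimal sketch of posting a one-sided RDMA write with libibverbs. It assumes an already-connected RC queue pair, a registered local buffer, and the peer's buffer address and rkey exchanged out of band; the function name and parameters are illustrative. Nothing in it changes between InfiniBand and RoCE v2.

```c
/* Minimal sketch: post a one-sided RDMA write with libibverbs.
 * Assumes a connected RC queue pair (qp), a registered local buffer
 * (local_mr), and the peer's remote_addr/rkey obtained out of band.
 * Identical on InfiniBand and RoCE v2: the verbs layer does not change. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,   /* one-sided: remote CPU never sees it */
        .send_flags = IBV_SEND_SIGNALED,   /* generate a completion on the local CQ */
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr = NULL;
    /* The NIC consumes the work request and DMAs the bytes to the wire. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```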
InfiniBand native runs on a dedicated IB link layer with credit-based flow control. Every packet sent decrements a credit at the sender; the receiver returns credits as it drains its buffers. The fabric never drops a packet for buffer reasons because the sender never sends without credit. This is a hardware-level guarantee that lives in the IB switch silicon and the IB NIC firmware. There is no software tuning for "lossless" because losslessness is a built-in property of the link layer.
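As a rough mental model of that credit accounting (purely illustrative; the real mechanism lives in switch silicon and NIC firmware, not software, and the names and sizes here are invented):

```c
/* Toy model of IB-style credit-based flow control. The sender may transmit
 * only while it holds credits; the receiver returns credits as it drains its
 * buffer, so nothing is ever dropped for lack of buffer space. */
#include <stdbool.h>

#define RX_BUFFER_SLOTS 8              /* hypothetical receiver buffer, in packets */

static int credits = RX_BUFFER_SLOTS;  /* sender-side credit counter */

bool try_send_packet(void) {
    if (credits == 0)
        return false;                  /* no credit: sender stalls, nothing is dropped */
    credits--;                         /* one credit consumed per packet on the wire */
    return true;
}

void on_receiver_drained_packet(void) {
    credits++;                         /* receiver frees a slot and returns a credit */
}
```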
RoCE v2 runs RDMA over UDP/IP over Ethernet. To make Ethernet behave like a lossless transport, RoCE relies on two layer-2 features. PFC (Priority Flow Control, 802.1Qbb) lets a switch send a "pause" frame upstream when its buffers fill, throttling the sender for a specific traffic class. ECN (Explicit Congestion Notification, RFC 3168) lets switches mark packets when queues build up; the receiver echoes the mark back to the sender, and the sender's RDMA congestion control (DCQCN or HPCC) reduces its rate. PFC stops drops; ECN stops standing queues. Together they approximate IB's lossless guarantee, but only if both are configured correctly across every switch hop.
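A simplified sketch of the sender-side reaction in a DCQCN-style scheme, assuming the receiver echoes ECN marks back as congestion notification packets (CNPs). The constants and function names are illustrative, not vendor defaults, and real implementations add byte counters and staged rate recovery.

```c
/* Simplified DCQCN-style reaction point (sender-side rate control).
 * Core idea only: multiplicative decrease on CNPs, gradual recovery otherwise. */

static double current_rate = 400e9 / 8;  /* bytes/s, start at line rate (400 Gb/s) */
static double target_rate  = 400e9 / 8;
static double alpha        = 1.0;        /* estimate of congestion severity */
static const double g      = 1.0 / 16;   /* alpha smoothing gain (illustrative) */

/* Called when the receiver echoes an ECN mark back as a CNP. */
void on_cnp_received(void) {
    alpha        = (1.0 - g) * alpha + g;                /* congestion seen: raise alpha */
    target_rate  = current_rate;                         /* remember where we were */
    current_rate = current_rate * (1.0 - alpha / 2.0);   /* multiplicative decrease */
}

/* Called periodically when no CNPs have arrived: recover toward the target. */
void on_recovery_timer(void) {
    alpha        = (1.0 - g) * alpha;                    /* congestion fading: decay alpha */
    current_rate = (current_rate + target_rate) / 2.0;   /* close the gap to the target */
}
```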
What goes wrong
The PFC + ECN configuration story is where RoCE deployments fail. PFC requires every switch and every NIC port in the path to agree on which priority class is lossless, which buffers serve that class, and what the buffer thresholds are. A misconfigured port can flood its neighbors with pause frames (a PFC storm), create circular pause dependencies that deadlock the fabric, or silently drop packets in what was supposed to be the lossless class. ECN requires every switch to agree on the marking thresholds and every NIC's congestion control algorithm to react to the marks consistently. Modern Ethernet switches expose these settings as named profiles (Mellanox SwitchX, Arista EOS), but the burden of maintaining them sits with the network team rather than the NIC vendor.
InfiniBand sidesteps all of this. The fabric is an appliance: you buy the switches and the cables and the NICs from one vendor (Mellanox/NVIDIA), wire them together, and the lossless property is a fact about the hardware. The operational cost is that you have one vendor and one set of tools (ibstat, ibtraffic, ibdiagnet); the operational benefit is that you spend that complexity budget once.
What the latency gap actually is
A tuned, well-configured RoCE fabric runs RDMA write latency at roughly 1.5 to 2 microseconds. A tuned IB fabric runs the same at roughly 1 microsecond. The gap is not the line rate; both are 50 GB/s per port at NDR/400G. The gap is the extra header processing in RoCE (UDP, IP, and Ethernet headers add bytes and parsing time) and the extra hop through any DSCP/PFC priority queue logic in the switches. For most ML workloads (large messages, bandwidth-bound), this latency delta is invisible. For latency-bound workloads (tensor parallel with frequent small messages, see tree-reduce latency), it is closer to a 1.5-2x penalty in the small-message regime.
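A back-of-the-envelope way to see where that regime boundary sits, using the round-number figures above (50 GB/s per port, roughly 1 microsecond of extra per-message latency on RoCE); the numbers are illustrative, not measurements:

```c
/* Crossover message size: below this, the per-message latency gap between
 * RoCE and IB dominates; well above it, both fabrics are bandwidth-bound. */
#include <stdio.h>

int main(void) {
    const double bandwidth   = 50e9;   /* bytes/s per port (400 Gb/s) */
    const double latency_gap = 1e-6;   /* ~1 us extra per message on RoCE */

    /* Message size whose serialization time equals the latency gap. */
    double crossover_bytes = latency_gap * bandwidth;   /* 50e9 * 1e-6 = 50 KB */

    printf("crossover ~= %.0f KB\n", crossover_bytes / 1e3);
    /* Messages much larger than ~50 KB: bandwidth-bound, the gap is invisible.
     * Messages much smaller: per-message latency dominates, so the 1 us vs
     * 1.5-2 us difference shows up as a 1.5-2x penalty. */
    return 0;
}
```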
What this means in practice
- Choose IB native when (1) you have a homogeneous Mellanox/NVIDIA fabric, (2) the operations team prefers an appliance over a configurable network, (3) latency-sensitive workloads matter. Most large training clusters run IB native.
- Choose RoCE v2 when (1) your network team already runs Ethernet at scale and has DCQCN/HPCC experience, (2) you need to share the fabric with non-RDMA traffic (compute Ethernet, storage Ethernet), (3) per-port cost matters more than latency. Hyperscalers commonly run RoCE because their network teams already operate massive Ethernet fabrics.
- Whichever you pick, GPUDirect RDMA works on both. The fabric layer below the verb layer is what you choose; the verb layer above is the same.
- Debug RoCE losslessness with switch counters: PFC pause frame counts, ECN marked packet counts, and most importantly any drop counter on the lossless class. Drops on the lossless class mean PFC is misconfigured somewhere (see the sketch after this list).
- Debug IB the appliance way: ibdiagnet end-to-end, ibstat per-port, and watch the switch port error counters. Both fabrics are well-instrumented; they are just instrumented differently.
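A hypothetical shape for that RoCE counter check. The struct fields and the polling mechanism are invented for illustration; real switches expose these counters via SNMP, gNMI, or vendor CLIs under their own names.

```c
/* Sketch of a lossless-class health check over counters polled from each hop. */
#include <stdbool.h>
#include <stdint.h>

struct lossless_class_counters {
    uint64_t pfc_pause_tx;   /* pause frames this port sent upstream */
    uint64_t pfc_pause_rx;   /* pause frames this port received */
    uint64_t ecn_marked;     /* packets marked CE in this class */
    uint64_t drops;          /* drops in the lossless class: should stay flat */
};

bool lossless_class_healthy(const struct lossless_class_counters *prev,
                            const struct lossless_class_counters *curr)
{
    /* Pause and ECN activity is normal under load; drops are not. Any increase
     * in the lossless-class drop counter means PFC is not actually protecting
     * that class somewhere on the path. */
    return curr->drops == prev->drops;
}
```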
The two protocols deliver the same RDMA bandwidth. They differ in who pays the operational cost to make that bandwidth lossless.
Updated 2026-05-10