Packet Drops
Network packets discarded in transit, usually a sign of congestion or hardware faults.
What it is
Packet drops are network packets discarded by switches or NICs because of buffer overflow, congestion, or hardware faults. In GPU cluster networking, drops typically occur in the queues of oversubscribed switch ports, where more traffic arrives than the port can forward, or in NIC receive queues when the host is slow to drain them.
Why it matters
In GPU clusters, packet drops are particularly damaging because NCCL collective operations stall until the lost data is retransmitted. Collectives are synchronous, so a stall on one link holds up every rank, turning a microsecond-scale network event into a millisecond-scale training delay. InfiniBand's credit-based flow control prevents a sender from transmitting unless the receiver has buffer space, which makes drops rare; any IB drop is a serious signal and usually points to a hardware fault. RoCE environments under bursty AllReduce traffic are much more susceptible and require careful PFC and ECN tuning.
How to monitor
For InfiniBand fabrics, track the PortXmitDiscards and PortRcvErrors counters with perfquery, which typically prints them as XmtDiscards and RcvErrors. For RoCE and Ethernet, track port discards via switch SNMP counters and NIC driver statistics (ethtool -S). Compare the times at which these counters increase with NCCL timeout timestamps to check whether drops precede stalls; a consistent match points toward causation but does not prove it on its own. Factryze correlates network drop counters with NCCL collective timing to identify whether drop events are causally linked to training stalls.
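As a rough illustration of the manual approach, the sketch below parses two samples of ethtool -S output, reports drop-style counters that grew between them, and pairs NCCL timeout timestamps with drop observations seen shortly before. The function names, the sample counter values, the substring list used to spot drop counters, and the 5-second window are all assumptions made for this example; counter names vary by NIC driver, so the matching would need adapting to a real fleet.

```python
import re

# Substrings that commonly appear in drop-style counter names.
# Assumption: real names differ per driver, so adjust this list.
DROP_HINTS = ("discard", "drop")

# Hypothetical samples in the layout printed by `ethtool -S <iface>`.
SAMPLE_BEFORE = "NIC statistics:\n     rx_packets: 1000\n     rx_discards_phy: 4\n"
SAMPLE_AFTER = "NIC statistics:\n     rx_packets: 2500\n     rx_discards_phy: 9\n"


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Lines look like "     rx_packets: 1000"; the header line has no number.
        match = re.match(r"\s*(\S.*?):\s*(-?\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def drop_deltas(previous, current):
    """Return drop-style counters that increased between two samples."""
    deltas = {}
    for name, value in current.items():
        if any(hint in name.lower() for hint in DROP_HINTS):
            # A counter missing from the previous sample counts as unchanged.
            grown = value - previous.get(name, value)
            if grown > 0:
                deltas[name] = grown
    return deltas


def timeouts_near_drops(drop_times, timeout_times, window_s=5.0):
    """Pair each timeout with drop observations up to window_s seconds before it."""
    pairs = []
    for t in timeout_times:
        nearby = [d for d in drop_times if 0 <= t - d <= window_s]
        if nearby:
            pairs.append((t, nearby))
    return pairs


before = parse_ethtool_stats(SAMPLE_BEFORE)
after = parse_ethtool_stats(SAMPLE_AFTER)
print(drop_deltas(before, after))
```

In practice the two samples would come from running ethtool -S on a fixed interval, and the timestamps would be taken from the sampling loop and from NCCL logs. A timeout that lands inside the window is only a candidate for further investigation, since unrelated drops and stalls can coincide.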
Related terms
RDMA networking over Ethernet for GPU cluster communication.
Collective communication failures in NVIDIA NCCL stalling distributed training.
The physical interconnect topology connecting all nodes in a cluster.