Networking

Packet Drops

Lost network packets indicating congestion or hardware errors.

What it is

Packet drops are network packets discarded by switches or NICs due to buffer overflow, congestion, or hardware faults. In GPU cluster networking, drops can occur at ingress queues on oversubscribed switch ports or on NIC receive queues when the host is slow to drain.
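The buffer-overflow case can be illustrated with a toy model. The sketch below is not how any particular switch or NIC manages its buffers; it assumes a single fixed-size FIFO with tail drop, and the names `simulate_tail_drop`, `drain_per_tick`, and `capacity` are hypothetical. It shows that once arrivals outpace the drain rate for long enough to fill the queue, the excess packets are discarded.

```python
def simulate_tail_drop(arrivals, drain_per_tick, capacity):
    """Toy tail-drop queue: returns (delivered, dropped, still_queued).

    arrivals       -- packets arriving in each tick (list of ints)
    drain_per_tick -- packets the receiver can remove per tick
    capacity       -- maximum packets the queue can hold
    """
    depth = 0
    delivered = 0
    dropped = 0
    for n in arrivals:
        # Accept only what fits; the rest is discarded (tail drop).
        accepted = min(n, capacity - depth)
        dropped += n - accepted
        depth += accepted
        # The receiver drains at a fixed rate.
        drained = min(depth, drain_per_tick)
        depth -= drained
        delivered += drained
    return delivered, dropped, depth


# A three-tick burst at 10 packets/tick into a 16-packet queue drained at 4/tick.
burst = simulate_tail_drop([10, 10, 10, 0, 0], drain_per_tick=4, capacity=16)
# The same receiver with arrivals matched to the drain rate.
steady = simulate_tail_drop([4, 4, 4, 4, 4], drain_per_tick=4, capacity=16)
print(burst, steady)
```

In this model the burst loses packets only in the third tick, after the queue has filled, which mirrors why short bursts are absorbed and sustained oversubscription is not.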

Why it matters

In GPU clusters, packet drops are particularly damaging because NCCL collective operations stall until retransmission completes, converting a microsecond-scale network event into a millisecond-scale training delay visible across all ranks. InfiniBand's credit-based flow control makes drops rare, so they usually indicate hardware faults; treat any IB drop as a serious signal. RoCE environments under bursty AllReduce traffic are much more susceptible and require careful PFC and ECN tuning.

How to monitor

For IB fabrics, track PortXmitDiscards and PortRcvErrors with perfquery. For RoCE and Ethernet, track port discards via switch SNMP counters and NIC driver statistics (ethtool -S). Compare the timestamps of drop-counter increments with NCCL timeout timestamps to check whether drops coincide with training stalls. Factryze correlates network drop counters with NCCL collective timing to help identify whether drop events are causally linked to those stalls.
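The monitoring steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Factryze's method: it assumes you have already captured two `ethtool -S` snapshots as text and extracted drop and NCCL timeout timestamps (in seconds) from your own logs. The function names, the `drop|discard` name filter, and the sample counter names are hypothetical; real counter names vary by NIC driver, and some relevant counters may not contain either word.

```python
import re

# Assumption: counters of interest have "drop" or "discard" in their name.
DROP_PATTERN = re.compile(r"drop|discard", re.IGNORECASE)


def parse_drop_counters(ethtool_text):
    """Extract drop-like counters from `ethtool -S` style 'name: value' lines."""
    counters = {}
    for line in ethtool_text.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        # Skip headers such as "NIC statistics:" and non-numeric values.
        if sep and value.isdigit() and DROP_PATTERN.search(name):
            counters[name] = int(value)
    return counters


def counter_deltas(before, after):
    """Return only the counters that increased between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


def stalls_near_drops(drop_times, stall_times, window_s):
    """Return stall timestamps that fall within window_s seconds after a drop."""
    return [
        s for s in stall_times
        if any(0 <= s - d <= window_s for d in drop_times)
    ]


# Hypothetical snapshots; real output has many more counters.
snap_a = "NIC statistics:\n     rx_packets: 1000\n     rx_dropped: 5\n     rx_discards_phy: 0\n"
snap_b = "NIC statistics:\n     rx_packets: 2000\n     rx_dropped: 9\n     rx_discards_phy: 0\n"

deltas = counter_deltas(parse_drop_counters(snap_a), parse_drop_counters(snap_b))
matched = stalls_near_drops([100.0, 250.0], [100.5, 180.0, 251.0], window_s=2.0)
print(deltas, matched)
```

A stall that falls inside the window is only a candidate for further investigation; a time match alone does not establish that the drop caused the stall.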

Figure: Packet Drops - PFC Storm Cascade Effect
