SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)
In-network compute for accelerating collective operations.
What it is
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is an NVIDIA Networking technology that offloads collective operations such as AllReduce to InfiniBand switches, so data is aggregated inside the network fabric rather than at the endpoints. It requires compatible InfiniBand switch firmware and a correctly configured aggregation tree.
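To make the idea of in-fabric aggregation concrete, here is a toy Python model of a two-level reduction tree. It is only an illustration of the data flow, not how SHARP is implemented on real switches: the function name `tree_allreduce`, the host names, and the host-to-leaf-switch mapping are all hypothetical.

```python
from typing import Dict, List


def tree_allreduce(
    host_vectors: Dict[str, List[float]],
    leaf_of: Dict[str, str],
) -> Dict[str, List[float]]:
    """Toy two-level tree AllReduce (sum).

    host_vectors maps a host name to its local vector.
    leaf_of maps a host name to the (hypothetical) leaf switch it hangs off.
    """
    # Stage 1: each leaf switch sums the vectors of its attached hosts.
    leaf_partials: Dict[str, List[float]] = {}
    for host, vec in host_vectors.items():
        leaf = leaf_of[host]
        if leaf not in leaf_partials:
            leaf_partials[leaf] = list(vec)
        else:
            leaf_partials[leaf] = [a + b for a, b in zip(leaf_partials[leaf], vec)]

    # Stage 2: the root switch sums the per-leaf partial results.
    total = [sum(column) for column in zip(*leaf_partials.values())]

    # Stage 3: the final result travels back down the tree to every host.
    return {host: list(total) for host in host_vectors}


hosts = {"h0": [1.0, 2.0], "h1": [3.0, 4.0], "h2": [5.0, 6.0], "h3": [7.0, 8.0]}
leaves = {"h0": "leaf0", "h1": "leaf0", "h2": "leaf1", "h3": "leaf1"}
result = tree_allreduce(hosts, leaves)
```

In this model each host sends its vector once and receives the reduced result once, and all of the summation happens at the (simulated) switches, which is the property that frees endpoint compute.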
Why it matters
SHARP can reduce AllReduce latency by up to 2x and frees GPU cycles that would otherwise be spent on reduction math, which directly improves training throughput for data-parallel workloads where AllReduce is on the critical path. When SHARP is misconfigured or a switch firmware upgrade breaks compatibility, NCCL silently falls back to endpoint-based reductions without raising an error, so degraded throughput is the only signal.
How to monitor
Verify that SHARP is active by running with NCCL_DEBUG=INFO and inspecting the NCCL logs, and check the subnet manager logs for SHARP tree resource allocation. After any upgrade, recheck IB switch firmware versions for SHARP compatibility. Factryze tracks NCCL collective throughput as a proxy for SHARP health and flags unexpected increases in inter-node AllReduce latency.
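Because a silent fallback shows up only as slower collectives, one simple proxy check is to compare recent AllReduce latency samples against a known-good baseline. The Python sketch below is a minimal illustration under stated assumptions, not Factryze's actual detection logic: the function name `allreduce_regressed`, the 1.5x `tolerance`, and the sample values are hypothetical, and the latencies are assumed to come from your own benchmark or job telemetry at a fixed message size and node count.

```python
from statistics import median
from typing import Sequence


def allreduce_regressed(
    baseline_us: Sequence[float],
    recent_us: Sequence[float],
    tolerance: float = 1.5,  # hypothetical threshold; tune for your fabric
) -> bool:
    """Return True if the recent median latency exceeds tolerance x baseline median.

    Medians are used so that a single noisy sample does not trigger an alert.
    """
    if not baseline_us or not recent_us:
        raise ValueError("both baseline and recent samples are required")
    return median(recent_us) > tolerance * median(baseline_us)


# Illustrative numbers only (microseconds), not measurements.
baseline = [100.0, 102.0, 98.0, 101.0]
after_upgrade = [205.0, 198.0, 210.0]
normal_run = [104.0, 99.0, 103.0]

fallback_suspected = allreduce_regressed(baseline, after_upgrade)
healthy = not allreduce_regressed(baseline, normal_run)
```

A regression flagged this way does not prove that SHARP is disabled; it is a prompt to recheck the NCCL logs and the switch firmware versions.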