PCIe Bandwidth
Measured data transfer rate between the GPU and host system over PCI Express, in GB/s.
What it is
PCIe bandwidth is the measured data transfer rate between the GPU and host system over PCI Express, tracked for transmit and receive separately via DCGM_FI_DEV_PCIE_TX_THROUGHPUT and DCGM_FI_DEV_PCIE_RX_THROUGHPUT (in KB/s). Theoretical maximums are 32 GB/s per direction for PCIe Gen4 x16 and 64 GB/s for Gen5 x16, with real-world throughput at approximately 25 GB/s and 50 GB/s respectively. PCIe bandwidth is critical for data loading pipelines, frequent host-to-GPU transfers, and checkpoint writes.
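The headline numbers follow directly from the per-lane signaling rate (16 GT/s for Gen4, 32 GT/s for Gen5) and the 128b/130b line encoding. A quick sketch of the arithmetic, in Python purely for illustration:

```python
# Back-of-the-envelope check of the per-direction figures quoted above.
# An x16 link carries: lanes * GT/s * (128/130 encoding efficiency) / 8 bytes per second.

def pcie_max_gbps(gt_per_lane: float, lanes: int = 16) -> float:
    """Theoretical per-direction bandwidth in GB/s for a PCIe link."""
    encoding_efficiency = 128 / 130  # 128b/130b line encoding (Gen3 and later)
    return gt_per_lane * lanes * encoding_efficiency / 8

print(f"Gen4 x16: {pcie_max_gbps(16):.1f} GB/s per direction")  # ~31.5 GB/s
print(f"Gen5 x16: {pcie_max_gbps(32):.1f} GB/s per direction")  # ~63.0 GB/s
```

The commonly quoted 32 GB/s and 64 GB/s round these figures up; the 25 GB/s and 50 GB/s real-world numbers additionally reflect protocol overhead and host-side memory bottlenecks.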
Why it matters
A sudden drop to exactly 50% of expected PCIe throughput is the signature of link width degradation from x16 to x8 -- a hardware fault from marginal connections, damaged traces, or riser card issues that generates no Xid error and no default DCGM alert. A data-loading pipeline that drops from 24 GB/s to 12 GB/s will starve the GPU, causing utilization to fall from 95% to 60% with no error message in any log. This is one of the most common silent hardware failures in dense GPU deployments.
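Because downtraining produces no error, the most direct confirmation is the negotiated link state itself. Below is a minimal sketch using the NVML Python bindings (pynvml); it assumes nvidia-ml-py is installed and simply compares each GPU's current link width against the maximum the device supports:

```python
# Flag PCIe link width downtraining (e.g. x16 negotiated down to x8), the failure
# mode that shows up as an exact 50% throughput drop with no error in any log.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        name = name.decode() if isinstance(name, bytes) else name  # older pynvml returns bytes

        cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
        cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)

        # Width is the reliable signal here: link *generation* legitimately drops at
        # idle for power saving, so comparing it against the maximum would false-positive.
        if cur_width < max_width:
            print(f"GPU {i} ({name}): downtrained to x{cur_width} (device supports x{max_width})")
        else:
            print(f"GPU {i} ({name}): link OK at Gen{cur_gen} x{cur_width}")
finally:
    pynvml.nvmlShutdown()
```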
How to monitor
Track DCGM_FI_DEV_PCIE_TX_THROUGHPUT and DCGM_FI_DEV_PCIE_RX_THROUGHPUT and compare against expected throughput for the negotiated link generation and width. Confirm negotiated width via nvidia-smi --query-gpu=pcie.link.width.current --format=csv. Factryze monitors PCIe throughput patterns and flags bandwidth degradation consistent with link width downtraining before it causes GPU starvation.
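As a rough sketch of that comparison, the snippet below (again pynvml) samples NVML's PCIe throughput counters and checks them against a ballpark ceiling for the negotiated link. The per-generation figures and the 50% threshold are illustrative assumptions, and the check is only meaningful while a transfer-heavy workload is actually running:

```python
# Sample PCIe TX/RX throughput (NVML reports KB/s over a short sampling window) and
# warn when the busier direction falls well below the negotiated link's ceiling.
import pynvml

# Assumed achievable per-direction GB/s for an x16 link, scaled by actual width below;
# these mirror the real-world figures quoted above and are not tuned values.
ACHIEVABLE_X16_GBPS = {3: 12.0, 4: 25.0, 5: 50.0}

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
        expected = ACHIEVABLE_X16_GBPS.get(gen, 25.0) * width / 16

        tx_kbps = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
        rx_kbps = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
        busiest_gbps = max(tx_kbps, rx_kbps) / 1e6  # KB/s -> GB/s

        print(f"GPU {i}: Gen{gen} x{width}, TX {tx_kbps / 1e6:.2f} GB/s, RX {rx_kbps / 1e6:.2f} GB/s")
        if 0 < busiest_gbps < 0.5 * expected:
            print(f"  warning: below half of the ~{expected:.0f} GB/s this link should sustain")
finally:
    pynvml.nvmlShutdown()
```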
DCGM fields: DCGM_FI_DEV_PCIE_TX_THROUGHPUT / DCGM_FI_DEV_PCIE_RX_THROUGHPUT

Related terms
PCIe: The host bus connecting GPUs to CPUs and other system devices.
Xid 79 error: GPU completely disconnects from the PCIe bus.
GPU monitoring: Continuous tracking of GPU health, thermals, errors, and performance metrics.