Silent GPU Failures: ECC Errors, Thermal Throttling, and How to Detect Them
GPUs fail silently - degraded performance, rising ECC errors, thermal throttling that goes unnoticed. Learn the warning signs and how to catch them before they crash your training.
GPUs do not fail like CPUs. A CPU either works or it does not. GPUs degrade - slowly, silently, over days or weeks - in ways that corrupt training results, reduce throughput by 30%, or cause a 256-GPU job to crash at hour 47 of a 48-hour run. Silent GPU failures are the most expensive kind of failure because you pay for the GPU time, get bad results, and only discover the problem after the damage is done. Detecting ECC errors, thermal throttling, PCIe degradation, and NVLink errors before they become catastrophic is the difference between a 2-minute automated remediation and a week of debugging corrupted model weights.
This guide covers the four categories of silent GPU failures, the specific DCGM metrics that expose each one, the threshold values that should trigger action, and a real-world scenario showing how a single degrading GPU can silently ruin a large training run.
What Makes a GPU Failure Silent?
A silent failure is any GPU degradation that does not generate an obvious error, crash the running process, or trigger a standard monitoring alert. The GPU continues to execute kernels, the process stays alive, and DCGM_FI_DEV_GPU_UTIL continues to report 95%. But the work being done is either slower than expected, producing incorrect results, or accumulating damage that will eventually cause a hard crash.
Silent failures are dangerous because they exist in the gap between "working" and "broken." Traditional GPU monitoring with static threshold alerts catches hard failures (GPU fell off the bus, driver crash, OOM kill). It does not catch the GPU that is running 15% slower than its peers, or the GPU whose memory is silently flipping bits at a rate that will trigger a double-bit error within 48 hours.
There are four primary categories of silent GPU failure, each with distinct metrics, signatures, and timelines.
How Do ECC Errors Cause Silent GPU Failures?
ECC (Error-Correcting Code) errors are the most predictive signal for imminent GPU hardware failure. GPU HBM memory cells degrade over time, and the ECC circuitry catches and corrects single-bit errors (SBE) before they affect computation. The correction is invisible to the application - the GPU returns the right result. But each corrected error represents a physical memory cell that is failing.
The Degradation Timeline
ECC error accumulation follows a predictable pattern:
- Early stage (weeks to months before failure): Occasional single-bit errors, 0-3 per day. This is normal for large HBM arrays and does not indicate a problem.
- Acceleration stage (days to weeks before failure): SBE rate increases to 10-50 per day, concentrated on specific memory rows. Row remapping activates to swap faulty rows for spares.
- Critical stage (hours to days before failure): SBE rate exceeds 50 per day with bursts. Row remapping spares are exhausted (DCGM_FI_DEV_ROW_REMAP_FAILURE = 1). Page retirement activates.
- Failure: A double-bit error (DBE) occurs on a memory cell that ECC cannot correct. Xid 48 fires. The CUDA context is poisoned. The training job crashes.
DCGM Metrics to Watch
| Metric | Normal Value | Warning Threshold | Critical Threshold |
|---|---|---|---|
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | 0-3 per day | >10 per day sustained | >50 per day |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | 0 | >0 (any DBE is critical) | Immediate drain |
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | 0-5 lifetime | >10 lifetime | >20 lifetime |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | 0 | >0 | Immediate drain |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | 0 | 1 (spares exhausted) | Immediate drain |
| DCGM_FI_DEV_RETIRED_SBE + DCGM_FI_DEV_RETIRED_DBE | 0-10 lifetime | >30 total | >60 total (RMA) |
Why This Fails Silently
Single-bit errors are corrected by hardware. The application never sees them. Training accuracy is not affected. DCGM_FI_DEV_GPU_UTIL stays at 95%. Nothing in application logs suggests a problem. The only signal is the ECC counter incrementing in DCGM telemetry, and if your monitoring stack does not track ECC counter rates (not just absolute values), you will not notice until the inevitable DBE crashes the job.
The critical distinction is between count and rate. A GPU with 40 lifetime SBEs accumulated over 18 months is likely stable. A GPU that accumulated 40 SBEs in the last 48 hours is actively failing. Monitoring systems that only alert on absolute thresholds miss the rate signal entirely.
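To make the count-versus-rate distinction concrete, here is a minimal Python sketch that samples the volatile SBE counter twice through nvidia-smi and converts the delta into a per-day rate, using the thresholds from the table above. It assumes nvidia-smi is on PATH and that ECC reporting is enabled (the field reads [N/A] otherwise); a production system would compute the same rate from DCGM counters in Prometheus rather than polling.

```python
import subprocess
import time

SAMPLE_SECONDS = 3600  # one-hour window; shorten for a quick test

def read_sbe_totals():
    """Return {gpu_index: volatile corrected-ECC total} via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,ecc.errors.corrected.volatile.total",
         "--format=csv,noheader,nounits"],
        text=True)
    return {int(i): int(v) for i, v in
            (line.split(", ") for line in out.strip().splitlines())}

before = read_sbe_totals()
time.sleep(SAMPLE_SECONDS)
after = read_sbe_totals()

for gpu, start in before.items():
    # Extrapolate the sampled delta to a per-day rate.
    per_day = (after[gpu] - start) * (86400 / SAMPLE_SECONDS)
    if per_day > 50:
        print(f"GPU {gpu}: CRITICAL - ~{per_day:.0f} SBE/day, schedule a drain")
    elif per_day > 10:
        print(f"GPU {gpu}: WARNING - ~{per_day:.0f} SBE/day, watch remapped rows")
```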
How Does Thermal Throttling Silently Degrade GPU Performance?
Thermal throttling is the GPU firmware's self-preservation mechanism. When the die temperature exceeds the slowdown threshold (typically 83 degrees Celsius for data center GPUs like A100 and H100), the firmware progressively reduces SM clock and memory clock frequencies to bring temperature under control. The reduction is proportional to how far the temperature exceeds the threshold.
The Performance Impact
The throughput impact of thermal throttling is progressive and significant:
- 83-85 degrees Celsius: SM clock drops 100-200 MHz (5-10% throughput reduction)
- 85-88 degrees Celsius: SM clock drops 200-400 MHz (10-20% throughput reduction)
- 88-90 degrees Celsius: SM clock drops 400+ MHz (20-35% throughput reduction)
- Above 92 degrees Celsius: Thermal shutdown. GPU powers off entirely.
A GPU throttling at 86 degrees delivers roughly 90% of its rated throughput. In a data-parallel training job, this GPU becomes the straggler that forces every other GPU in the job to wait at every AllReduce synchronization point. A single throttling GPU in a 128-GPU job can reduce effective throughput of the entire job by 10%.
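The straggler arithmetic is worth spelling out: in synchronous data parallelism, every step ends at a collective, so the job's step time is the maximum of the per-rank step times, not the average. A toy illustration:

```python
# Synchronous data parallelism: every rank waits at the AllReduce,
# so job step time is set by the slowest GPU, not the average.
step_times_ms = [100.0] * 127 + [110.0]  # 127 healthy GPUs, one 10% slower

job_step_ms = max(step_times_ms)                        # 110.0 - the straggler wins
mean_step_ms = sum(step_times_ms) / len(step_times_ms)  # ~100.1 - misleadingly healthy

slowdown = job_step_ms / min(step_times_ms) - 1
print(f"Whole-job slowdown from one GPU: {slowdown:.0%}")  # 10%
```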
DCGM Metrics to Watch
| Metric | Normal Value | Warning Threshold | Critical Threshold |
|---|---|---|---|
| DCGM_FI_DEV_GPU_TEMP | 65-78 C under load | >83 C sustained | >88 C |
| DCGM_FI_DEV_MEMORY_TEMP | 70-85 C under load | >90 C sustained | >95 C |
| DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | 0 | SW thermal slowdown flag set | HW thermal slowdown flag set |
| DCGM_FI_DEV_SM_CLOCK | Near boost clock | >10% below boost | >25% below boost |
| DCGM_FI_DEV_FAN_SPEED | 40-70% under load | >85% sustained | >95% or stuck at 0% |
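On a node you suspect, the throttle flags can be read directly. A minimal sketch using nvidia-smi query fields (field names as listed by nvidia-smi --help-query-gpu; older drivers may not expose all of them):

```python
import subprocess

# Read thermal throttle flags and current vs. max SM clock for each GPU.
FIELDS = ("index,temperature.gpu,clocks.sm,clocks.max.sm,"
          "clocks_throttle_reasons.sw_thermal_slowdown,"
          "clocks_throttle_reasons.hw_thermal_slowdown")

out = subprocess.check_output(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
    text=True)

for line in out.strip().splitlines():
    idx, temp, sm, sm_max, sw_th, hw_th = [f.strip() for f in line.split(",")]
    clock_deficit = 1 - int(sm) / int(sm_max)
    # Flag GPUs with an active thermal slowdown flag or a clock >10% below max.
    if "Active" in (sw_th, hw_th) or clock_deficit > 0.10:
        print(f"GPU {idx}: {temp} C, SM {sm}/{sm_max} MHz "
              f"({clock_deficit:.0%} below max), sw_thermal={sw_th}, hw_thermal={hw_th}")
```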
Why This Fails Silently
Thermal throttling does not generate an error. It does not crash the process. It does not produce an Xid event in dmesg. DCGM_FI_DEV_GPU_UTIL continues to report 95-100% because the GPU is still busy executing kernels, just at a lower clock rate. The only visible symptom in application-level metrics is slower step times - a 10% increase in time per training step - which many teams attribute to data loading variability, network congestion, or normal noise.
The insidious pattern is intermittent thermal throttling that correlates with environmental conditions. A cluster that runs at full speed during night hours but throttles 5-15% during afternoon peak ambient temperature has a cooling infrastructure problem, not a GPU problem. But the symptom looks like random training slowdowns.
Peer Comparison: The Detection Key
The most reliable way to catch thermal throttling is not by monitoring absolute temperature. It is by comparing each GPU against its peers in the same node and the same workload. If GPUs 0-3 in a DGX node run at 78 degrees and GPUs 4-7 run at 87 degrees under the same workload, the rear GPUs have an airflow problem regardless of whether 87 degrees triggers a static alert threshold. This peer comparison approach catches problems that absolute thresholds miss.
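Here is a minimal sketch of that idea, with temperatures hard-coded for illustration: flag any GPU a fixed margin above its node's median rather than above an absolute limit. One caveat on the baseline choice - if several GPUs drift together, as in the rear-of-chassis case above, the raw median shifts too, so a trimmed baseline or comparison against the coolest peer group is more robust.

```python
from statistics import median

def thermal_outliers(temps_c, margin_c=5.0):
    """Flag GPUs running hotter than their node-local peers.
    temps_c maps gpu_index -> die temperature under the same workload."""
    baseline = median(temps_c.values())
    return {gpu: t for gpu, t in temps_c.items() if t - baseline > margin_c}

# GPU 5 runs 9 C hotter than its siblings under an identical workload:
# anomalous even though 87 C never crosses a typical 90 C static alert.
temps = {0: 78, 1: 77, 2: 79, 3: 78, 4: 78, 5: 87, 6: 79, 7: 78}
print(thermal_outliers(temps))  # {5: 87}
```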
How Does PCIe Degradation Silently Cut GPU Bandwidth?
PCIe link degradation is one of the stealthiest hardware failures in GPU infrastructure. When a GPU's PCIe link trains down from its rated width (typically x16) to a lower width (x8 or x4), the bandwidth between the GPU and the host CPU drops by 50% or 75%, halving or quartering the effective transfer rate for every CPU-to-GPU operation.
How PCIe Degradation Happens
PCIe link width is negotiated during system boot and can be renegotiated during runtime if the physical layer detects signal integrity problems. Common causes include:
- Marginal physical connections (GPU not fully seated in the PCIe slot)
- Damaged PCIe traces on the motherboard or riser card
- Oxidation or contamination on PCIe connector fingers
- Cable damage in riser card configurations
- Electromagnetic interference from adjacent components
When the PCIe physical layer detects errors, it attempts to retrain the link. Each retraining attempt is logged as a PCIe replay event (DCGM_FI_DEV_PCIE_REPLAY_COUNTER). If retraining at the current width fails, the link falls back to a narrower width.
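Checking for a trained-down link is scriptable. A sketch comparing the live width and generation against the link's maximum, using nvidia-smi query fields - with one caveat: many GPUs legitimately downshift generation at idle to save power, so run the check under load:

```python
import subprocess

# Compare live PCIe generation/width against the link's maximum.
out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,"
     "pcie.link.width.current,pcie.link.width.max",
     "--format=csv,noheader,nounits"],
    text=True)

for line in out.strip().splitlines():
    idx, gen_cur, gen_max, w_cur, w_max = [f.strip() for f in line.split(",")]
    if w_cur != w_max or gen_cur != gen_max:
        # An x16 link running at x8 has lost half its host bandwidth.
        print(f"GPU {idx}: degraded link - Gen{gen_cur} x{w_cur} "
              f"(max Gen{gen_max} x{w_max}); drain, reseat, inspect risers")
```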
DCGM Metrics to Watch
| Metric | Normal Value | Warning Threshold | Critical Threshold |
|---|---|---|---|
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | 0 | >100 per hour | >1000 per hour |
| DCGM_FI_PROF_PCIE_TX_BYTES / DCGM_FI_PROF_PCIE_RX_BYTES | Expected for link width | <50% of expected | <25% of expected |
Why This Fails Silently
PCIe link degradation generates no Xid error, no CUDA error, and no application-visible failure. The GPU continues to operate at full compute capacity - it simply cannot transfer data to and from the host at full speed. For workloads with minimal CPU-GPU data transfer (inference with preloaded model weights), the impact may be negligible. For workloads with heavy data loading (training with large datasets) or frequent checkpointing, PCIe bandwidth is on the critical path and a 50% reduction causes the GPU to starve for data.
The signature in monitoring data is GPU utilization that drops from 95% to 50-60% with periodic dips to 0% as the GPU waits for data. Without PCIe bandwidth monitoring, this pattern is indistinguishable from a slow storage system or an undersized DataLoader.
How Do NVLink Errors Silently Slow Distributed Training?
NVLink provides the high-bandwidth interconnect between GPUs within a node (and across nodes in NVLink Network topologies). NVLink 4.0 on H100 delivers 900 GB/s of total bidirectional bandwidth per GPU, aggregated across 18 links. When NVLink connections degrade, multi-GPU communication slows down, and distributed training that depends on AllReduce and other collective operations takes a direct throughput hit.
The NVLink Degradation Sequence
NVLink failures follow a progression similar to ECC errors:
- CRC errors: The physical layer detects corrupted data in NVLink packets. These are corrected by retransmission but consume bandwidth and add latency.
- Replay events: The link protocol retransmits corrupted packets. An increasing replay rate indicates worsening signal integrity.
- Recovery events: The link retrains after persistent errors. During recovery, the link is down for milliseconds, causing NCCL operations to stall.
- Link failure: The link goes down permanently. The GPU can no longer communicate with its peers over the affected NVLink connection. Xid 74 fires.
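The per-link counters behind this progression are readable through NVML. Here is a sketch using the pynvml bindings (pip install nvidia-ml-py); the counter enums below are NVML constants, and on some newer GPU and driver combinations these counters may only be surfaced through DCGM fields instead:

```python
import pynvml

# Dump nonzero NVLink error counters for every GPU and link.
pynvml.nvmlInit()
COUNTERS = {
    "crc_flit": pynvml.NVML_NVLINK_ERROR_DL_CRC_FLIT,
    "replay":   pynvml.NVML_NVLINK_ERROR_DL_REPLAY,
    "recovery": pynvml.NVML_NVLINK_ERROR_DL_RECOVERY,
}
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            errs = {name: pynvml.nvmlDeviceGetNvLinkErrorCounter(handle, link, c)
                    for name, c in COUNTERS.items()}
        except pynvml.NVMLError:
            continue  # link not present on this GPU
        if any(errs.values()):
            print(f"GPU {i} link {link}: {errs}")
pynvml.nvmlShutdown()
```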
DCGM Metrics to Watch
| Metric | Normal Value | Warning Threshold | Critical Threshold |
|---|---|---|---|
| DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL | 0 | >100 per hour | >1000 per hour |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Near rated bandwidth | <80% of rated | <60% of rated |
Why This Fails Silently
Like PCIe degradation, NVLink errors at low rates are corrected by the link protocol and never surface to the application. The GPU continues computing, NCCL collectives continue completing, and training continues progressing. The only symptom is a subtle increase in AllReduce latency that manifests as a 3-8% increase in training step time.
The dangerous scenario is a single degraded NVLink in a ring AllReduce topology. Because ring AllReduce sends data through every GPU in the ring sequentially, the slowest link determines the throughput of the entire ring. A 20% bandwidth reduction on one NVLink out of 16 in the ring slows down AllReduce for all 8 GPUs in the node.
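The slowest-link effect falls straight out of the standard ring AllReduce cost model: each rank moves 2(p-1)/p of the payload through its ring hop, so the slowest hop gates every rank. A sketch with illustrative per-hop bandwidth figures:

```python
def ring_allreduce_seconds(payload_gb, num_gpus, hop_gbps):
    """Bandwidth term of the standard ring AllReduce cost model
    (latency ignored): 2*(p-1)/p of the payload crosses the slowest hop."""
    return 2 * (num_gpus - 1) / num_gpus * payload_gb / min(hop_gbps)

healthy  = [50.0] * 8            # illustrative effective GB/s per ring hop
degraded = [50.0] * 7 + [40.0]   # one hop degraded by 20%

t0 = ring_allreduce_seconds(1.0, 8, healthy)
t1 = ring_allreduce_seconds(1.0, 8, degraded)
print(f"AllReduce slowdown from one bad link: {t1 / t0 - 1:.0%}")  # 25%
```

One hop at 80% bandwidth makes the whole collective 25% slower, which is why a single marginal NVLink shows up as a cluster-wide regression.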
What Does a Real-World Silent GPU Failure Look Like?
Here is how these failure modes combine to silently destroy a training run.
Day 0: A 256-GPU training job starts on 32 nodes. All GPUs pass DCGM Level 1 health checks. Training begins normally at 95% utilization across all GPUs.
Day 1: GPU 5 on node 17 starts accumulating single-bit ECC errors at 15 per hour - 10x its historical rate. The ECC circuitry corrects every error. No application impact. No alert fires because the monitoring system tracks absolute ECC counts, not rates.
Day 2: Row remapping activates on GPU 5, swapping 3 faulty HBM rows for spares. The row remapping itself causes a brief stall that adds 50ms to one training step. This is lost in the noise of step time variance across 256 GPUs. SBE rate increases to 40 per hour.
Day 3: The degrading HBM stack on GPU 5 causes increased heat output from error correction activity. GPU 5 temperature rises to 86 degrees Celsius - 6 degrees hotter than its peers. Thermal throttling engages, reducing SM clock from 1980 MHz to 1820 MHz. GPU 5 is now 8% slower than every other GPU. Every AllReduce across all 256 GPUs now takes 8% longer because every GPU waits for GPU 5. Effective cluster throughput drops from 95% to 87%. No alert fires because 86 degrees is below the default 90-degree threshold.
Day 4: The first double-bit error (DBE) occurs on GPU 5 at 2:47 AM. Xid 48 fires. The CUDA context on GPU 5 is poisoned. The training process on that rank crashes. Because this is a synchronized data-parallel job, all 255 other GPUs stall at the next AllReduce barrier with an NCCL timeout after 180 seconds. All 256 GPUs are effectively offline.
The cost: 4 days of training on 256 H100 GPUs, wasted. At a conservative estimate of $3 per GPU-hour, that is 4 days x 24 hours x 256 GPUs x $3 = approximately $74,000 in GPU time. Plus the human time to debug, identify the faulty GPU, drain the node, restart from the last checkpoint, and hope the checkpoint from day 2 is still valid.
What autonomous detection catches: On Day 1, the ECC rate anomaly on GPU 5 - 15 SBEs per hour versus a baseline of 1-2 per hour - triggers an investigation. The NOC agent correlates rising SBE rate with row remapping activity and flags GPU 5 for proactive drain. The job is rescheduled on a healthy GPU during the next checkpoint boundary. Total impact: one job rescheduling delay of 10-15 minutes instead of 4 days of wasted compute.
How Do You Build a Silent GPU Failure Detection Strategy?
Detecting silent failures requires three capabilities that standard threshold monitoring does not provide.
Rate-Based Alerting
Track the rate of change of health indicators, not just their absolute values. An ECC counter that went from 10 to 50 in 24 hours is far more alarming than one that went from 0 to 50 over 18 months. Use Prometheus rate() and increase() functions to compute per-hour error rates from monotonic DCGM counters.
Peer Comparison
Compare each GPU against its siblings in the same node, in the same job, and in the same workload class. Deviations from peer behavior catch problems that absolute thresholds miss: a GPU running 5 degrees hotter, 3% slower, or accumulating errors 10x faster than its peers is exhibiting anomalous behavior regardless of whether any individual metric crosses a threshold.
Multi-Signal Correlation
The most dangerous silent failures combine multiple weak signals. Rising ECC rate plus elevated temperature plus slight NVLink bandwidth reduction, individually within normal ranges, collectively tell the story of a GPU on a degradation path. Correlating these signals requires either an experienced SRE who checks all of them simultaneously (unlikely at 3 AM), or an autonomous agent that does it continuously.
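A toy version of what that correlation looks like in code: score each GPU by combining weak signals that are each below their individual alert thresholds, and act on the combined score. The weights and normalizers here are illustrative assumptions, not tuned values:

```python
def degradation_score(gpu):
    """Combine sub-threshold signals into one score in [0, 1].
    Weights are illustrative; a real system learns or tunes them."""
    score  = min(gpu["sbe_per_day"] / 10, 1.0) * 0.4                   # rising ECC rate
    score += min(max(gpu["temp_delta_c"], 0) / 5, 1.0) * 0.3           # hotter than peers
    score += min(max(1 - gpu["nvlink_bw_ratio"], 0) / 0.2, 1.0) * 0.3  # mild bandwidth loss
    return score

# Each signal is individually "normal", yet together they score 0.68.
gpu5 = {"sbe_per_day": 8, "temp_delta_c": 4, "nvlink_bw_ratio": 0.92}
print(f"degradation score: {degradation_score(gpu5):.2f}")  # 0.68
```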
Factryze's NOC Agent implements all three detection strategies, ingesting DCGM telemetry, Xid kernel events, and NVLink health data into a continuous anomaly detection system that catches silent failures in their early stages. When a silent failure is detected, the SRE Agent executes the appropriate runbook - drain, reset, power cap, or RMA escalation - compressing MTTR from hours of manual investigation to minutes of automated response.
To learn how autonomous agents handle each failure type, explore our pricing plans or contact our team for a cluster assessment.
Frequently Asked Questions
How many ECC errors per day are normal for a healthy GPU?
For a healthy GPU, 0-3 single-bit ECC errors per day is within normal operating range for large HBM arrays. The key metric is not the absolute count but the rate of change. A sudden jump from 1 SBE per day to 20+ SBEs per day indicates active memory cell degradation and should trigger investigation. Any double-bit error (DBE) is an immediate critical event regardless of count.
Can thermal throttling damage a GPU permanently?
Thermal throttling itself is a protective mechanism that prevents damage. The GPU firmware reduces clock speeds specifically to keep temperatures within safe limits. However, the conditions that cause sustained thermal throttling (failed fans, blocked airflow, inadequate cooling infrastructure) can contribute to accelerated component aging if left unaddressed. The real concern is not the throttling - it is the throughput loss and the straggler effect on distributed training.
How do I check if a GPU's PCIe link has degraded?
Run nvidia-smi -q -d PCIE to see the current PCIe link generation and width versus the maximum supported. If "Current" shows Gen4 x8 while "Maximum" shows Gen4 x16, the link has degraded. In DCGM, monitor DCGM_FI_PROF_PCIE_TX_BYTES and DCGM_FI_PROF_PCIE_RX_BYTES and compare against the expected throughput for your link configuration. A sudden 50% drop in PCIe throughput is the characteristic signature of an x16-to-x8 link degradation.
Should I replace a GPU that shows correctable ECC errors but no uncorrectable errors?
Not necessarily based on correctable errors alone. The decision should be based on the rate of error accumulation and the state of the GPU's repair mechanisms. If the SBE rate is stable and low (under 5 per day), the GPU is operating normally. If the SBE rate is accelerating (doubling week over week), row remapping is consuming spare rows, and page retirement is active, the GPU is on a degradation path and should be scheduled for replacement before the inevitable first DBE. Check DCGM_FI_DEV_ROW_REMAP_FAILURE - if this reads 1 (spare rows exhausted), replacement should be prioritized.
Monitor your GPU cluster with Factryze
Deploy autonomous agents that detect, diagnose, and optimize GPU infrastructure - in under 5 minutes.