
Silent Data Corruption

Faults that produce wrong results without raising any alarm. No Xid, no ECC counter, no error log. The output is just incorrect, and you only find out if you go looking.
Signal: None
Detect: Replication
Field rate: ~1 / 1,000 GPU-yr

Every other failure mode in this chapter announces itself. ECC counters spike, kernel logs emit Xid codes, NVLink retry storms saturate counters, GPUs fall off the bus and processes exit. Silent data corruption is the failure that does none of this. The GPU answers the question. The answer is wrong.

The detection spectrum

Fault class              Signal   Detection
GPU off the bus          loud     process exits within seconds
Uncorrectable ECC        loud     kernel halts the offending op
NVLink retry storm       loud     DCGM counter saturates
Thermal throttle         quiet    P99 drift, no log line
Correctable ECC burst    quiet    counter visible, no fault
Silent data corruption   silent   no signal at all

Loud faults are easy. Silent faults need replication or numerical sanity checks.

Failures sit on a spectrum from loud to silent. Loud failures are cheap to handle: the operating system gets a signal, the runbook fires, the bad node is drained. Quiet failures take more care: someone has to be watching the right counter. Silent failures are categorically different. There is no signal to watch. The corruption can sit in a model checkpoint for weeks before the divergence shows up downstream as wrong loss curves or unstable inference.

The reason SDC is a distinct category, not just "ECC failed harder," is that it bypasses the error-detection circuitry entirely. A bit flip in a flip-flop on the silicon path between two functional units can produce an arithmetically wrong result that never touches DRAM, never raises a parity event, and never crosses an ECC boundary. The output is structurally correct (no NaN, no infinity) and statistically plausible (no obvious outlier). Wrong, and undetectable from the result alone.
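A purely numerical illustration of "wrong but valid" (this models only the arithmetic consequence, not where the flip physically occurs; the helper name and test values are made up for the example):

  import struct

  def flip_bit(x: float, bit: int) -> float:
      """Flip one bit in the IEEE-754 float32 encoding of x and decode it back."""
      (encoded,) = struct.unpack("<I", struct.pack("<f", x))
      (decoded,) = struct.unpack("<f", struct.pack("<I", encoded ^ (1 << bit)))
      return decoded

  value = 0.7311  # a typical activation or gradient magnitude
  for bit in (3, 10, 17):  # low-to-mid mantissa bits of a float32
      print(f"bit {bit:2d}: {value} -> {flip_bit(value, bit)}")
  # Every result is finite and the right order of magnitude: no NaN, no inf,
  # nothing an outlier filter would catch.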

Field rates from large operators

Public data is sparse, but the headline numbers are consistent. Meta's at-scale SDC study (published from 2021 onwards) puts the rate of silent corruption events at roughly one per 1,000 GPU-years, with significant per-chip variance: most GPUs never produce one, while a small tail produces several. Google, Microsoft, and the major HPC labs report comparable numbers when they publish them at all.

At first glance, "one per 1,000 GPU-years" sounds rare. At fleet scale it is not:

expected_SDC_per_year ≈ N / 1000
 
  N = 1024  →  ~1 SDC event per year
  N = 8192  →  ~8 SDC events per year
  N = 25000 →  ~25 SDC events per year

Frontier-scale training runs span multiple weeks at fleets above 8K GPUs. Multiple SDC events per training run are the expectation, not the exception.
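The same arithmetic per run rather than per year, as a quick sketch (the rate is the figure quoted above; the fleet sizes and run lengths are illustrative):

  RATE_PER_GPU_YEAR = 1 / 1000  # ~1 silent corruption event per 1,000 GPU-years

  def expected_sdc_events(num_gpus: int, run_days: float) -> float:
      """Expected number of SDC events over one training run."""
      return num_gpus * (run_days / 365.0) * RATE_PER_GPU_YEAR

  for gpus, days in [(1024, 60), (8192, 60), (25_000, 60)]:
      print(f"{gpus:>6} GPUs x {days} days -> ~{expected_sdc_events(gpus, days):.1f} events")

Per this arithmetic, a two-month run at 8K GPUs already expects at least one event, and a 25K-GPU run expects several.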

What happens when an SDC hits training

The most common failure pattern: a single matmul produces a wrong gradient on rank K. The wrong gradient enters the all-reduce. Every other rank receives the corrupted average. The optimizer applies the corrupted gradient. The model parameters drift slightly off the true minimum. Loss does not spike (the corruption is small relative to the loss surface). Training continues. Convergence eventually slows or destabilizes hours or days later, far from the original event.

By the time the symptom shows, every checkpoint after the corruption is contaminated. Recovery means going back to the last known-good checkpoint, which may be hours or longer behind, and replaying.

A worst-case pattern: SDC in one of the optimizer's running stats (m, v in Adam) produces a corruption that is small per step but compounds. Loss curves look fine for thousands of steps, then unwind. Diagnosis usually requires bisection over checkpoints.

Detection strategies

There is no single solution. Practical fleets stack three:

  1. Replication for critical operations. Run the same matmul on two GPUs and compare the outputs. If they differ, you have an SDC event. The cost is high (a 50% throughput tax on every replicated op), so this is reserved for evaluation steps, gradient checks at known intervals, or critical control-plane paths. A sketch follows this list.
  2. Numerical sanity checks. Track per-step loss, per-step gradient norm, and per-step parameter norm. An SDC event often shows up as an outlier in one of these signals before the model itself diverges. Not all SDCs trigger this, but many do, and the cost is a few extra reductions per step. A sketch of this also follows the list.
  3. Periodic deterministic replay. On a fixed cadence, re-run a known input through a known checkpoint on each GPU and compare against a reference. Discrepancies are SDC. This is expensive but reliable; large operators run it as a maintenance job between training runs, not during them.
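A minimal sketch of strategy 1 in PyTorch, assuming two visible CUDA devices of the same model (the device indices, tolerance, and function name are illustrative, not a prescribed implementation):

  import torch

  def replicated_matmul(a: torch.Tensor, b: torch.Tensor, atol: float = 0.0) -> torch.Tensor:
      """Run the same matmul on two GPUs and compare; raise if the results disagree."""
      out0 = torch.matmul(a.to("cuda:0"), b.to("cuda:0"))
      out1 = torch.matmul(a.to("cuda:1"), b.to("cuda:1"))
      # Same inputs, same GPU model, same kernel selection: results are normally
      # bit-identical. Use atol > 0 only if the two devices pick different algorithms.
      if not torch.allclose(out0, out1.to("cuda:0"), rtol=0.0, atol=atol):
          raise RuntimeError("replicated matmul mismatch: possible SDC on cuda:0 or cuda:1")
      return out0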
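And a sketch of strategy 2 as it might sit inside a training loop; the window length, sigma threshold, and class name are assumptions for illustration, not a recommendation:

  from collections import deque
  import torch

  class StepStatsMonitor:
      """Flag per-step loss / grad-norm / param-norm outliers against a recent window."""

      def __init__(self, window: int = 200, sigma: float = 6.0):
          self.history = {k: deque(maxlen=window) for k in ("loss", "grad_norm", "param_norm")}
          self.sigma = sigma

      def _check(self, name: str, value: float) -> bool:
          hist = self.history[name]
          flagged = False
          if len(hist) >= 20:  # wait for some history before flagging anything
              t = torch.tensor(list(hist))
              flagged = abs(value - t.mean().item()) > self.sigma * (t.std().item() + 1e-12)
          hist.append(value)
          return flagged

      def step(self, loss: torch.Tensor, model: torch.nn.Module) -> list[str]:
          grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
          values = {
              "loss": loss.item(),
              "grad_norm": torch.stack(grads).norm().item() if grads else 0.0,
              "param_norm": torch.stack([p.detach().norm() for p in model.parameters()]).norm().item(),
          }
          return [name for name, v in values.items() if self._check(name, v)]

A non-empty return does not prove an SDC event; it marks the step, and the surrounding checkpoint window, as worth a replicated re-run.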

None of the three catches every event. SDC is the failure mode where the right operational answer is not "detect every event" but "make recovery cheap when an event slips through": frequent checkpoints, careful versioning, and the ability to bisect across a window of suspect work.
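In practice the bisection is a binary search over saved checkpoints, with some pass/fail probe at each midpoint (deterministic replay against a reference, or a short re-training probe). A minimal sketch; is_clean is a hypothetical probe you would supply:

  from typing import Callable, Sequence

  def bisect_checkpoints(checkpoints: Sequence[str],
                         is_clean: Callable[[str], bool]) -> tuple[str, str]:
      """Binary-search an ordered list of checkpoint paths for the first corrupted one.

      Assumes checkpoints[0] is known-good and checkpoints[-1] is known-bad.
      Returns (last_known_good, first_suspect).
      """
      lo, hi = 0, len(checkpoints) - 1
      while hi - lo > 1:
          mid = (lo + hi) // 2
          if is_clean(checkpoints[mid]):
              lo = mid   # corruption happened after this checkpoint
          else:
              hi = mid   # corruption happened at or before this checkpoint
      return checkpoints[lo], checkpoints[hi]

That is log2(window) probes instead of one per checkpoint, which is what makes a multi-day suspect window tractable.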

What does not work

A few approaches sound right and do not work in practice:

  • Cranking up ECC verbosity. SDC bypasses ECC by definition. More ECC visibility helps with corrupted DRAM events, which are not SDC.
  • Trusting CRC on the wire. NVLink and InfiniBand carry CRC, which protects in-flight data, not in-flight computation. The corruption is upstream of the link, not on it.
  • Software RTL-style assertions. Adding assert(value > 0) style checks to training loops catches some corruption, but only the variety where the corruption produces an obviously wrong value. Most SDC produces values that are wrong but valid.

The correct framing is: assume SDC happens, design the workflow so that the cost when it does is bounded.

What Factryze does about it

The honest position: SDC is the hardest single failure mode in fleet operations, and no monitoring product (ours included) can detect a true silent event from telemetry alone. What Factryze can do, and does, is correlate the indirect signals (loss curve drift, gradient-norm outliers, fault-domain-localized step-time anomalies) into the narrowest possible suspect window for bisection. That cuts the time from "something is wrong" to "we know which checkpoint and which rank to examine" from days to hours. The SDC event still has to be confirmed by replication or replay; the platform is what makes the suspect window small enough to be tractable.

Updated 2026-05-09