
For AI/ML Labs

Catch the failures your dashboards miss.

Factryze runs autonomous agents inside your training and inference clusters. Deploys in your VPC alongside DCGM, Prometheus, and Grafana; telemetry stays in your network.

See the GPU Glossary

Where AI/ML labs lose time today

Xid 79 + Xid 48

Silent training degradation

A PCIe link downtrains or one rank starts thermal throttling, and every all-reduce slows with it. The job keeps running. The next checkpoint is 18% slower and nobody knows why for two days.

+18% checkpoint time · 2 days to detect
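
A minimal sketch of how this failure mode can be surfaced before the next checkpoint, using NVML through the pynvml bindings to compare the current PCIe link width against the link's maximum and to read the clock-throttle reason bitmask. The 1 Hz poll and the alert wording are illustrative assumptions, not Factryze internals.

```python
# Sketch: per-GPU watch for PCIe downtrain and thermal throttling via NVML.
# Poll rate and messages are illustrative assumptions, not Factryze defaults.
import time
import pynvml

pynvml.nvmlInit()
THERMAL = (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
           | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)
try:
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            cur = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
            peak = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
            reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)

            if cur < peak:
                print(f"GPU {i}: PCIe link downtrained x{peak} -> x{cur}")
            if reasons & THERMAL:
                print(f"GPU {i}: thermal throttle active at {temp} C")
        time.sleep(1)  # ~1 Hz, in line with the read cadence described further down
finally:
    pynvml.nvmlShutdown()
```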
NCCL ring timeout

Faults that look fine until they aren't

Your existing tooling logs Xid 79, Xid 48, NCCL ring timeouts. Nobody on the team has time to separate the signal from the noise until a job actually crashes.

> Xid 79 · fallen off bus
> Xid 48 · DBE uncorrectable
> NCCL · ring timeout 5000ms
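
For context on what that noise looks like: the driver reports these faults as Xid lines in the kernel log. A rough triage sketch that scans dmesg output for the codes above; the exact NVRM message format varies by driver version, so the regex is an assumption.

```python
# Sketch: scan the kernel log for Xid events and separate page-worthy ones from noise.
# The NVRM line format differs across driver versions; this regex is an assumption.
import re
import subprocess

CRITICAL_XIDS = {48: "double-bit ECC error (uncorrectable)", 79: "GPU has fallen off the bus"}

# Typical form: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    m = XID_RE.search(line)
    if not m:
        continue
    pci_addr, xid = m.group(1), int(m.group(2))
    severity = "PAGE" if xid in CRITICAL_XIDS else "log"
    print(f"[{severity}] Xid {xid} on {pci_addr}: {CRITICAL_XIDS.get(xid, 'see the Xid catalog')}")
```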
5-layer fault chain

On-call burnout from cross-layer faults

Hardware, driver, fabric, scheduler, training framework. Failures cross all five layers. The on-call engineer with the right context is asleep in a different timezone.

What changes when an agent is watching

Where Factryze sits in your stack

event stream · example · 2026-04-12
14:23:11 · GPU 4 (H100)
Xid 79: fallen off bus
14:23:11 · rank 4
NCCL ring timeout 5000ms
14:23:12 · GPU 4
PCIe x16 → x8 downtrained
14:23:13 · scheduler
job 4f72 paused
agent · drain node 04, swap to node 09, ticket FZ-417

Factryze reads from DCGM, NCCL, PCIe state, and scheduler events. Diagnoses the actual cause across all five layers. Recommends the fix; executes once approved.

<60s detection

Continuous failure detection

Every GPU, fabric link, and driver state is read at least once a second. Existing DCGM, Prometheus, and Grafana stay in place.

DCGM read · 1Hz
NCCL probe · 1Hz
scheduler poll · 5Hz
5 layers correlated

Cross-layer root cause

Correlates GPU thermals, PCIe link state, NCCL traces, and scheduler events. Identifies the actual cause, not the alert that fired loudest.
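
One way to picture that correlation step: bucket events from every layer into short time windows and treat the earliest event in the lowest layer as the likely root cause. The event shapes, the 5-second window, and the ranking heuristic below are simplifying assumptions for illustration, not the actual Factryze model; the sample events mirror the stream shown above.

```python
# Sketch: correlate events from different layers into one incident by time window.
# Event shapes, the 5 s window, and the ranking heuristic are assumptions.
from collections import defaultdict

LAYER_ORDER = ["hardware", "driver", "fabric", "scheduler", "framework"]

events = [  # (timestamp_s, layer, entity, message) -- mirrors the example stream above
    (51791.0, "driver",    "GPU 4",    "Xid 79: fallen off bus"),
    (51791.0, "fabric",    "rank 4",   "NCCL ring timeout 5000ms"),
    (51792.0, "hardware",  "GPU 4",    "PCIe x16 -> x8 downtrained"),
    (51793.0, "scheduler", "job 4f72", "job paused"),
]

def correlate(events, window_s=5.0):
    """Group events whose timestamps land in the same window into one incident."""
    incidents = defaultdict(list)
    for ts, layer, entity, msg in sorted(events):
        incidents[int(ts // window_s)].append((ts, layer, entity, msg))
    return list(incidents.values())

for incident in correlate(events):
    # Heuristic: the earliest event in the lowest layer is the likely root cause.
    root = min(incident, key=lambda e: (LAYER_ORDER.index(e[1]), e[0]))
    print("incident:", [msg for *_, msg in incident])
    print("likely root cause:", root[3], "on", root[2])
```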

drain → swap → ticket

Recommend, then remediate

Human-in-the-loop by default. Approve once or set autopilot for known-safe runbooks.

drain node 04
swap workload to node 09
ticket FZ-417
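
A rough idea of what an approved runbook step could look like on a Slurm-managed cluster. The node name, ticket reference, and approval prompt mirror the example above and are illustrative; the drain command assumes scontrol is on the path, and the swap and ticket steps would follow the same approve-then-execute pattern.

```python
# Sketch: human-in-the-loop runbook execution -- approve each step, or set autopilot.
# Node name and ticket ID mirror the example above; assumes a Slurm cluster with scontrol.
import subprocess

RUNBOOK = [
    ["scontrol", "update", "NodeName=node04", "State=DRAIN", "Reason=FZ-417 PCIe downtrain"],
    # Resubmitting the job on node09 and filing ticket FZ-417 would follow the same pattern.
]

def run_step(cmd, autopilot=False):
    if not autopilot:
        answer = input(f"approve `{' '.join(cmd)}`? [y/N] ")
        if answer.strip().lower() != "y":
            print("skipped:", cmd[0])
            return
    subprocess.run(cmd, check=True)

for step in RUNBOOK:
    run_step(step, autopilot=False)  # flip to True only for known-safe runbooks
```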

Why not just an existing tool?

Detect Xid 79 / NCCL timeouts in seconds

DCGM Exporter · partial
Datadog GPU · yes
NVIDIA Mission Control · yes
Factryze · yes

Cross-layer correlation (GPU + NCCL + PCIe + scheduler)

DCGM Exporter · no
Datadog GPU · partial
NVIDIA Mission Control · partial
Factryze · yes

Recommend the actual remediation step

DCGM Exporter · no
Datadog GPU · no
NVIDIA Mission Control · no
Factryze · yes

Execute the runbook (drain, swap, ticket)

DCGM Exporter · no
Datadog GPU · no
NVIDIA Mission Control · no
Factryze · yes

Deploys in your VPC with telemetry on-network

DCGM Exporter · yes
Datadog GPU · no
NVIDIA Mission Control · partial
Factryze · yes
yes = handles natively · partial = requires assembly · no = not supported

Not ready for a call?

Currently piloting on NVIDIA H100 and A100 training clusters · per-GPU pricing

Want to stop losing checkpoints?

30 minutes with the founders. Bring your worst recent incident. We'll diagnose live and tell you whether we can help. No pitch until we've earned it.