
For AI/ML Labs

Catch the failures your dashboards miss.

Factryze runs autonomous agents inside your training and inference clusters. Deploys in your VPC alongside DCGM, Prometheus, and Grafana; telemetry stays in your network.

See the GPU Glossary

Where AI/ML labs lose time today

Xid 79 + Xid 48

Silent training degradation

A PCIe link downtrains or one rank starts thermal throttling, and every all-reduce slows with it. The job keeps running. The next checkpoint is 18% slower and nobody knows why for two days.

+18% checkpoint time · 2 days to detect
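
A minimal sketch of how this failure mode can be surfaced before the next checkpoint, using NVML through the pynvml bindings to compare the current PCIe link width against the link's maximum and to read the clock-throttle reason bitmask. The 1 Hz poll and the alert wording are illustrative assumptions, not Factryze internals.

```python
# Sketch: per-GPU watch for PCIe downtrain and thermal throttling via NVML.
# Poll rate and messages are illustrative assumptions, not Factryze defaults.
import time
import pynvml

pynvml.nvmlInit()
THERMAL = (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
           | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)
try:
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            cur = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
            peak = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
            reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)

            if cur < peak:
                print(f"GPU {i}: PCIe link downtrained x{peak} -> x{cur}")
            if reasons & THERMAL:
                print(f"GPU {i}: thermal throttle active at {temp} C")
        time.sleep(1)  # ~1 Hz, in line with the read cadence described further down
finally:
    pynvml.nvmlShutdown()
```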
NCCL ring timeout

Faults that look fine until they aren't

Your existing tooling logs Xid 79, Xid 48, NCCL ring timeouts. Nobody on the team has time to separate the signal from the noise until a job actually crashes.

> Xid 79 · fallen off bus
> Xid 48 · DBE uncorrectable
> NCCL · ring timeout 5000ms
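
For context on what that noise looks like: the driver reports these faults as Xid lines in the kernel log. A rough triage sketch that scans dmesg output for the codes above; the exact NVRM message format varies by driver version, so the regex is an assumption.

```python
# Sketch: scan the kernel log for Xid events and separate page-worthy ones from noise.
# The NVRM line format differs across driver versions; this regex is an assumption.
import re
import subprocess

CRITICAL_XIDS = {48: "double-bit ECC error (uncorrectable)", 79: "GPU has fallen off the bus"}

# Typical form: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    m = XID_RE.search(line)
    if not m:
        continue
    pci_addr, xid = m.group(1), int(m.group(2))
    severity = "PAGE" if xid in CRITICAL_XIDS else "log"
    print(f"[{severity}] Xid {xid} on {pci_addr}: {CRITICAL_XIDS.get(xid, 'see the Xid catalog')}")
```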
5-layer fault chain

On-call burnout from cross-layer faults

Hardware, driver, fabric, scheduler, training framework. Failures cross all five layers. The on-call engineer with the right context is asleep in a different timezone.

What changes when an agent is watching

Where Factryze sits in your stack

event stream · example · 2026-04-12
14:23:11 · GPU 4 (H100)
Xid 79: fallen off bus
14:23:11 · rank 4
NCCL ring timeout 5000ms
14:23:12 · GPU 4
PCIe x16 → x8 downtrained
14:23:13 · scheduler
job 4f72 paused
agent · drain node 04, swap to node 09, ticket FZ-417

Factryze reads from DCGM, NCCL, PCIe state, and scheduler events. Diagnoses the actual cause across all five layers. Recommends the fix; executes once approved.

<60s detection

Continuous failure detection

Every GPU, fabric link, and driver state is read at least once a second. Existing DCGM, Prometheus, and Grafana stay in place.

DCGM read · 1Hz
NCCL probe · 1Hz
scheduler poll · 5Hz
5 layers correlated

Cross-layer root cause

Correlates GPU thermals, PCIe link state, NCCL traces, and scheduler events. Identifies the actual cause, not the alert that fired loudest.
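
One way to picture that correlation step: bucket events from every layer into short time windows and treat the earliest event in the lowest layer as the likely root cause. The event shapes, the 5-second window, and the ranking heuristic below are simplifying assumptions for illustration, not the actual Factryze model; the sample events mirror the stream shown above.

```python
# Sketch: correlate events from different layers into one incident by time window.
# Event shapes, the 5 s window, and the ranking heuristic are assumptions.
from collections import defaultdict

LAYER_ORDER = ["hardware", "driver", "fabric", "scheduler", "framework"]

events = [  # (timestamp_s, layer, entity, message) -- mirrors the example stream above
    (51791.0, "driver",    "GPU 4",    "Xid 79: fallen off bus"),
    (51791.0, "fabric",    "rank 4",   "NCCL ring timeout 5000ms"),
    (51792.0, "hardware",  "GPU 4",    "PCIe x16 -> x8 downtrained"),
    (51793.0, "scheduler", "job 4f72", "job paused"),
]

def correlate(events, window_s=5.0):
    """Group events whose timestamps land in the same window into one incident."""
    incidents = defaultdict(list)
    for ts, layer, entity, msg in sorted(events):
        incidents[int(ts // window_s)].append((ts, layer, entity, msg))
    return list(incidents.values())

for incident in correlate(events):
    # Heuristic: the earliest event in the lowest layer is the likely root cause.
    root = min(incident, key=lambda e: (LAYER_ORDER.index(e[1]), e[0]))
    print("incident:", [msg for *_, msg in incident])
    print("likely root cause:", root[3], "on", root[2])
```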

drain → swap → ticket

Recommend, then remediate

Human-in-the-loop by default. Approve once or set autopilot for known-safe runbooks.

drain node 04
swap workload to node 09
ticket FZ-417
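
A rough idea of what an approved runbook step could look like on a Slurm-managed cluster. The node name, ticket reference, and approval prompt mirror the example above and are illustrative; the drain command assumes scontrol is on the path, and the swap and ticket steps would follow the same approve-then-execute pattern.

```python
# Sketch: human-in-the-loop runbook execution -- approve each step, or set autopilot.
# Node name and ticket ID mirror the example above; assumes a Slurm cluster with scontrol.
import subprocess

RUNBOOK = [
    ["scontrol", "update", "NodeName=node04", "State=DRAIN", "Reason=FZ-417 PCIe downtrain"],
    # Resubmitting the job on node09 and filing ticket FZ-417 would follow the same pattern.
]

def run_step(cmd, autopilot=False):
    if not autopilot:
        answer = input(f"approve `{' '.join(cmd)}`? [y/N] ")
        if answer.strip().lower() != "y":
            print("skipped:", cmd[0])
            return
    subprocess.run(cmd, check=True)

for step in RUNBOOK:
    run_step(step, autopilot=False)  # flip to True only for known-safe runbooks
```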

Why not just an existing tool?

Detect Xid 79 / NCCL timeouts in seconds

DCGM Exporter · partial
Datadog GPU · yes
NVIDIA Mission Control · yes
Factryze · yes

Cross-layer correlation (GPU + NCCL + PCIe + scheduler)

DCGM Exporter · no
Datadog GPU · partial
NVIDIA Mission Control · partial
Factryze · yes

Recommend the actual remediation step

DCGM Exporter · no
Datadog GPU · no
NVIDIA Mission Control · no
Factryze · yes

Execute the runbook (drain, swap, ticket)

DCGM Exporter · no
Datadog GPU · no
NVIDIA Mission Control · no
Factryze · yes

Deploys in your VPC with telemetry on-network

DCGM Exporter · yes
Datadog GPU · no
NVIDIA Mission Control · partial
Factryze · yes
yes = handles natively · partial = requires assembly · no = not supported

Not ready for a call?

Currently piloting on NVIDIA H100 and A100 training clusters · per-GPU pricing

Want to stop losing checkpoints?

30 minutes with the founders. Bring your worst recent incident. We'll diagnose live and tell you whether we can help. No pitch until we've earned it.