Runbook

Executable remediation procedures with conditional logic and approval gates for GPU issues.

What it is

A runbook is a structured, executable set of procedures for diagnosing and resolving specific GPU infrastructure issues. Runbooks have evolved from static wiki documents into programmatic templates that combine conditional logic (if the ECC error rate exceeds a threshold, drain the node; otherwise reset the GPU), variable interpolation (node hostname, GPU index, Xid code, and DCGM field values injected at runtime), approval gates for destructive actions, and automatic rollback if a remediation step fails. Each step logs its inputs, outputs, and decision path for audit compliance and post-incident review.
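
The pieces described above (conditional branching, runtime variable interpolation, an approval gate, rollback on failure, and per-step audit logging) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's runbook format: the `Step`, `run_runbook`, `choose_remediation`, and `build_steps` names, the context keys, the ECC threshold of 10, and the `approve` callback are all hypothetical, and the actions are stubs that return strings instead of touching real hardware.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Context = dict  # runtime variables: hostname, GPU index, Xid code, counters


@dataclass
class Step:
    name: str                                  # may contain {placeholders}
    action: Callable[[Context], str]           # stub action; returns its output
    destructive: bool = False                  # destructive steps need approval
    rollback: Optional[Callable[[Context], None]] = None


def run_runbook(steps, ctx, approve, audit):
    """Run steps in order; on denial or failure, undo completed steps in reverse."""
    done = []
    for step in steps:
        label = step.name.format(**ctx)        # variable interpolation
        try:
            if step.destructive and not approve(label):
                raise PermissionError("approval denied")
            output = step.action(ctx)
        except Exception as exc:
            audit.append({"step": label, "status": "failed", "error": str(exc)})
            for prev in reversed(done):
                if prev.rollback:
                    prev.rollback(ctx)
                    audit.append({"step": prev.name.format(**ctx),
                                  "status": "rolled_back"})
            return False
        audit.append({"step": label, "status": "ok", "output": output})
        done.append(step)
    return True


def choose_remediation(ctx, ecc_threshold=10):
    """Conditional logic: drain above the (assumed) threshold, else reset."""
    return "drain" if ctx["ecc_errors"] > ecc_threshold else "reset"


def build_steps(ctx):
    """Assemble a two-step runbook: cordon, then drain or reset."""
    cordon = Step(
        name="cordon {hostname}",
        action=lambda c: f"cordoned {c['hostname']}",
        rollback=lambda c: c.setdefault("events", []).append("uncordoned"),
    )
    if choose_remediation(ctx) == "drain":
        remediate = Step("drain {hostname}",
                         lambda c: f"drained {c['hostname']}",
                         destructive=True)
    else:
        remediate = Step("reset GPU {gpu_index} on {hostname}",
                         lambda c: f"reset GPU {c['gpu_index']}",
                         destructive=True)
    return [cordon, remediate]
```

Calling `run_runbook(build_steps(ctx), ctx, approve, audit)` with an `approve` callback that returns `False` leaves the destructive step unexecuted, rolls back the earlier cordon step, and records both events in `audit`. Treating a denied approval as a failure that triggers rollback is a design choice made for this sketch; a real system might instead pause and wait for a decision.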

Why it matters

Well-designed GPU runbooks encode tribal knowledge that otherwise lives only in senior engineers' heads: the correct NVLink escalation sequence, the specific order of operations for row remapping exhaustion, or the precise DCGM fields to check before declaring a GPU safe for return to service. Without runbooks, the outcome of every incident depends on the on-call engineer's experience level and current context. For well-understood failure patterns, a runbook can turn a 40-minute, expert-dependent triage into a 90-second automated response.

How to monitor

Track runbook execution success rates, step failure points, and rollback frequency to identify which procedures need refinement. Audit logs per step enable post-incident review and compliance documentation. Factryze's SRE Agent executes runbooks autonomously for well-understood failure patterns (checking DCGM_FI_DEV_ROW_REMAP_FAILURE, DCGM_FI_DEV_RETIRED_PENDING, ECC counters) while routing novel or high-risk scenarios through approval gates to human operators.
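
One way to compute the three metrics named above (success rate, step failure points, and rollback frequency) from execution records is sketched below. The record layout (`runbook`, `succeeded`, `rolled_back`, and `failed_step` keys) and the `summarize` function are assumptions made for illustration; adapt them to whatever your audit log actually emits.

```python
from collections import Counter, defaultdict


def summarize(executions):
    """Aggregate per-runbook success rate, rollback rate, and top failure step.

    Each execution is a dict with a hypothetical layout:
      {"runbook": str, "succeeded": bool, "rolled_back": bool,
       "failed_step": str or None}
    """
    totals = Counter()
    successes = Counter()
    rollbacks = Counter()
    failure_points = defaultdict(Counter)

    for run in executions:
        name = run["runbook"]
        totals[name] += 1
        if run["succeeded"]:
            successes[name] += 1
        if run["rolled_back"]:
            rollbacks[name] += 1
        if run.get("failed_step"):
            failure_points[name][run["failed_step"]] += 1

    report = {}
    for name, total in totals.items():
        worst = failure_points[name].most_common(1)
        report[name] = {
            "runs": total,
            "success_rate": successes[name] / total,
            "rollback_rate": rollbacks[name] / total,
            # The step that fails most often is the first candidate for refinement.
            "top_failure_step": worst[0][0] if worst else None,
        }
    return report
```

A runbook whose `top_failure_step` stays the same across many runs, or whose `rollback_rate` climbs over time, is a likely candidate for revision.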

Figure: Runbook - Automated Remediation with Approval Gates

Monitor this automatically

Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
