How Factryze GPU Monitoring Agents Work: Architecture Deep-Dive
Technical deep-dive into how Factryze's NOC, SRE, and Performance agents monitor DCGM metrics, diagnose GPU failures, and optimize cluster utilization in real-time.
Factryze uses a multi-agent architecture where three specialized AI agents - NOC, SRE, and Performance - work together to monitor, diagnose, and optimize GPU infrastructure autonomously. This article is a technical deep-dive into how each agent works, what metrics they consume, and how they coordinate to keep your GPU clusters healthy.
If you're new to autonomous GPU monitoring, start with our overview: Why GPU Infrastructure Needs Autonomous Agents.
How Is Factryze's Agent Architecture Structured?
The Factryze platform consists of three layers: a data collection layer, a metrics storage and processing layer, and the agent layer. At the base, a lightweight Go agent runs on each GPU node in your cluster. This agent weighs in at under 15 MB, consumes negligible CPU and memory, and runs as a systemd service that starts automatically on boot. Its job is simple: collect GPU metrics via NVIDIA DCGM and forward them to the platform's ingestion service.
The Go agent uses DCGM's API to sample GPU metrics at configurable intervals - typically every 5 seconds for health metrics and every 30 seconds for performance counters. It also collects host-level context: CPU utilization, memory pressure, disk I/O, and network throughput from the node itself. This host context is critical because GPU issues frequently have root causes outside the GPU - a saturated PCIe bus, a memory-starved host, or a failing NIC can all manifest as GPU performance degradation.
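To make the sampling cadence concrete, here is a minimal Go sketch of what a two-rate collection loop could look like. The sampler interface, the fake implementation, and the channel-based hand-off are illustrative stand-ins rather than the actual Factryze agent code; only the DCGM field names come from the metrics discussed in this article.

package main

import (
	"fmt"
	"time"
)

// Sample is one metric observation taken from a GPU or from the host.
type Sample struct {
	Name      string    // e.g. "DCGM_FI_DEV_GPU_TEMP"
	GPU       int       // -1 for host-level metrics
	Value     float64
	Timestamp time.Time
}

// sampler abstracts "read the current value of a set of fields".
// A real agent would back this with the DCGM client bindings; here it is a stub.
type sampler interface {
	Read(fields []string) ([]Sample, error)
}

// collect runs two loops at different cadences: health metrics every healthEvery
// (e.g. 5s) and performance counters every perfEvery (e.g. 30s), forwarding each
// batch to out for shipping to the ingestion service.
func collect(s sampler, out chan<- []Sample, healthEvery, perfEvery time.Duration) {
	healthFields := []string{"DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_ECC_SBE_VOL_TOTAL", "DCGM_FI_DEV_POWER_USAGE"}
	perfFields := []string{"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_PROF_GR_ENGINE_ACTIVE", "DCGM_FI_DEV_PCIE_TX_THROUGHPUT"}

	healthTick := time.NewTicker(healthEvery)
	perfTick := time.NewTicker(perfEvery)
	defer healthTick.Stop()
	defer perfTick.Stop()

	for {
		select {
		case <-healthTick.C:
			if batch, err := s.Read(healthFields); err == nil {
				out <- batch
			}
		case <-perfTick.C:
			if batch, err := s.Read(perfFields); err == nil {
				out <- batch
			}
		}
	}
}

// fakeSampler lets the sketch run without real hardware.
type fakeSampler struct{}

func (fakeSampler) Read(fields []string) ([]Sample, error) {
	now := time.Now()
	out := make([]Sample, 0, len(fields))
	for _, f := range fields {
		out = append(out, Sample{Name: f, GPU: 0, Value: 42, Timestamp: now})
	}
	return out, nil
}

func main() {
	out := make(chan []Sample, 16)
	go collect(fakeSampler{}, out, 5*time.Second, 30*time.Second)
	for batch := range out {
		fmt.Printf("shipping %d samples\n", len(batch)) // stand-in for the ingestion call
	}
}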
All collected data flows to the Factryze backend, which runs entirely on-prem behind your firewall. There is no cloud dependency and no data egress. The backend stores metrics in VictoriaMetrics, a high-performance time-series database that handles the write throughput of large GPU clusters without breaking a sweat. The three AI agents - NOC, SRE, and Performance - query this metrics store via internal APIs to perform their respective functions. They never read raw metric streams directly; instead, they consume pre-indexed, queryable data that allows them to perform complex cross-metric and cross-node correlations efficiently.
Why Three Agents Instead of One
The decision to use three specialized agents rather than one general-purpose system is grounded in a fundamental principle: detection, diagnosis, and optimization are different problems that demand different architectures.
Detection - the NOC Agent's domain - requires continuous, low-latency scanning of every metric stream against both static thresholds and learned baselines. It needs to be fast and comprehensive, processing thousands of metric samples per second with minimal delay. This is a streaming analytics problem.
Diagnosis - the SRE Agent's domain - requires deep, contextual reasoning across multiple data sources. When the NOC Agent raises an alert, the SRE Agent needs to query historical data, correlate metrics across layers (hardware, driver, OS, workload), and match patterns against a knowledge base of known failure modes. This is a search and reasoning problem that benefits from a different computational approach than real-time streaming.
Optimization - the Performance Agent's domain - requires long-horizon analysis of utilization patterns, scheduling efficiency, and resource allocation. It operates on time windows of hours to days, looking for structural inefficiencies rather than acute failures. This is an analytics and planning problem. Trying to build one agent that excels at all three would result in a system that compromises on latency, depth, and time horizon simultaneously.
Data Flow: From GPU to Action
The metric pipeline follows a well-defined path. On each node, the Go agent queries DCGM through its client API and packages the metrics into a compact wire format. These metric batches are sent to the Factryze ingestion service over a persistent connection, which handles deduplication, timestamping, and schema validation before writing to VictoriaMetrics.
Once metrics land in VictoriaMetrics, they become available to all three agents via an internal query API that supports both instant queries (current value of a metric) and range queries (metric values over a time window). The agents also have access to a metadata store that tracks node topology, GPU hardware specifications, driver versions, and workload assignments. This metadata is essential for contextual diagnosis - knowing that GPU 4 on node-17 is an H100 SXM5 running CUDA 12.4 with driver 550.54 is often the difference between a correct diagnosis and a wrong one.
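As a concrete illustration of the query side, VictoriaMetrics serves the Prometheus-compatible HTTP API, so a range query for one GPU's temperature history can look roughly like the Go sketch below. The backend address, port, and label names (node, gpu) are assumptions made for the example, not the actual Factryze query layer.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// rangeQuery asks the metrics store for a metric over a time window using the
// Prometheus-compatible /api/v1/query_range endpoint that VictoriaMetrics serves.
func rangeQuery(base, promql string, start, end time.Time, step time.Duration) (string, error) {
	params := url.Values{}
	params.Set("query", promql)
	params.Set("start", fmt.Sprintf("%d", start.Unix()))
	params.Set("end", fmt.Sprintf("%d", end.Unix()))
	params.Set("step", step.String())

	resp, err := http.Get(base + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Average GPU temperature for GPU 4 on node-17 over the last hour,
	// smoothed over 5-minute windows. Label names are illustrative.
	q := `avg_over_time(DCGM_FI_DEV_GPU_TEMP{node="node-17", gpu="4"}[5m])`
	out, err := rangeQuery("http://factryze-backend:8428", q, time.Now().Add(-1*time.Hour), time.Now(), 30*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // JSON result matrix from the time-series store
}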
When an agent determines that action is needed - whether that is raising an alert, executing a runbook, or issuing an optimization recommendation - it publishes the action to an event bus that other agents can subscribe to. This is how agents coordinate without tight coupling: the NOC Agent publishes detection events, the SRE Agent subscribes to those events and publishes diagnosis results, and the Performance Agent consumes both to inform its optimization decisions.
How Does the NOC Agent Monitor GPU Health?
The NOC Agent is the first line of defense. It runs continuously, scanning every GPU metric stream across your entire fleet, and its sole purpose is to detect anomalies as quickly and accurately as possible. It is designed for two properties above all else: low detection latency and low false-positive rate. A missed anomaly means GPU downtime; a false positive means alert fatigue and eroded trust in the system.
The NOC Agent operates in two modes simultaneously. The first is threshold-based monitoring, where metrics are compared against configurable upper and lower bounds. The second is baseline-deviation monitoring, where the agent maintains a rolling statistical model of each metric's expected behavior and flags deviations that exceed a configurable sensitivity level. Both modes run in parallel, and either can trigger an alert independently.
What the NOC Agent Monitors
The NOC Agent consumes the full range of NVIDIA DCGM metrics, organized into several categories:
Thermal metrics: DCGM_FI_DEV_GPU_TEMP (GPU core temperature), DCGM_FI_DEV_MEMORY_TEMP (HBM memory temperature). These are sampled at 5-second intervals and tracked both as absolute values and as rates of change. A GPU that is at 78 degrees C and stable is very different from one that is at 78 degrees C and climbing at 2 degrees per minute.
Utilization metrics: DCGM_FI_DEV_GPU_UTIL (SM utilization), DCGM_FI_DEV_MEM_COPY_UTIL (memory controller utilization), DCGM_FI_PROF_GR_ENGINE_ACTIVE (graphics engine activity). These metrics reveal whether the GPU is actually doing useful work or sitting idle despite being allocated.
Memory metrics: DCGM_FI_DEV_FB_USED (framebuffer memory used), DCGM_FI_DEV_FB_FREE (framebuffer memory free). Memory pressure is one of the most common causes of training job failures, and early detection prevents OOM kills.
Reliability metrics: DCGM_FI_DEV_ECC_SBE_VOL_TOTAL (volatile single-bit ECC errors), DCGM_FI_DEV_ECC_DBE_VOL_TOTAL (volatile double-bit ECC errors), DCGM_FI_DEV_RETIRED_SBE (retired pages due to single-bit errors), DCGM_FI_DEV_RETIRED_DBE (retired pages due to double-bit errors). ECC errors are the earliest indicator of GPU memory degradation and often precede hard failures by hours or days.
Power metrics: DCGM_FI_DEV_POWER_USAGE (current power draw), DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION (cumulative energy). Power draw anomalies can indicate both hardware issues (a GPU drawing more power than expected under a given workload) and cooling issues (power throttling engaging to prevent thermal damage).
Interconnect metrics: DCGM_FI_DEV_PCIE_TX_THROUGHPUT and DCGM_FI_DEV_PCIE_RX_THROUGHPUT (PCIe bandwidth), NVLink counters where available. Interconnect degradation is particularly insidious because it slows distributed training without triggering any GPU-level error.
Alert Rules and Thresholds
The NOC Agent ships with sensible default thresholds for all monitored metrics, but every threshold is fully configurable. By default, temperature alerts trigger at 85 degrees C (warning) and 90 degrees C (critical). ECC single-bit error alerts trigger when the rate exceeds 10 errors per hour. Double-bit ECC errors trigger an immediate critical alert on any occurrence, since they indicate uncorrectable memory corruption.
Crucially, the NOC Agent implements cooldown periods and deduplication logic to prevent alert storms. When a single root cause - say, a cooling system failure affecting an entire rack - triggers temperature alerts on 16 GPUs simultaneously, the agent groups these into a single correlated alert rather than firing 16 independent pages. Cooldown periods prevent a flapping metric from generating repeated alerts; once an alert fires, the same condition will not re-alert for a configurable window (default: 15 minutes) unless the severity escalates.
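The sketch below shows one plausible shape for threshold evaluation with cooldowns, using the default values quoted above. The rule and state structures are illustrative, not Factryze's configuration schema, and the real agent also applies the cross-GPU grouping described earlier.

package main

import (
	"fmt"
	"time"
)

type Severity int

const (
	Warning Severity = iota
	Critical
)

func (s Severity) String() string {
	if s == Critical {
		return "critical"
	}
	return "warning"
}

// Rule is a static threshold on one metric: fire when value >= Bound.
type Rule struct {
	Metric   string
	Bound    float64
	Severity Severity
}

// Defaults mirroring the values described above (all configurable in practice).
var defaultRules = []Rule{
	{"DCGM_FI_DEV_GPU_TEMP", 85, Warning},          // degrees C
	{"DCGM_FI_DEV_GPU_TEMP", 90, Critical},         // degrees C
	{"ecc_sbe_rate_per_hour", 10, Warning},         // derived rate, not a raw DCGM field
	{"DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", 1, Critical}, // any double-bit error is critical
}

// alertState implements the cooldown: once a (gpu, metric) alert fires, the same
// condition is suppressed for the cooldown window unless the severity escalates.
type alertState struct {
	lastFired    map[string]time.Time
	lastSeverity map[string]Severity
	cooldown     time.Duration
}

func newAlertState(cooldown time.Duration) *alertState {
	return &alertState{map[string]time.Time{}, map[string]Severity{}, cooldown}
}

func (s *alertState) evaluate(gpu, metric string, value float64, now time.Time) {
	for _, r := range defaultRules {
		if r.Metric != metric || value < r.Bound {
			continue
		}
		key := gpu + "/" + metric
		escalated := r.Severity > s.lastSeverity[key]
		if !escalated && now.Sub(s.lastFired[key]) < s.cooldown {
			continue // still in cooldown for this condition
		}
		s.lastFired[key] = now
		s.lastSeverity[key] = r.Severity
		fmt.Printf("ALERT gpu=%s metric=%s value=%.1f severity=%v\n", gpu, metric, value, r.Severity)
	}
}

func main() {
	st := newAlertState(15 * time.Minute)
	now := time.Now()
	st.evaluate("node-17/gpu4", "DCGM_FI_DEV_GPU_TEMP", 86, now)                    // warning fires
	st.evaluate("node-17/gpu4", "DCGM_FI_DEV_GPU_TEMP", 87, now.Add(time.Minute))   // suppressed by cooldown
	st.evaluate("node-17/gpu4", "DCGM_FI_DEV_GPU_TEMP", 91, now.Add(2*time.Minute)) // critical escalates past cooldown
}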
Anomaly Detection Beyond Simple Thresholds
Static thresholds catch obvious failures, but many GPU issues manifest as subtle deviations from normal behavior that never cross a fixed boundary. A GPU that normally runs at 72% utilization during training hours but drops to 65% has not crossed any critical threshold, yet that 7-point drop might indicate NVLink degradation that is causing synchronization delays.
The NOC Agent addresses this through baseline-deviation detection. For each metric on each GPU, the agent maintains a rolling statistical profile that captures the metric's typical behavior across different time periods (time of day, day of week) and different workload states (training, inference, idle). When the current metric value deviates from the expected baseline by more than a configurable number of standard deviations, the agent flags it as an anomaly.
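A minimal sketch of the baseline-deviation idea, assuming a single rolling profile per metric: Welford's online algorithm maintains mean and variance, and a sample is flagged when it sits more than k standard deviations from the baseline. The real agent keeps separate profiles per time-of-day and workload state, which this sketch omits for brevity.

package main

import (
	"fmt"
	"math"
)

// baseline keeps an online mean/variance for one metric stream (Welford's algorithm).
type baseline struct {
	n    float64
	mean float64
	m2   float64 // sum of squared deviations from the mean
}

func (b *baseline) update(x float64) {
	b.n++
	delta := x - b.mean
	b.mean += delta / b.n
	b.m2 += delta * (x - b.mean)
}

func (b *baseline) stddev() float64 {
	if b.n < 2 {
		return 0
	}
	return math.Sqrt(b.m2 / (b.n - 1))
}

// isAnomaly reports whether x deviates from the learned baseline by more than
// k standard deviations (the configurable sensitivity level).
func (b *baseline) isAnomaly(x, k float64) bool {
	sd := b.stddev()
	if b.n < 30 || sd == 0 {
		return false // not enough history to judge
	}
	return math.Abs(x-b.mean) > k*sd
}

func main() {
	var util baseline
	// Learn a baseline around ~72% utilization with small jitter.
	for i := 0; i < 500; i++ {
		util.update(72 + math.Sin(float64(i))*1.5)
	}
	fmt.Println(util.isAnomaly(71.5, 3)) // false: within normal variation
	fmt.Println(util.isAnomaly(65, 3))   // true: the kind of 7-point drop described above
}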
The agent also performs cross-GPU correlation. If all 8 GPUs in a node simultaneously show a utilization drop, the cause is likely node-level (host CPU saturation, network issue, storage bottleneck) rather than GPU-specific. If only one GPU in an 8-GPU node shows degraded performance while the others are fine, the cause is likely GPU-specific (hardware degradation, driver issue). This cross-GPU context is attached to every alert, giving the SRE Agent a significant head start on diagnosis.
How Does the SRE Agent Diagnose and Fix GPU Issues?
The SRE Agent activates when the NOC Agent raises an alert. Its job is to answer two questions: "What is the root cause?" and "What should we do about it?" These are the questions that consume the majority of human engineering time in traditional incident response, and automating them is where the largest MTTR gains come from.
The SRE Agent approaches diagnosis as a structured search problem. Given an alert (a specific metric anomaly on a specific GPU at a specific time), it systematically queries related metrics, checks correlated systems, and matches the observed pattern against a knowledge base of known failure modes. This is not a keyword search or a simple rule lookup - it is a multi-step reasoning process that considers the full context of the alert.
Cross-Layer Signal Correlation
GPU failures rarely have single-layer root causes. A training job crash might be triggered by a GPU memory error, but the memory error might be caused by sustained thermal stress, which itself might be caused by a failed fan, a blocked air vent, or simply a workload that exceeds the GPU's thermal design power for an extended period. Diagnosing the actual root cause requires correlating signals across four layers:
Hardware layer: DCGM metrics (temperature, power, ECC errors, clock speeds), IPMI data (fan speeds, inlet/outlet temperatures, PSU status), and physical topology (which rack, which PDU, which cooling zone).
Driver layer: GPU driver version, CUDA version, known driver bugs for specific GPU SKUs, driver error logs, and Xid error codes. Certain Xid codes (like Xid 79, GPU fallen off the bus) point directly to hardware failure, while others (like Xid 31, GPU memory page fault) may indicate software issues.
OS layer: kernel logs, PCIe link state, IOMMU configuration, NUMA topology, and host resource pressure (CPU, memory, disk I/O). A GPU on a NUMA node whose associated CPU cores are saturated will show degraded performance even if the GPU itself is healthy.
Workload layer: which job is running on the affected GPU, what framework (PyTorch, JAX, TensorFlow), what model architecture, what batch size, and what phase of training. Some failure modes are workload-dependent - large batch sizes with mixed precision can trigger ECC errors on GPUs that run fine with smaller batches.
The SRE Agent queries all four layers in parallel, builds a unified timeline of events, and identifies the causal chain that best explains the observed anomaly.
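The sketch below illustrates the parallel fan-out and timeline merge in Go. The layer queries are stubs standing in for the real metric-store, log, and metadata lookups, and the event shapes are assumptions made for the example.

package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Event is one observation from any layer, normalized for the timeline.
type Event struct {
	At      time.Time
	Layer   string // "hardware", "driver", "os", "workload"
	Message string
}

// layerQuery fetches the events relevant to an alert from one layer.
type layerQuery func(gpu string, around time.Time) []Event

// buildTimeline runs all layer queries concurrently and merges the results
// into a single time-ordered view that the diagnosis step reasons over.
func buildTimeline(gpu string, around time.Time, layers []layerQuery) []Event {
	var mu sync.Mutex
	var wg sync.WaitGroup
	var timeline []Event
	for _, q := range layers {
		wg.Add(1)
		go func(q layerQuery) {
			defer wg.Done()
			events := q(gpu, around)
			mu.Lock()
			timeline = append(timeline, events...)
			mu.Unlock()
		}(q)
	}
	wg.Wait()
	sort.Slice(timeline, func(i, j int) bool { return timeline[i].At.Before(timeline[j].At) })
	return timeline
}

func main() {
	now := time.Now()
	// Stub queries standing in for DCGM, driver-log, kernel-log, and scheduler lookups.
	hardware := func(gpu string, t time.Time) []Event {
		return []Event{{t.Add(-2 * time.Hour), "hardware", "GPU temp trending up from 74C baseline"}}
	}
	driver := func(gpu string, t time.Time) []Event {
		return []Event{{t.Add(-10 * time.Minute), "driver", "no Xid errors logged"}}
	}
	osLayer := func(gpu string, t time.Time) []Event {
		return []Event{{t.Add(-90 * time.Minute), "os", "IPMI: chassis fan 2 RPM below nominal"}}
	}
	workload := func(gpu string, t time.Time) []Event {
		return []Event{{t.Add(-3 * time.Hour), "workload", "mixed-precision training at 95% utilization started"}}
	}

	for _, e := range buildTimeline("node-17/gpu4", now, []layerQuery{hardware, driver, osLayer, workload}) {
		fmt.Printf("%s [%s] %s\n", e.At.Format(time.RFC3339), e.Layer, e.Message)
	}
}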
Runbook Execution
Once the SRE Agent identifies a root cause, it checks its runbook library for a matching remediation procedure. Runbooks are template-based action sequences with conditional logic, variable substitution, and approval gates.
A typical runbook for an ECC error remediation might include these steps: verify that the ECC error count is still rising (to rule out transient spikes), check whether the affected GPU is running a critical workload, gracefully drain the workload to another GPU if available, reset the GPU's ECC error counters, run a short diagnostic stress test, and either return the GPU to service or flag it for physical maintenance based on the stress test result.
Importantly, destructive or high-impact actions - node reboots, GPU resets, workload termination - require explicit approval gates. The SRE Agent will not reboot a node without either a pre-configured approval policy (e.g., "allow automatic reboot of non-production nodes during maintenance windows") or explicit human confirmation via the notification channel. The agent presents its full diagnosis and proposed action plan, and the human simply approves or rejects. This preserves human oversight for high-stakes actions while still eliminating the hours of manual diagnosis that precede them.
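One way to picture a runbook is as data: an ordered list of steps, each optionally gated on approval. The sketch below encodes the ECC runbook described above in that shape; the types and the approval policy are illustrative, not Factryze's actual runbook format.

package main

import "fmt"

// Step is one action in a runbook. High-impact steps carry an approval gate.
type Step struct {
	Name       string
	HighImpact bool // e.g. GPU reset, workload termination, node reboot
	Run        func() error
}

// Runbook is a named sequence of steps matched to a diagnosed root cause.
type Runbook struct {
	Trigger string // diagnosis pattern this runbook remediates
	Steps   []Step
}

// execute walks the steps in order. High-impact steps only run when the
// approval policy (or a human via the notification channel) says yes.
func execute(rb Runbook, approved func(Step) bool) {
	for _, s := range rb.Steps {
		if s.HighImpact && !approved(s) {
			fmt.Printf("SKIP  %s (awaiting approval)\n", s.Name)
			continue
		}
		if err := s.Run(); err != nil {
			fmt.Printf("FAIL  %s: %v\n", s.Name, err)
			return // stop and surface the failure rather than guessing
		}
		fmt.Printf("OK    %s\n", s.Name)
	}
}

func main() {
	noop := func() error { return nil } // stand-ins for real remediation actions
	eccRunbook := Runbook{
		Trigger: "rising single-bit ECC errors",
		Steps: []Step{
			{Name: "verify ECC error rate is still rising", Run: noop},
			{Name: "check whether the GPU runs a critical workload", Run: noop},
			{Name: "drain workload to a spare GPU", HighImpact: true, Run: noop},
			{Name: "reset ECC error counters", HighImpact: true, Run: noop},
			{Name: "run short diagnostic stress test", Run: noop},
			{Name: "return to service or flag for maintenance", Run: noop},
		},
	}
	// Example policy: only non-destructive steps run automatically.
	execute(eccRunbook, func(s Step) bool { return false })
}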
Every runbook execution is logged with full context: what triggered it, what the diagnosis was, what steps were executed, what the outcome was, and how long each step took. This execution history serves as both an audit trail and a training dataset for improving future diagnoses.
Example: Diagnosing an ECC Error Cascade
To illustrate the full detection-to-remediation pipeline, consider this real-world scenario. The NOC Agent detects a rising rate of single-bit ECC errors on GPU 4 in node-17. The error rate has increased from a baseline of 0-1 errors per hour to 23 errors in the last 15 minutes.
The SRE Agent activates and begins its cross-layer investigation. First, it queries thermal data and finds that GPU 4's temperature has been steadily climbing over the past two hours, currently sitting at 83 degrees C - below the critical threshold but well above its historical baseline of 74 degrees C for this workload. Next, it checks the other GPUs on the same node and finds that GPUs 3 and 5 (physically adjacent) are also running warmer than usual, suggesting an environmental cause rather than a GPU-specific defect.
The agent then correlates with IPMI data and discovers that two of the four chassis fans are reporting lower RPM than expected. It checks the workload layer and finds that the node is running a large-batch mixed-precision training job that pushes all 8 GPUs to sustained 95% utilization - a workload that generates significant heat.
The diagnosis: the combination of elevated workload intensity and degraded cooling capacity (partial fan failure) has pushed GPU 4 past its thermal comfort zone, causing ECC errors in thermally stressed HBM memory. GPUs 3 and 5 are trending in the same direction and will likely begin experiencing ECC errors within the next hour if nothing changes.
The SRE Agent executes a runbook: it migrates the training workload's GPU 4 partition to a spare GPU on another node, reduces the power limit on GPUs 3 and 5 by 15% as a temporary thermal mitigation measure, and flags node-17 for physical maintenance to address the fan issue. Total time from first ECC error detection to completed remediation: 94 seconds.
How Does the Performance Agent Optimize GPU Clusters?
The Performance Agent operates on a different timescale and with a different objective than the NOC and SRE agents. While those agents focus on detecting and fixing problems, the Performance Agent focuses on maximizing the value you extract from healthy GPUs. It analyzes utilization patterns across the entire cluster over time windows of hours to days, identifying structural inefficiencies that individually might seem minor but collectively can leave 30-40% of your GPU capacity on the table.
The Performance Agent does not respond to alerts. It runs continuously in the background, building and updating an optimization model of your cluster. When it identifies an actionable improvement, it publishes a recommendation - or, for low-risk optimizations that have been pre-approved, executes the change automatically.
Utilization Pattern Analysis
The most common source of wasted GPU capacity is not idle GPUs - it is GPUs that are allocated to jobs but underutilized. A data-parallel training job might allocate 8 GPUs but spend 30% of wall-clock time in data loading, gradient synchronization, or checkpoint writing, during which the GPUs are largely idle. The Performance Agent detects these patterns by analyzing DCGM_FI_DEV_GPU_UTIL and DCGM_FI_PROF_GR_ENGINE_ACTIVE over the duration of each job.
Once the agent identifies a utilization pattern - for example, a recurring 8-second idle gap every 45 seconds during a training loop - it can recommend specific changes. These might include increasing the number of data loader workers, enabling asynchronous gradient all-reduce, adjusting prefetch buffers, or co-scheduling a smaller inference workload during the idle gaps. The agent does not guess; it bases its recommendations on the observed metric patterns and the specific workload characteristics.
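A gap detector over a sampled utilization series can be as simple as the sketch below, which finds runs of consecutive low-utilization samples during an active job. The thresholds and the synthetic trace are illustrative; the real analysis also distinguishes workload phases.

package main

import "fmt"

// Gap is a contiguous stretch of samples where the GPU sat below an
// "effectively idle" utilization level during an active job.
type Gap struct {
	StartIdx, Length int
}

// findIdleGaps scans a utilization series (one sample per interval) and returns
// every run of at least minLen consecutive samples below idleBelow percent.
func findIdleGaps(util []float64, idleBelow float64, minLen int) []Gap {
	var gaps []Gap
	runStart, runLen := -1, 0
	flush := func() {
		if runLen >= minLen {
			gaps = append(gaps, Gap{runStart, runLen})
		}
		runStart, runLen = -1, 0
	}
	for i, u := range util {
		if u < idleBelow {
			if runLen == 0 {
				runStart = i
			}
			runLen++
		} else {
			flush()
		}
	}
	flush()
	return gaps
}

func main() {
	// Synthetic trace with 5-second samples: ~45 s of busy compute followed by
	// an ~10 s idle gap, repeating - the shape of a data-loading stall.
	var util []float64
	for cycle := 0; cycle < 3; cycle++ {
		for i := 0; i < 9; i++ {
			util = append(util, 95)
		}
		for i := 0; i < 2; i++ {
			util = append(util, 3)
		}
	}
	for _, g := range findIdleGaps(util, 10, 2) {
		fmt.Printf("idle gap: starts at sample %d, lasts %d samples (~%ds)\n", g.StartIdx, g.Length, g.Length*5)
	}
}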
Memory Fragmentation Detection
GPU memory fragmentation is an underappreciated source of inefficiency. When jobs allocate and free GPU memory in varying block sizes over time, the framebuffer can become fragmented - plenty of free memory in total, but no contiguous block large enough to satisfy a new allocation. The result is that a GPU with 30 GB of free memory (out of 80 GB total on an H100) might fail to allocate a 20 GB tensor because the free memory is scattered across dozens of small fragments.
The Performance Agent detects fragmentation by monitoring the relationship between DCGM_FI_DEV_FB_FREE (total free memory) and actual allocation success rates. When it detects a growing disparity - increasing free memory alongside increasing allocation failures - it flags the GPU as fragmented. It can then recommend or execute a memory compaction step during a natural break in the workload, or adjust job scheduling to reduce fragmentation-causing allocation patterns.
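The sketch below captures the fragmentation signal in its simplest form: free memory stays high while allocation failures accumulate. The window structure and thresholds are assumptions for illustration, and the allocation-failure counts would come from workload or runtime telemetry rather than DCGM.

package main

import "fmt"

// Window summarizes one observation period for a GPU: how much framebuffer
// memory was reported free and how many allocation attempts failed.
type Window struct {
	FreeMiB      float64 // from DCGM_FI_DEV_FB_FREE
	AllocsFailed int     // from workload/runtime telemetry
}

// looksFragmented flags the pattern described above: free memory is plentiful
// throughout the period, yet allocation failures keep occurring. Thresholds
// are illustrative knobs, not Factryze defaults.
func looksFragmented(history []Window, minFreeMiB float64, minFailures int) bool {
	if len(history) == 0 {
		return false
	}
	failures := 0
	for _, w := range history {
		if w.FreeMiB < minFreeMiB {
			return false // genuine memory pressure, not fragmentation
		}
		failures += w.AllocsFailed
	}
	return failures >= minFailures
}

func main() {
	// ~30 GB free the whole time, yet 20 GB allocations keep failing.
	history := []Window{
		{FreeMiB: 30720, AllocsFailed: 0},
		{FreeMiB: 30150, AllocsFailed: 1},
		{FreeMiB: 29800, AllocsFailed: 2},
		{FreeMiB: 30400, AllocsFailed: 2},
	}
	if looksFragmented(history, 20480, 3) {
		fmt.Println("GPU flagged as fragmented: recommend compaction at next workload break")
	}
}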
Thermal Throttling Prevention
Most GPU clusters configure static power limits as a safety measure against thermal throttling. A common approach is to set the power limit 15-20% below the GPU's maximum TDP, ensuring that even worst-case workloads will not trigger thermal throttling. This is safe but wasteful - it means every GPU in the cluster is permanently running at reduced performance, even when thermal conditions are favorable.
The Performance Agent replaces static power limits with dynamic thermal management. By monitoring DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_POWER_USAGE, and ambient temperature data in real-time, it adjusts power targets per GPU based on current thermal headroom. A GPU in a well-cooled rack running a moderate workload might be allowed to run at 95% of TDP, while a GPU in a warmer environment running a thermal-intensive workload might be restricted to 80%. The result is higher sustained aggregate throughput across the cluster, with the same or lower incidence of thermal throttling events compared to static limits.
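A stripped-down version of the idea: map current thermal headroom onto a bounded power target per GPU. The specific floor, ceiling, and linear mapping below are illustrative choices, not Factryze's actual policy, which also factors in ambient data and workload characteristics.

package main

import "fmt"

// dynamicPowerTarget converts current thermal headroom into a per-GPU power
// target, expressed as a fraction of TDP.
func dynamicPowerTarget(tempC, throttleTempC float64) float64 {
	const (
		floor   = 0.80 // never go below 80% of TDP in this sketch
		ceiling = 0.95 // never exceed 95% of TDP in this sketch
	)
	headroom := throttleTempC - tempC // degrees of margin before throttling
	// Map 0..20 degrees of headroom linearly onto the floor..ceiling range.
	frac := floor + (ceiling-floor)*(headroom/20)
	if frac < floor {
		frac = floor
	}
	if frac > ceiling {
		frac = ceiling
	}
	return frac
}

func main() {
	const tdpWatts = 700.0 // e.g. an H100 SXM5
	for _, temp := range []float64{62, 74, 83} {
		frac := dynamicPowerTarget(temp, 90)
		fmt.Printf("GPU at %.0fC -> power target %.0f%% of TDP (%.0f W)\n", temp, frac*100, frac*tdpWatts)
	}
}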
How Do the Three Agents Coordinate?
The three agents communicate through a shared event bus and a priority system that prevents duplicate work and ensures the right agent handles each situation. The flow follows a natural escalation pattern: the NOC Agent detects and publishes alerts, the SRE Agent subscribes to those alerts and publishes diagnoses and remediation actions, and the Performance Agent subscribes to both alert and remediation events to adjust its optimization model.
Priority levels ensure that agent actions do not conflict. When the SRE Agent is actively remediating an issue on a specific GPU, the Performance Agent will not attempt to modify power limits or workload assignments on that GPU until the remediation is complete. Similarly, if the Performance Agent has scheduled a memory compaction operation, the NOC Agent will adjust its anomaly baselines for that GPU to account for the expected metric changes during compaction.
This coordination is event-driven and loosely coupled. Each agent operates independently and makes its own decisions based on its own data and logic. The event bus provides awareness of what other agents are doing, not control over their behavior. This design ensures that a failure or restart of one agent does not affect the others - the NOC Agent continues detecting anomalies even if the SRE Agent is temporarily unavailable.
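The coordination pattern can be sketched as a small publish/subscribe loop. The in-process bus, the event kinds, and the "hold optimizations during remediation" rule below are simplified illustrations of the behavior described in this section, not the production implementation.

package main

import (
	"fmt"
	"sync"
)

// Event is anything one agent wants the others to be aware of.
type Event struct {
	Kind string // "detection", "diagnosis", "remediation_started", "remediation_done"
	GPU  string // e.g. "node-17/gpu4"
	Note string
}

// bus is a minimal in-process publish/subscribe fan-out. The real system uses
// a durable event bus, but the coordination idea is the same.
type bus struct {
	mu   sync.Mutex
	subs []chan Event
}

func (b *bus) subscribe() <-chan Event {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan Event, 32)
	b.subs = append(b.subs, ch)
	return ch
}

func (b *bus) publish(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		ch <- e
	}
}

func main() {
	var b bus

	// The Performance Agent tracks which GPUs are under active remediation and
	// defers power-limit or scheduling changes on them until remediation ends.
	perfAgent := b.subscribe()
	busy := map[string]bool{}

	// The NOC Agent publishes a detection; the SRE Agent announces remediation.
	b.publish(Event{"detection", "node-17/gpu4", "ECC SBE rate anomaly"})
	b.publish(Event{"remediation_started", "node-17/gpu4", "draining workload"})
	b.publish(Event{"remediation_done", "node-17/gpu4", "workload migrated, node flagged for maintenance"})

	for i := 0; i < 3; i++ {
		e := <-perfAgent
		switch e.Kind {
		case "remediation_started":
			busy[e.GPU] = true
		case "remediation_done":
			delete(busy, e.GPU)
		}
		fmt.Printf("perf agent saw %-20s on %s (hold optimizations: %v)\n", e.Kind, e.GPU, busy[e.GPU])
	}
}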
How Do You Deploy Factryze in Under 5 Minutes?
Deploying Factryze on your GPU cluster requires no changes to your existing infrastructure. The Go agent is distributed as a single static binary - no runtime dependencies, no containers required for the agent itself, no Python environments to manage. Installation is a one-line command:
curl -sfL https://get.factryze.ai | sh -s -- --server <your-factryze-instance>
The installer downloads the agent binary, creates a systemd service, and starts the agent. On first run, the agent auto-discovers all GPUs on the node via DCGM and registers them with your Factryze instance. Within 60 seconds of installation, GPU metrics are flowing and all three agents are active.
The Factryze backend itself runs as a set of Docker containers that you deploy on a management node within your network. The entire stack - backend services, VictoriaMetrics, agent coordination layer - runs behind your firewall with zero external dependencies. For air-gapped environments, we provide offline installation packages.
Factryze offers a free tier for small clusters to get you started with zero commitment. For larger deployments, visit our pricing page for details. To begin your deployment, reach out through our getting started page.
Frequently Asked Questions
What data does the Factryze agent collect?
The agent collects GPU metrics via NVIDIA DCGM, including temperature, utilization, memory usage, ECC errors, power draw, and PCIe/NVLink bandwidth. It also collects host-level context such as CPU and memory utilization. All data stays within your network - the agent communicates only with your on-prem Factryze instance, and no telemetry is sent externally.
Does Factryze work with Kubernetes and SLURM?
Yes. The Go agent runs on bare metal, Kubernetes nodes, and SLURM-managed clusters. It auto-discovers GPUs regardless of the orchestration layer and tags metrics with orchestrator-specific metadata (pod names in Kubernetes, job IDs in SLURM) for richer context during diagnosis.
Can I run Factryze without internet access?
Yes. Factryze is designed for air-gapped deployments. The entire stack - backend, metrics database, and agents - runs in Docker containers behind your firewall with zero external dependencies. Offline installation packages are available for environments where even the initial download must happen without internet access.
How does Factryze compare to Prometheus and Grafana?
Factryze complements rather than replaces Prometheus and Grafana. Those tools excel at metric collection, storage, and visualization. Factryze adds the autonomous detection, diagnosis, and remediation layer on top. Many teams run both: Prometheus and Grafana for dashboards and capacity planning, Factryze for automated incident response and optimization.
What GPUs does Factryze support?
Factryze supports any NVIDIA GPU that is compatible with DCGM, which includes the entire data center GPU lineup: A100, H100, H200, B100, B200, and their variants. The agent automatically detects the GPU model and adjusts its monitoring profiles, thresholds, and baselines accordingly.
Monitor your GPU cluster with Factryze
Deploy autonomous agents that detect, diagnose, and optimize GPU infrastructure - in under 5 minutes.