GPU Monitoring Tools Compared: nvidia-smi vs DCGM vs Custom Solutions
A complete comparison of GPU monitoring tools for data center operations: nvidia-smi for quick checks, DCGM for production monitoring, and autonomous agents for when dashboards stop being enough.
Every GPU cluster operator starts with the same question: which GPU monitoring tool should I use? The answer depends on your scale, your operational maturity, and how much downtime you can tolerate. GPU monitoring tools range from nvidia-smi for quick terminal checks to NVIDIA DCGM for production-grade telemetry pipelines, and each tool fills a specific gap in the observability stack. This guide compares the three tiers of GPU monitoring tooling, walks through the tradeoffs, and helps you decide when dashboards stop being enough.
What Is nvidia-smi and When Should You Use It?
nvidia-smi (NVIDIA System Management Interface) is the CLI tool that ships with every NVIDIA driver installation. If you have ever SSH'd into a GPU node and typed nvidia-smi, you have already used it. The tool provides a snapshot of GPU state: utilization percentage, memory usage, temperature, power draw, running processes, and driver version.
What nvidia-smi Does Well
nvidia-smi excels at quick, interactive debugging. When a training job hangs, running nvidia-smi tells you in seconds whether the GPU is still alive, whether memory is exhausted, or whether utilization has dropped to zero. You can also use nvidia-smi dmon for continuous monitoring at a configurable sampling interval, or nvidia-smi pmon to see per-process GPU usage.
Common nvidia-smi commands for operations:
- `nvidia-smi` - one-shot status snapshot
- `nvidia-smi dmon -s pucvmet -d 1` - continuous monitoring of power, utilization, clocks, violations, memory, encoder, and temperature at 1-second intervals
- `nvidia-smi -q -d TEMPERATURE,POWER,CLOCK,ECC` - detailed query of specific subsystems
- `nvidia-smi -pl 600` - set the power limit (useful for power capping)
- `nvidia-smi -r` - reset a GPU without rebooting the node
Where nvidia-smi Breaks Down
nvidia-smi is a polling tool. Every invocation spawns a process, queries the driver, and returns a text table. This model has three fundamental problems at scale.
First, polling overhead. Polling each of 1,000 GPUs with its own nvidia-smi invocation every 5 seconds means 12,000 process spawns per minute, each one contending for the GPU driver lock. At high polling frequencies, nvidia-smi itself can introduce measurable latency spikes on GPU operations.
Second, no persistent history. nvidia-smi shows you what is happening now, not what happened 30 minutes ago when the training job crashed. Without a time-series database behind it, you lose every data point the moment it scrolls off the terminal.
Third, text parsing fragility. Operators who script around nvidia-smi's tabular output with grep and awk discover that format changes between driver versions break their pipelines. The --query-gpu flag with CSV output is more stable, but you are still building a monitoring system from shell scripts.
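If you do end up scripting around nvidia-smi, the structured query interface is the least fragile option. A minimal sketch - the field names here are standard query fields, but check `nvidia-smi --help-query-gpu` for your driver version:

```bash
# One CSV row per GPU: index, SM utilization, memory used, temperature, power draw
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu,power.draw \
           --format=csv,noheader,nounits
```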
For a single node or a handful of development GPUs, nvidia-smi is perfectly adequate. For anything resembling a production cluster, you need something designed for continuous telemetry collection.
What Is DCGM and How Does It Differ from nvidia-smi?
NVIDIA DCGM (Data Center GPU Manager) is the monitoring tool built specifically for data center GPU fleets. Unlike nvidia-smi, DCGM runs as a persistent daemon (nv-hostengine) that continuously collects GPU metrics through an efficient shared-memory interface, avoiding the per-query overhead that makes nvidia-smi unsuitable for high-frequency monitoring.
DCGM Field IDs: The Metric Vocabulary
DCGM organizes its telemetry around field IDs - over 200 numeric identifiers, each mapping to a specific GPU metric. The most operationally important field IDs include:
Utilization and performance:
- `DCGM_FI_DEV_GPU_UTIL` - GPU streaming multiprocessor utilization (0-100%)
- `DCGM_FI_DEV_MEM_COPY_UTIL` - memory controller utilization
- `DCGM_FI_DEV_SM_CLOCK` - current SM clock frequency in MHz
- `DCGM_FI_DEV_MEM_CLOCK` - current memory clock frequency
Thermal and power:
- `DCGM_FI_DEV_GPU_TEMP` - GPU die temperature in Celsius
- `DCGM_FI_DEV_MEMORY_TEMP` - HBM temperature
- `DCGM_FI_DEV_POWER_USAGE` - current power draw in watts
- `DCGM_FI_DEV_CLOCK_THROTTLE_REASONS` - bitmask of active throttle causes
Reliability and errors:
- `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` - volatile single-bit ECC errors
- `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` - volatile double-bit ECC errors
- `DCGM_FI_DEV_XID_ERRORS` - last Xid error code
- `DCGM_FI_DEV_PCIE_REPLAY_COUNTER` - PCIe replay errors
- `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` - aggregate NVLink throughput
Memory health:
- `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` - framebuffer allocation
- `DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS` - rows remapped due to correctable errors
- `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS` - rows remapped due to uncorrectable errors
- `DCGM_FI_DEV_ROW_REMAP_FAILURE` - whether row remapping has been exhausted
The full field ID reference is available in the NVIDIA DCGM documentation.
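Before wiring field IDs into a pipeline, you can watch them interactively with `dcgmi dmon`. A quick sketch - the numeric IDs (150 for GPU temperature, 155 for power usage, 203 for GPU utilization) are taken from the DCGM headers and worth confirming against your installed version:

```bash
# Stream temperature, power, and utilization for every GPU on the node,
# sampling once per second (-d is the delay in milliseconds) for 30 samples
dcgmi dmon -e 150,155,203 -d 1000 -c 30
```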
DCGM Diagnostic Levels
Beyond continuous monitoring, DCGM includes built-in hardware diagnostics that validate GPU health between jobs:
- Level 1 (Quick, ~30 seconds): Validates driver state and basic GPU responsiveness. Run this in Slurm prolog scripts between every job.
- Level 2 (Medium, ~2 minutes): Adds targeted stress tests for memory and compute paths.
- Level 3 (Extended, ~12+ minutes): Exhaustive memory and compute validation. Use this for qualifying GPUs after maintenance or RMA replacement.
These diagnostics catch degraded GPUs before they waste hours of training time - a capability that nvidia-smi simply does not offer.
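The diagnostics are driven from the dcgmi CLI, assuming the DCGM host engine is already running on the node:

```bash
# Level 1: quick sanity check, suitable for a Slurm prolog or epilog script
dcgmi diag -r 1

# Level 3: extended validation after maintenance or an RMA swap
# (budget 12+ minutes and run it with the node drained)
dcgmi diag -r 3
```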
How Does the DCGM + Prometheus + Grafana Stack Work?
The standard production monitoring stack for GPU clusters combines three open-source components into a telemetry pipeline.
How the Stack Works
DCGM Exporter is a lightweight container that queries DCGM field IDs at a configurable interval (default: 30 seconds) and exposes them as Prometheus-compatible metrics on an HTTP endpoint (typically port 9400). Each metric is labeled with GPU index, UUID, model name, and hostname, enabling per-GPU and per-node aggregation.
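A quick way to confirm what the exporter is actually publishing is to hit the endpoint directly; the hostname below is a placeholder:

```bash
# List the thermal metrics currently exposed for this node
curl -s http://gpu-node-01:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP
```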
Prometheus scrapes the DCGM Exporter endpoint, stores the time-series data, and evaluates alerting rules. You define alert thresholds in PromQL - for example, DCGM_FI_DEV_GPU_TEMP > 85 to fire when any GPU exceeds 85 degrees Celsius.
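In practice that threshold lives in a Prometheus rules file rather than an ad-hoc query. A minimal illustrative rule - the file path and label names (gpu, Hostname) are assumptions to verify against the labels your exporter emits:

```bash
# Write an illustrative alerting rule; adjust the path to your Prometheus rules directory
cat > /etc/prometheus/rules/gpu-thermal.yml <<'EOF'
groups:
  - name: gpu-thermal
    rules:
      - alert: GPUTooHot
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 85C for 5 minutes"
EOF
```

The `for: 5m` clause is one simple way to keep brief thermal spikes from paging anyone.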
Grafana provides the visualization layer: dashboards showing GPU utilization heatmaps, temperature trends, ECC error accumulation, and power consumption across the fleet.
Deploying the Stack
On a Kubernetes-based GPU cluster, the deployment looks like this:
- Deploy DCGM Exporter as a DaemonSet so it runs on every GPU node
- Configure the metrics CSV to include the field IDs you need (the default `dcp-metrics-included.csv` covers the essentials)
- Add a Prometheus scrape config targeting the DCGM Exporter service
- Import or build Grafana dashboards for GPU fleet visibility
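One common route for the DaemonSet step is the dcgm-exporter Helm chart. A sketch, assuming Helm and the NVIDIA device plugin are already in place; the chart repository URL is taken from the dcgm-exporter project and worth double-checking against the current README:

```bash
# Install DCGM Exporter as a DaemonSet via its Helm chart
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter
```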
The resource overhead is minimal - DCGM Exporter uses roughly 50-100 MB of RAM per node and less than 5% CPU. For Slurm-managed clusters, DCGM Exporter runs as a systemd service instead of a DaemonSet, with the same Prometheus scraping model.
What You Get
With this stack operational, you have continuous GPU telemetry with configurable retention (days to months), threshold-based alerting through Prometheus Alertmanager, historical trend analysis for capacity planning, and per-GPU drill-down for incident investigation.
This is a significant upgrade over nvidia-smi. You now have history, automation, and visualization. For many teams, this stack is sufficient for the first 100-500 GPUs.
Where Does Dashboard-Based GPU Monitoring Break Down?
The DCGM + Prometheus + Grafana stack is the industry standard, and it works well for what it does. But it has structural limitations that become painful as clusters scale beyond a few hundred GPUs.
The Alert Fatigue Problem
Static threshold alerts generate noise at scale. Set GPU_TEMP > 85 and you get paged every time a GPU briefly spikes during a burst workload, even though it cools back down in 30 seconds. Set GPU_TEMP > 90 and you miss the gradual thermal creep that indicates a failing fan. There is no threshold that works for every GPU, every workload, and every ambient condition.
At 1,000 GPUs with 10 alert rules each, you are evaluating 10,000 conditions every scrape interval. The ops team quickly learns to ignore the dashboard, which defeats the purpose of monitoring.
Correlation Blindness
Dashboards show individual metrics in isolation. A GPU at 87 degrees, 95% utilization, and 680W power draw might be perfectly normal under a heavy training workload, or it might be a GPU that is thermally throttling because its neighbor's fan failed. The dashboard cannot tell the difference because it does not correlate metrics across GPUs, nodes, and environmental conditions.
The hardest failures to catch are the ones that only become visible when you correlate multiple signals. A slowly rising ECC error rate on one GPU, combined with slightly elevated temperature and a 3% drop in NVLink bandwidth, tells a story of imminent hardware failure. A dashboard shows three normal-looking graphs. An experienced SRE sees the pattern. But experienced SREs are asleep at 3 AM.
The Investigation Gap
When an alert fires, the real work begins. The on-call engineer must SSH into the node, check dmesg for Xid errors, query DCGM for detailed diagnostics, check if the GPU is part of a multi-node training job, determine whether to drain the node or reset the GPU, and document the incident. This investigation and diagnosis phase accounts for 60-70% of the total incident resolution time.
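For context, the manual version of that investigation typically looks something like the following; the node name and GPU index are placeholders:

```bash
# Typical on-call triage after a GPU alert fires
ssh gpu-node-17
dmesg -T | grep -i xid                      # driver-reported Xid events
nvidia-smi -q -i 0 -d ECC,PAGE_RETIREMENT   # ECC counters and retired pages for GPU 0
dcgmi diag -r 1                             # quick hardware health validation
squeue -w gpu-node-17                       # which jobs are still scheduled here? (Slurm)
```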
Dashboards are observation tools, not investigation tools. They tell you something is wrong. They do not tell you what to do about it.
When Do You Need Autonomous GPU Monitoring Agents?
The gap between "something is wrong" and "the problem is fixed" is where autonomous GPU monitoring agents operate. Instead of displaying metrics for a human to interpret, agents ingest the same DCGM telemetry and act on it directly.
What Changes with Autonomous Monitoring
Detection moves from static thresholds to multi-signal anomaly detection. Instead of alerting on GPU_TEMP > 85, an autonomous NOC agent compares each GPU's thermal profile against its peers, its own historical baseline, and the current workload intensity. A GPU running 5 degrees hotter than its siblings under the same workload is anomalous regardless of the absolute temperature value.
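As a toy illustration of the peer-comparison idea - this is only the single-node intuition, not the agent's actual model:

```bash
# Flag any GPU running more than 5 degrees C above the node's average temperature
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
awk -F', ' '{ idx[NR] = $1; t[NR] = $2; sum += $2 }
     END {
       avg = sum / NR
       for (i = 1; i <= NR; i++)
         if (t[i] + 0 > avg + 5)
           printf "GPU %s: %sC, %.0fC above node average of %.0fC\n", idx[i], t[i], t[i] - avg, avg
     }'
```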
Diagnosis becomes automated cross-correlation. When the agent detects an anomaly, it does not page a human. It queries dmesg for Xid events, checks NVLink error counters, inspects PCIe replay rates, and correlates the timeline across all signals to identify root cause. The investigation that takes a human 20-30 minutes happens in seconds.
Remediation follows executable runbooks. Once root cause is identified, the SRE agent executes the appropriate fix: drain the node, reset the GPU, apply a power cap, or escalate to hardware replacement. Each action is logged, reversible, and validated with a post-action health check.
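The individual steps in such a runbook are ordinary operator commands; what changes is who decides to run them and who validates the result. An illustrative Slurm-based sequence, with hostname, GPU index, and Xid code as placeholders:

```bash
# Drain the node so the scheduler stops placing new jobs on it
scontrol update NodeName=gpu-node-17 State=DRAIN Reason="GPU0 Xid 79 - pending reset"

# Reset the affected GPU once no processes are using it
nvidia-smi -r -i 0

# Post-action health check, then return the node to service
dcgmi diag -r 1
scontrol update NodeName=gpu-node-17 State=RESUME
```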
The result is MTTR compression from the industry average of 47 minutes down to under 2 minutes. Not because the fix is different, but because the wait time between detection, diagnosis, and action collapses to near zero.
The Monitoring Tool Maturity Curve
Most GPU operations teams follow a predictable progression:
- nvidia-smi - manual checks, works for 1-10 GPUs
- DCGM + Prometheus + Grafana - continuous telemetry with dashboards and alerts, works for 10-500 GPUs
- Autonomous agents - AI-driven detection, diagnosis, and remediation, necessary at 500+ GPUs
The jump from stage 2 to stage 3 is not about collecting more data. You already have all the data you need from DCGM. It is about what happens after the data is collected - replacing human investigation loops with autonomous action.
How Do You Choose the Right GPU Monitoring Tool?
The choice is not either/or. Each tool layer builds on the one below it.
| Capability | nvidia-smi | DCGM + Prometheus + Grafana | Autonomous Agents |
|---|---|---|---|
| Quick spot checks | Yes | Overkill | Overkill |
| Continuous telemetry | No | Yes | Yes |
| Historical trends | No | Yes | Yes |
| Alert-based notification | No | Yes | Yes |
| Multi-signal correlation | No | Manual | Automatic |
| Automated diagnosis | No | No | Yes |
| Automated remediation | No | No | Yes |
| Scales past 500 GPUs | No | Barely | Yes |
If you are running a small research cluster, nvidia-smi and some scripting will get you far. If you are running production training or inference at moderate scale, the DCGM + Prometheus + Grafana stack is the right investment. If you are operating hundreds or thousands of GPUs where downtime costs thousands of dollars per minute, you need monitoring that acts, not just observes.
Factryze builds on the DCGM telemetry foundation with autonomous NOC, SRE, and Performance agents that close the loop from detection to resolution. You can explore how this works for your cluster on our pricing page or reach out to discuss your setup.
Frequently Asked Questions
Is nvidia-smi sufficient for monitoring a production GPU cluster?
No. nvidia-smi is a point-in-time polling tool that provides no history, no alerting, and no automation. It is useful for interactive debugging but creates overhead at high polling frequencies and loses data between queries. Production clusters need DCGM or an equivalent continuous telemetry pipeline that feeds into a time-series database.
How much overhead does DCGM Exporter add to GPU workloads?
DCGM Exporter uses approximately 50-100 MB of RAM and less than 5% CPU per node. The DCGM host engine (nv-hostengine) collects metrics through a shared-memory interface with negligible GPU overhead - far less than polling nvidia-smi at equivalent frequencies.
Can I use both nvidia-smi and DCGM together?
Yes. DCGM and nvidia-smi can run simultaneously since they both query the same underlying NVIDIA driver. In practice, teams use DCGM for continuous automated monitoring and nvidia-smi for ad-hoc debugging sessions when SSH'd into a specific node. The only consideration is to avoid high-frequency nvidia-smi polling on nodes where DCGM is already collecting telemetry, since both contend for the same driver lock.
When should I consider moving beyond dashboard-based monitoring?
The inflection point typically comes when alert fatigue sets in, when your on-call team spends more time investigating false positives than real incidents, or when your cluster grows past 200-500 GPUs. If your MTTR exceeds 30 minutes and the bottleneck is investigation time rather than fix time, that is a strong signal that you need automated diagnosis and remediation rather than more dashboard panels.
Monitor your GPU cluster with Factryze
Deploy autonomous agents that detect, diagnose, and optimize GPU infrastructure - in under 5 minutes.