Slurm
Open-source HPC workload manager scheduling GPU cluster jobs via srun, sbatch, and squeue.
What it is
Slurm is an open-source, highly scalable workload manager and job scheduler used on many of the largest HPC and GPU training clusters. It provides batch job submission via sbatch, interactive execution via srun, and queue and cluster inspection via squeue and sinfo. GPU-aware scheduling comes through the GRES (generic resources) plugin, which advertises GPU type, count, and MIG profiles as schedulable resources. Prolog and epilog scripts run automatically before and after each job, which enables automated GPU health checks between workloads.
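A minimal sketch of a batch submission that requests GPUs through the GRES plugin; the partition name, GPU type, and training script below are illustrative placeholders:

    #!/bin/bash
    #SBATCH --job-name=train
    #SBATCH --partition=gpu            # hypothetical partition name
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --gres=gpu:a100:4          # four GPUs of GRES type "a100" per node
    #SBATCH --time=04:00:00
    srun python train.py               # srun launches one task per allocated slot

For interactive work, srun --gres=gpu:1 --pty bash allocates a single GPU and opens a shell on the compute node. Because a prolog that exits nonzero drains the node and requeues the job, a health check can gate every allocation; a sketch, assuming ECC-enabled NVIDIA GPUs with nvidia-smi available on each node:

    #!/bin/bash
    # Hypothetical /etc/slurm/prolog.sh, run by slurmd on each node before a job starts.
    # A nonzero exit puts the node in DRAIN state and requeues the job.
    # Take the worst uncorrected-ECC count across all GPUs; "[N/A]" values
    # on non-ECC GPUs would need extra handling.
    max_ecc=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
              --format=csv,noheader,nounits | sort -n | tail -n 1)
    [ "${max_ecc:-0}" -eq 0 ] || exit 1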
Why it matters
Slurm's scheduling decisions determine which GPUs receive which workloads: a node with early ECC degradation will keep receiving jobs until it is explicitly drained or fails outright. Without integration between GPU health telemetry and Slurm drain state, degraded hardware continues to receive new allocations. Topology-aware placement via the --switches directive can be the difference between peak NCCL throughput and a 25-30% degradation for large multi-node jobs.
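For example, a job can ask the scheduler to place all of its nodes under a single leaf switch, waiting a bounded time before the constraint is relaxed; the node count, GPU count, and wait time below are illustrative:

    #!/bin/bash
    #SBATCH --nodes=16
    #SBATCH --gres=gpu:8
    #SBATCH --switches=1@02:00:00   # prefer one leaf switch; relax after a 2-hour wait
    srun python train.py            # training script is a placeholder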
How to monitor
Monitor Slurm node state via sinfo and scontrol show node, watching for DRAIN, DOWN, and ALLOCATED states. Jobs stuck in the PD (pending) state may indicate a gang-scheduling deadlock. Factryze integrates directly with Slurm job lifecycle hooks, feeding real-time GPU health telemetry into drain and reservation decisions so degraded nodes are moved to DRAIN state before the scheduler assigns new workloads.
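A few commands cover the basics; the node name gpu-17 is a placeholder:

    # Nodes that are down or drained, with the recorded reason
    sinfo -R
    # Full state for one node, including its GRES and state flags
    scontrol show node gpu-17
    # Pending jobs, with the scheduler's reason for holding them
    squeue --states=PD --format="%.10i %.9P %.20j %.8T %R"
    # Proactively drain a degraded node before new work lands on it
    scontrol update NodeName=gpu-17 State=DRAIN Reason="ECC errors rising"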
Related terms
Job scheduling: Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
Gang scheduling: Atomic co-scheduling of all GPUs for distributed training that requires a synchronized start.
Node draining: Gracefully removing a node from scheduling via kubectl drain or the Slurm DRAIN state.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.