Slurm
Open-source HPC workload manager scheduling GPU cluster jobs via srun, sbatch, and squeue.
What it is
Slurm is an open-source, highly scalable workload manager and job scheduler used on many of the largest HPC and GPU training clusters. It provides batch job submission via sbatch, interactive execution via srun, and queue and cluster inspection via squeue and sinfo. GPU-aware scheduling comes through the GRES (generic resources) plugin, which advertises GPU type, count, and MIG profiles as schedulable resources. Prolog and epilog scripts run automatically before and after each job, which enables automated GPU health checks between workloads.
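A minimal sketch of a batch submission that requests GPUs through the GRES plugin; the partition name, GPU type, and training script below are illustrative placeholders:

    #!/bin/bash
    #SBATCH --job-name=train
    #SBATCH --partition=gpu            # hypothetical partition name
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --gres=gpu:a100:4          # four GPUs of GRES type "a100" per node
    #SBATCH --time=04:00:00
    srun python train.py               # srun launches one task per allocated slot

For interactive work, srun --gres=gpu:1 --pty bash allocates a single GPU and opens a shell on the compute node. Because a prolog that exits nonzero drains the node and requeues the job, a health check can gate every allocation; a sketch, assuming ECC-enabled NVIDIA GPUs with nvidia-smi available on each node:

    #!/bin/bash
    # Hypothetical /etc/slurm/prolog.sh, run by slurmd on each node before a job starts.
    # A nonzero exit puts the node in DRAIN state and requeues the job.
    # Take the worst uncorrected-ECC count across all GPUs; "[N/A]" values
    # on non-ECC GPUs would need extra handling.
    max_ecc=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
              --format=csv,noheader,nounits | sort -n | tail -n 1)
    [ "${max_ecc:-0}" -eq 0 ] || exit 1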
Why it matters
Slurm's scheduling decisions determine which GPUs receive which workloads: a node with early ECC degradation will keep receiving jobs until it is explicitly drained or fails outright. Without integration between GPU health telemetry and Slurm drain state, degraded hardware continues to receive new allocations. Topology-aware placement via the --switches directive can be the difference between peak NCCL throughput and a 25-30% degradation for large multi-node jobs.
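For example, a job can ask the scheduler to place all of its nodes under a single leaf switch, waiting a bounded time before the constraint is relaxed; the node count, GPU count, and wait time below are illustrative:

    #!/bin/bash
    #SBATCH --nodes=16
    #SBATCH --gres=gpu:8
    #SBATCH --switches=1@02:00:00   # prefer one leaf switch; relax after a 2-hour wait
    srun python train.py            # training script is a placeholder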
How to monitor
Monitor Slurm node state via sinfo and scontrol show node, watching for DRAIN, DOWN, and ALLOCATED states. Jobs stuck in the PD (pending) state may indicate a gang-scheduling deadlock. Factryze integrates directly with Slurm job lifecycle hooks, feeding real-time GPU health telemetry into drain and reservation decisions so degraded nodes are moved to DRAIN state before the scheduler assigns new workloads.
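A few commands cover the basics; the node name gpu-17 is a placeholder:

    # Nodes that are down or drained, with the recorded reason
    sinfo -R
    # Full state for one node, including its GRES and state flags
    scontrol show node gpu-17
    # Pending jobs, with the scheduler's reason for holding them
    squeue --states=PD --format="%.10i %.9P %.20j %.8T %R"
    # Proactively drain a degraded node before new work lands on it
    scontrol update NodeName=gpu-17 State=DRAIN Reason="ECC errors rising"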
Related terms
Job scheduling: Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
Gang scheduling: Atomic co-scheduling of all GPUs for distributed training that requires a synchronized start.
Node draining: Gracefully removing a node from scheduling via kubectl drain or the Slurm DRAIN state.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.