Gang Scheduling
Atomic co-scheduling of all GPUs for distributed training jobs that require a synchronized start.
What it is
Gang scheduling ensures that all resources required by a multi-GPU distributed training job -- GPUs, CPUs, memory, and network bandwidth -- are allocated atomically: either the entire resource set is available simultaneously or the job waits. Kubernetes implements this via Volcano (a PodGroup with a minMember constraint) or Kueue (all-or-nothing Workload admission). Slurm allocates each job's full resource set atomically by design; its backfill scheduler (sched/backfill) additionally reserves resources at a future time for large pending jobs so they are not starved by smaller ones.
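In Kubernetes, the gang constraint is expressed declaratively. A minimal sketch of a Volcano PodGroup follows; the job name, gang size, and GPU count are illustrative assumptions, not values from this article:

```yaml
# Sketch: a Volcano PodGroup that admits a 16-pod training job
# only when all 16 pods (128 GPUs total) can be placed at once.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: train-128gpu          # hypothetical job name
spec:
  minMember: 16               # gang size: schedule all 16 pods or none
  minResources:
    nvidia.com/gpu: "128"     # total GPUs the gang must secure atomically
```

Worker pods then opt in to Volcano scheduling (for example, by setting the Volcano scheduler and referencing the PodGroup), so no pod starts until the whole gang fits.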
Why it matters
NCCL communicator initialization (ncclCommInitRank) requires all ranks to be present at startup; any rank that fails to join before the configured timeout expires (for example, the framework's process-group timeout in PyTorch) aborts the entire job. Without gang scheduling, partial allocations create deadlocks -- two 128-GPU jobs on a 248-GPU cluster can each acquire roughly half the GPUs, so neither reaches 128 and all 248 GPUs sit idle indefinitely. Gang scheduling violations are difficult to detect because stuck jobs still appear RUNNING in the scheduler.
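The deadlock arithmetic can be modeled with a toy allocator; this is an illustrative sketch (the allocator, job names, and round-robin interleaving are assumptions), not any scheduler's actual algorithm:

```python
def incremental_alloc(total_gpus, jobs):
    """Greedy, non-atomic allocation: jobs claim GPUs one at a time,
    interleaved round-robin, as in a scheduler without gang semantics."""
    free = total_gpus
    held = {name: 0 for name in jobs}
    progress = True
    while free > 0 and progress:
        progress = False
        for name, need in jobs.items():
            if free > 0 and held[name] < need:
                held[name] += 1
                free -= 1
                progress = True
    running = [name for name, need in jobs.items() if held[name] == need]
    # GPUs held by jobs that never reached their full request do no work.
    idle = sum(held[name] for name in held if name not in running)
    return running, idle


def gang_alloc(total_gpus, jobs):
    """Atomic (gang) allocation: a job starts only if its entire
    request fits; otherwise it holds nothing and waits."""
    free = total_gpus
    running = []
    for name, need in jobs.items():
        if need <= free:
            free -= need
            running.append(name)
    return running, free
```

With two 128-GPU jobs on 248 GPUs, the greedy allocator leaves both jobs stuck at 124 GPUs each (all 248 idle), while the gang allocator starts one job and leaves the other cleanly queued.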
How to monitor
Monitor NCCL communicator-initialization log messages at startup to confirm that every rank joined before the configured timeout. Watch for jobs that start but show near-zero GPU utilization (DCGM_FI_DEV_GPU_UTIL) across all ranks for more than a few minutes after launch; that pattern indicates a partial-start hang. Factryze monitors NCCL communicator initialization and immediately flags partial-start failures that indicate a gang scheduling violation.
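The utilization check above can be sketched as a small detector. This assumes per-rank utilization samples (e.g., scraped from DCGM_FI_DEV_GPU_UTIL) have already been collected; the function name, sample shape, and thresholds are illustrative:

```python
def is_partial_start_hang(util_by_rank, minutes_since_launch,
                          grace_minutes=5, util_threshold=5.0):
    """Flag a likely gang-scheduling violation: the job is past its
    startup grace period, yet every rank shows near-zero GPU utilization.

    util_by_rank: {rank: [utilization_percent, ...]} sampled since launch
                  (the sampling plumbing is assumed, not shown here).
    """
    if not util_by_rank:
        return False
    # Too early to judge: ranks may still be loading data or initializing.
    if minutes_since_launch < grace_minutes:
        return False
    # A healthy job has at least one rank doing real work; a partial-start
    # hang shows near-zero utilization on every rank simultaneously.
    return all(
        max(readings, default=0.0) < util_threshold
        for readings in util_by_rank.values()
    )
```

A job with even one busy rank is not flagged, which matches the failure mode described above: in a partial-start hang, all ranks block in communicator initialization together.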
Related terms
Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
Open-source HPC workload manager scheduling GPU cluster jobs via srun, sbatch, and squeue.
Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.
Get Started Free