Slurm Gang Scheduling

Gang scheduling reserves every node a multi-rank job needs and starts all ranks at once. Without it, the ranks that launch early deadlock in MPI_Init waiting for peers that are still queued, burning GPU-hours on an idle partial reservation.
  • Mechanism: Slurm gang scheduler plugin
  • Failure: MPI_Init deadlock without gang
  • Knob: SchedulerType=sched/gang or backfill

Multi-node training jobs are not normal jobs. A 64-rank PyTorch job does not start when one rank starts; it starts when every rank starts. Until rank 63 calls torch.distributed.init_process_group, ranks 0 through 62 sit in MPI_Init (or NCCL's bootstrap) waiting for the rendezvous to complete. If the scheduler hands out one rank at a time, ranks that started early sit idle while later ranks are still queued. Worse, if the scheduler hands out only some ranks and the rest never land, the job hangs while the partial allocation burns through its wall-clock budget.

Gang scheduling is the rule that fixes this: do not start any rank until every rank can start.

What gang scheduling actually does

In Slurm, gang scheduling is a SchedulerType=sched/gang plugin (or behavior baked into sched/backfill with time-slicing knobs). When a job arrives requesting N nodes, Slurm holds the request until N nodes are simultaneously available, then dispatches all N rank processes at once via srun. The scheduling decision is binary: either the gang starts in full, or it waits.
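As a rough sketch of where the knob lives (option names vary across Slurm releases, and newer versions express gang behavior through PreemptMode rather than a standalone sched/gang plugin, so treat these lines as illustrative rather than a drop-in config):

    # slurm.conf (illustrative sketch, not a drop-in config)
    SchedulerType=sched/backfill                      # older releases also offered sched/gang
    PreemptMode=GANG,SUSPEND                          # current releases spell gang scheduling here
    SchedulerParameters=bf_continue,bf_interval=30    # backfill loop settings (values illustrative)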

The cost: idle time on the partial reservation. If 7 of 8 nodes are free and the 8th is finishing another job, those 7 GPUs sit idle waiting. The scheduler's bookkeeping treats them as reserved (no other job can claim them) but unproductive. For short jobs this is cheap; for jobs that wait minutes for the gang to materialize, the idle cost can be a large fraction of the wall clock. The benefit: zero deadlocks, predictable startup, and the job actually does work once it starts.

[Figure: startup timelines for ranks r0, r1, r2. Without gang scheduling, the ranks stall in MPI_Init; with gang scheduling, all ranks start together at t=12 s. X-axis: time (s).]

Why MPI_Init is the killer without gang

Every multi-rank framework (PyTorch DDP, Horovod, JAX pjit, Megatron) bootstraps over a collective communication layer that requires all ranks to participate before any can proceed. The bootstrap protocol is approximately:

  1. Each rank registers with a coordinator (NCCL's bootstrap, PMIx for MPI, or torch.distributed's TCP or file store).
  2. The coordinator waits for all N ranks to register.
  3. Once registered, ranks exchange addresses for the collective ring or tree topology.
  4. Only then can the first all_reduce, all_gather, or any collective happen.

Step 2 is where the deadlock lives. If the scheduler dispatched only ranks 0-15 of a 64-rank job and the other 48 are still queued, ranks 0-15 block at step 2 with no way to make progress. They consume their wall-clock allocation, then time out and fail. The user sees "MPI_Init timed out" or "NCCL rendezvous failed"; the underlying cause is the scheduler.

Without gang scheduling, the workaround is per-job: framework-specific timeouts, retry logic, application-level barriers. That is operationally messy and has to be reinvented by every team. Gang scheduling moves the fix into the scheduler so every job benefits.
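To make the rendezvous concrete, here is a hedged sbatch sketch (node counts, port, and script name are illustrative, not from this page): a single srun launches every rank of the gang, and each one must reach the same rendezvous endpoint before step 4 of the bootstrap can happen.

    #!/bin/bash
    #SBATCH --nodes=8
    #SBATCH --ntasks-per-node=1
    #SBATCH --gpus-per-node=8
    #SBATCH --time=04:00:00

    # Rank 0's node hosts the rendezvous; all eight nodes must reach it.
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

    srun torchrun \
        --nnodes="$SLURM_NNODES" \
        --nproc_per_node=8 \
        --rdzv_backend=c10d \
        --rdzv_endpoint="${MASTER_ADDR}:29500" \
        --rdzv_id="$SLURM_JOB_ID" \
        train.py

Under gang scheduling the srun step lands on all eight nodes at once, so the rendezvous sees all 64 workers within seconds instead of stalling at step 2.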

How Slurm gangs interact with backfill and fair-share

Slurm's gang scheduling does not exist in isolation. It composes with two other policies that shape who actually runs:

Backfill (SchedulerType=sched/backfill) lets short jobs slip past long-waiting jobs as long as they can finish before the reservation planned for the highest-priority queued job. Without backfill, a 1-hour job could wait an entire day behind a queued 24-hour job; with backfill, it runs immediately if a node is free for the next hour. Gang and backfill compose: backfill finds the next slot where N nodes are simultaneously free and dispatches the gang at that slot.
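A hedged sketch of the backfill-side knobs (values illustrative): bf_window bounds how far into the future the scheduler plans reservation slots, which is what lets it find the time when all N nodes are simultaneously free.

    # slurm.conf (illustrative values)
    SchedulerType=sched/backfill
    SchedulerParameters=bf_window=2880,bf_interval=60   # plan 48 h ahead, re-plan every 60 s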

Fair-share sets per-user or per-account priorities based on historical usage. A user who has consumed the lion's share of GPU-hours this week sees their priority dampened; the gang of a fresh user gets dispatched ahead of theirs.
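Fair-share itself is configured through the multifactor priority plugin; a hedged sketch with made-up weights:

    # slurm.conf (illustrative values)
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=7-0        # a user's recorded usage halves in influence every 7 days
    PriorityWeightFairshare=10000
    PriorityWeightAge=1000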

The interaction sometimes surprises operators: a high-priority gang can wait behind a lower-priority job whose nodes are all free now, simply because the gang requires more nodes to be simultaneously free. Adding --exclusive or smaller node-count requests usually fixes the visible scheduling stall, at the cost of fragmenting the fleet.

What goes wrong even with gang scheduling

Gang scheduling does not solve every multi-rank issue. The classic failure modes that survive:

  1. Stragglers within the gang. Even if every rank starts at t=0, if one node is thermally throttling or has a slow disk, that rank reaches the first collective late. Every other rank waits. See thermal stragglers for the operational angle.

  2. One rank crashes mid-job. When rank 17 crashes 4 hours into a 12-hour run, the scheduler does not have a notion of "restart this rank." The whole job typically dies. Frameworks that support elastic training (PyTorch elastic, Ray Train) recover, but most production jobs need a checkpoint and full restart. See gang failure.

  3. Topology blindness. Slurm gang scheduling guarantees that N nodes start simultaneously, but does not guarantee they share a switch or a rail. For training jobs that depend on intra-rack NVLink or single-rail InfiniBand, you also need topology-aware placement.
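A hedged sketch of the topology angle (switch and node names are made up): the topology/tree plugin describes the switch hierarchy, and a job can then ask to be packed under a bounded number of switches.

    # slurm.conf
    TopologyPlugin=topology/tree

    # topology.conf (fabric description; names illustrative)
    SwitchName=leaf01 Nodes=gpu[001-016]
    SwitchName=leaf02 Nodes=gpu[017-032]
    SwitchName=spine  Switches=leaf[01-02]

    # job side: ask for a single leaf switch, but drop that constraint after waiting 30 minutes
    sbatch --nodes=8 --switches=1@30 train.sbatch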

Practical guidance

  • For any multi-node training job over 8 nodes, gang scheduling is mandatory. Set SchedulerType=sched/gang or use sched/backfill with appropriate gang behavior in slurm.conf.
  • Pair gang with backfill to keep utilization up; pure gang without backfill leaves short jobs waiting unnecessarily.
  • Set realistic MaxJobCount and DefaultTime so the scheduler can plan reservation windows. Backfill's quality is only as good as its time estimates.
  • For Kubernetes-native shops, Volcano and Kueue are the K8s analogs. They implement the same gang semantics on top of kube-scheduler.
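Back on the Slurm side, a hedged job-script sketch that ties the first three bullets together (node count and time limit are illustrative):

    #SBATCH --nodes=16
    #SBATCH --wait-all-nodes=1   # do not start the batch script until every allocated node is ready
    #SBATCH --time=12:00:00      # a realistic limit so backfill can plan a window for the gang
    #SBATCH --exclusive          # whole nodes only, no sharing with stray jobs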

The takeaway: gang scheduling is the difference between a multi-node training fleet that runs and one that mostly waits. Slurm has it built in; Kubernetes needs Volcano or Kueue. Either way, the scheduler's job is to start every rank at once or none.

Updated 2026-05-10