Volcano Scheduler
The default kube-scheduler has no concept of a job that cannot make progress until every one of its Pods is running. It schedules Pods one at a time, in priority order, picking whichever node currently fits each Pod. For a stateless web service, that is exactly right. For a 64-Pod training job that deadlocks if 63 Pods are running and one is still pending, it is a recipe for stalled jobs and burned GPU-hours. Volcano is the CNCF project that fixes this by replacing the default scheduler for batch workloads with one that has gang semantics, queues, fair-share, and the rest of the HPC scheduling vocabulary.
What Volcano adds on top of kube-scheduler
Volcano does not cooperate with the default scheduler the way Kueue does; it fully takes over scheduling for any Pod that opts in via a schedulerName: volcano field, while everything else stays with kube-scheduler. Once a Pod is scheduled by Volcano, it gets access to a stack of features the default scheduler does not provide:
- PodGroup CRD with MinAvailable. Group N Pods together and tell Volcano "only schedule any of them when at least M can run together." For training jobs, M usually equals N (every rank or none).
- Queues. Workloads are dispatched from named queues with capacity limits and priorities. Useful for separating teams, environments (research vs production), or job types; a minimal queue definition is sketched after this list.
- Fair-share. Priorities are weighed against each queue's usage so high-volume teams do not starve smaller ones.
- Backfill. Like Slurm, Volcano slips short jobs past long-waiting reservations when there is room.
- Plugin architecture. Topology-awareness, SLA enforcement, NUMA alignment, and other policies plug in as scheduler plugins.
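To make the queue idea concrete, here is a minimal sketch of two Queue objects; the names, weights, and GPU cap are illustrative rather than recommended values:

```yaml
# Illustrative queue definitions (names, weights, and the GPU cap are assumptions).
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 4              # fair-share weight relative to other queues
  reclaimable: true      # capacity can be reclaimed when other queues are starved
  capability:
    nvidia.com/gpu: 64   # hard ceiling for this queue
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research
spec:
  weight: 1
  reclaimable: true
```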
The CRD that ties it together is Job (Volcano's, not Kubernetes' batch/v1 Job): one CRD describes the whole multi-rank workload, and Volcano materializes it as a PodGroup plus the right number of Pods.
How a Volcano job actually flows
A typical PyTorch DDP training job submitted to Volcano:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llama-pretrain-64gpu
spec:
  schedulerName: volcano
  minAvailable: 8
  queue: training
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: pytorch
              image: ml/pytorch:2.4
              resources:
                limits:
                  nvidia.com/gpu: 8
```

What happens at runtime:
1. The Volcano controller creates a PodGroup with minMember: 8 (mirroring the Job's minAvailable) and 8 child Pods, none of which is bound to a node yet.
2. The Volcano scheduler enters its session loop. Each session it walks the training queue, picks the next PodGroup, and asks "are 8 nodes simultaneously free that can each run one of these Pods?"
3. If yes, all 8 Pods are bound atomically. The Pods start, NCCL bootstraps, training begins.
4. If no, the PodGroup waits. Other smaller jobs in the queue may slip past via backfill.
5. If a Pod is later evicted (preemption, node failure), the RestartJob policy kicks in and the entire group restarts from a checkpoint.
The atomicity in step 3 is the entire point. Default kube-scheduler would have bound Pods 1-7 the moment they fit and left Pod 8 pending; Volcano refuses to bind any until all 8 fit.
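For orientation, the PodGroup the controller creates for this Job looks roughly like the sketch below; the real object name carries the Job's UID and the exact fields vary by Volcano version:

```yaml
# Approximately what the controller materializes for the Job above (sketch;
# the live object's name and annotations are generated by the controller).
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llama-pretrain-64gpu
spec:
  minMember: 8            # gang threshold: bind all 8 worker Pods or none
  queue: training
  minResources:
    nvidia.com/gpu: 64    # 8 Pods x 8 GPUs each
```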
Where Volcano competes with Kueue and KAI Scheduler
Volcano is one of three CNCF-adjacent gang schedulers commonly deployed in production:
- Volcano is the oldest, most feature-rich, and the one most teams reach for if they need everything in one project. It is also the heaviest: a full custom scheduler to run and operate.
- Kueue is newer, built by the Kubernetes SIG, and integrates more cleanly with the default scheduler (Kueue handles admission and queueing; kube-scheduler handles binding). Lighter weight.
- KAI Scheduler (NVIDIA, formerly Run:ai) bundles fair-share, gang, and GPU-aware fragmentation policies, and ships as part of NVIDIA's enterprise stack.
Pick on operational fit. If your shop needs hard gang semantics with queues and fair-share and you are willing to swap out the default scheduler, Volcano is the most mature. If you want gang admission without replacing kube-scheduler, Kueue is lighter. If you are already paying for NVIDIA enterprise software, KAI is integrated.
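To make the Kueue contrast concrete, a minimal sketch of its opt-in path: a plain batch/v1 Job created suspended and pointed at a LocalQueue (the queue name here is an assumption), with kube-scheduler still doing the binding once Kueue admits it:

```yaml
# Sketch of the Kueue approach: a standard batch/v1 Job, created suspended and
# labeled with a LocalQueue name; kube-scheduler binds the Pods after admission.
# The queue name "training" is an assumption.
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-pretrain-kueue
  labels:
    kueue.x-k8s.io/queue-name: training
spec:
  suspend: true          # Kueue flips this to false once quota admits the Job
  parallelism: 8
  completions: 8
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pytorch
          image: ml/pytorch:2.4
          resources:
            limits:
              nvidia.com/gpu: 8
```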
What still goes wrong with Volcano
Three patterns to watch for:
- PodGroup deadlock at fleet capacity. If two large PodGroups each request 64 Pods on a 100-Pod cluster, both wait forever (each is "almost ready"). Volcano needs preemption rules to break the tie. The Reclaim action in the scheduler config does this; without it, the cluster wedges (the config sketch after this list shows where the action goes).
- Eviction cascades. If RestartJob fires on a 4-hour-old training run because one Pod was preempted, the entire job restarts from the last checkpoint. Set RestartTask only if your framework supports rank recovery (PyTorch elastic, Ray Train); use RestartJob only when you cannot.
- Custom scheduler operational cost. Volcano replaces a battle-tested kube-scheduler with a less-tested one. Production deployments need to monitor scheduler latency, queue depth, and PodGroup admit time as first-class metrics.
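As a sketch of where Reclaim lives: the scheduler is driven by a volcano-scheduler.conf (shipped as a ConfigMap) whose actions line decides which phases run each session. The plugin tiers below mirror a common default layout and are not the only valid choice:

```yaml
# volcano-scheduler.conf (sketch): "reclaim" and "preempt" must appear in the
# actions list or the deadlock-breaking behavior described above never runs.
actions: "enqueue, allocate, preempt, reclaim, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
  - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```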
Practical guidance
- For multi-rank training in Kubernetes, install Volcano (or Kueue, or KAI) and stop trying to make default kube-scheduler do gang work it cannot.
- Set minAvailable equal to the rank count for hard gang scheduling; set it lower only if your framework supports elastic training (see the sketch after this list).
- Configure Reclaim to break PodGroup deadlocks. Without it, two competing large jobs can wedge the fleet.
- Scope queues per team or per workload class so fair-share has something to balance.
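A sketch of the elastic variant, assuming the training framework tolerates ranks joining and leaving (PyTorch elastic, Ray Train): the Job declares more replicas than its gang minimum, so it can start at 6 Pods and grow to 8. Names and numbers are illustrative:

```yaml
# Elastic-style gang (sketch): the job starts once 6 Pods fit and grows to 8.
# Only valid if the training framework handles ranks joining and leaving.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llama-pretrain-elastic
spec:
  schedulerName: volcano
  minAvailable: 6        # gang threshold below the total replica count
  queue: training
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: pytorch
              image: ml/pytorch:2.4
              resources:
                limits:
                  nvidia.com/gpu: 8
```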
The takeaway: Kubernetes is fine for inference and stateless services; for batch and training it needs gang semantics. Volcano is the most-shipped option and the safest default for AI shops on K8s.
Updated 2026-05-10