Volcano Scheduler

Batch-style gang-aware Kubernetes scheduler used widely for AI training. Brings PodGroup, queues, and MinAvailable semantics that the default kube-scheduler lacks.
Project
CNCF (Volcano)
Primary CRD
PodGroup with MinAvailable
Replaces
default kube-scheduler for batch

The default kube-scheduler has no concept of a job that cannot make progress until every one of its Pods is running. It schedules Pods one at a time, in priority order, binding each to whichever node currently fits it. For a stateless web service, that is exactly right. For a 64-Pod training job that deadlocks when 63 Pods are running and one is still pending, it is a recipe for stalled jobs and burned GPU-hours. Volcano is the CNCF project that fixes this: it replaces the default scheduler for batch workloads with one that has gang semantics, queues, fair-share, and the rest of the HPC scheduling vocabulary.

What Volcano adds on top of kube-scheduler

Volcano runs as a second scheduler alongside the default one, and takes over scheduling for any Pod that opts in via the schedulerName: volcano field. Once a Pod is volcano-scheduled, it gets access to a stack of features the default scheduler does not provide:

  • PodGroup CRD with MinAvailable. Group N Pods together and tell Volcano "only schedule any of them when at least M can run together." For training jobs, M usually equals N (every rank or none).
  • Queues. Workloads are dispatched from named queues with capacity limits and priorities. Useful for separating teams, environments (research vs production), or job types.
  • Fair-share. Cluster resources are divided across queues in proportion to their weights, so high-volume teams do not starve smaller ones. See fair-share queues.
  • Backfill. Like Slurm, Volcano slips short jobs past long-waiting reservations when there is room.
  • Plugin architecture. Topology-awareness, SLA enforcement, NUMA alignment, and other policies plug in as scheduler plugins.

The CRD that ties it together is Job (Volcano's, not Kubernetes' batch/v1 Job): one CRD describes the whole multi-rank workload, and Volcano materializes it as a PodGroup plus the right number of Pods.
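Queues are themselves a CRD. A minimal sketch of the training queue the Job below references; the weight and capability values here are illustrative, not defaults:

```yaml
# Queue CRD: workloads reference it by name (queue: training).
# weight drives fair-share; capability caps the queue's total usage.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 4               # fair-share weight relative to other queues
  reclaimable: true       # idle capacity can be reclaimed by other queues
  capability:
    cpu: "512"
    memory: 2Ti
    nvidia.com/gpu: "64"
```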

[Diagram: the default kube-scheduler binds Pods one at a time (Pods 1-3 scheduled, Pod 4 pending, MPI_Init stalls); Volcano turns a Volcano Job CRD into a PodGroup with MinAvailable=4, runs a queue check ("all 4 free?"), and dispatches all four Pods atomically. Gang semantics for K8s batch workloads.]

How a Volcano job actually flows

A typical PyTorch DDP training job submitted to Volcano:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llama-pretrain-64gpu
spec:
  schedulerName: volcano
  minAvailable: 8
  queue: training
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: pytorch
              image: ml/pytorch:2.4
              resources:
                limits:
                  nvidia.com/gpu: 8

What happens at runtime:

  1. The Volcano controller creates a PodGroup with minAvailable: 8 and 8 child Pods, none of which is bound to a node yet.
  2. The Volcano scheduler enters its session loop. Each session it walks the training queue, picks the next PodGroup, and asks: is there enough simultaneously free capacity to place all 8 Pods at once?
  3. If yes, all 8 Pods are bound atomically. The Pods start, NCCL bootstraps, training begins.
  4. If no, the PodGroup waits. Other smaller jobs in the queue may slip past via backfill.
  5. If a Pod is later evicted (preemption, node failure), the RestartJob policy kicks in and the entire group restarts from a checkpoint.

The atomicity in step 3 is the entire point. Default kube-scheduler would have bound Pods 1-7 the moment they fit and left Pod 8 pending; Volcano refuses to bind any until all 8 fit.
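What the controller-created PodGroup from step 1 looks like, sketched from the scheduling.volcano.sh/v1beta1 API. Note the field rename: the Job's minAvailable becomes the PodGroup's minMember. The minResources value here is illustrative:

```yaml
# PodGroup materialized by the Volcano job controller (sketch).
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llama-pretrain-64gpu
spec:
  minMember: 8            # the Job's minAvailable: the gang size
  queue: training
  minResources:
    nvidia.com/gpu: "64"  # 8 Pods x 8 GPUs each
status:
  phase: Inqueue          # Pending -> Inqueue -> Running once all 8 bind
```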

Where Volcano competes with Kueue and KAI Scheduler

Volcano is one of three CNCF-adjacent gang schedulers commonly deployed in production:

  • Volcano is the oldest, most feature-rich, and the one most teams reach for if they need everything in one project. It is the heavyweight option: a full custom scheduler.
  • Kueue is newer, built by the Kubernetes SIG, and integrates more cleanly with the default scheduler (Kueue handles admission and queueing; kube-scheduler handles binding). Lighter weight.
  • KAI Scheduler (NVIDIA, formerly Run:AI) bundles fair-share, gang, and GPU-aware fragmentation policies. Acquired into NVIDIA's enterprise stack.

Pick on operational fit. If your shop needs hard gang semantics with queues and fair-share and you are willing to swap out the default scheduler, Volcano is the most mature. If you want gang admission without replacing kube-scheduler, Kueue is lighter. If you are already paying for NVIDIA enterprise software, KAI is integrated.

What still goes wrong with Volcano

Three patterns to watch for:

  1. PodGroup deadlock at fleet capacity. If two large PodGroups each request 64 Pods on a 100-Pod cluster, both wait forever (each is "almost ready"). Volcano needs preemption rules to break the tie. The Reclaim action in the scheduler config does this; without it, the cluster wedges.

  2. Eviction cascades. If RestartJob fires on a 4-hour-old training run because one Pod was preempted, the entire job restarts from the last checkpoint. Use RestartTask if your framework supports rank recovery (PyTorch elastic, Ray Train); reserve RestartJob for when it cannot.

  3. Custom scheduler operational cost. Volcano replaces a battle-tested kube-scheduler with a less-tested one. Production deployments need to monitor scheduler latency, queue depth, and PodGroup admit-time as first-class metrics.
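Reclaim and preempt are enabled in the scheduler's action pipeline, which lives in its ConfigMap. A sketch close to the shipped defaults, with the two actions added; the plugin tiers here are illustrative:

```yaml
# volcano-scheduler ConfigMap: add reclaim and preempt to the
# action pipeline so wedged PodGroups can be broken up.
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
```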

Practical guidance

  • For multi-rank training in Kubernetes, install Volcano (or Kueue, or KAI) and stop trying to make default kube-scheduler do gang work it cannot.
  • Set MinAvailable equal to the rank count for hard gang; lower if your framework supports elastic training.
  • Configure Reclaim to break PodGroup deadlocks. Without it, two competing large jobs can wedge the fleet.
  • Scope queues per team or per workload class so fair-share has something to balance.
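The elastic variant of the MinAvailable guidance above, sketched as a Job that runs with as few as 4 of 8 workers. The name and image are hypothetical, and this only makes sense if the training framework tolerates losing ranks:

```yaml
# Elastic gang: schedule when at least 4 of 8 workers fit;
# the remaining replicas join as capacity frees up.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: elastic-train        # hypothetical name
spec:
  schedulerName: volcano
  minAvailable: 4            # gang floor below the replica count
  queue: training
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: ml/pytorch:2.4
              resources:
                limits:
                  nvidia.com/gpu: 8
```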

The takeaway: Kubernetes is fine for inference and stateless services; for batch and training it needs gang semantics. Volcano is the most-shipped option and the safest default for AI shops on K8s.

Updated 2026-05-10