Volcano Scheduler
The default kube-scheduler has no concept of a job that cannot make progress until every one of its Pods is running. It schedules Pods one at a time, in priority order, picking whichever node currently fits each Pod. For a stateless web service, that is exactly right. For a 64-Pod training job that deadlocks if 63 Pods are running and one is still pending, it is a recipe for stalled jobs and burned GPU-hours. Volcano is the CNCF project that fixes this by replacing the default scheduler for batch workloads with one that has gang semantics, queues, fair-share, and the rest of the HPC scheduling vocabulary.
What Volcano adds on top of kube-scheduler
Volcano does not cooperate with the default scheduler the way Kueue does; it fully takes over scheduling for any Pod that opts in via a schedulerName: volcano field, while everything else stays with kube-scheduler. Once a Pod is scheduled by Volcano, it gets access to a stack of features the default scheduler does not provide:
- PodGroup CRD with MinAvailable. Group N Pods together and tell Volcano "only schedule any of them when at least M can run together." For training jobs, M usually equals N (every rank or none).
- Queues. Workloads are dispatched from named queues with capacity limits and priorities. Useful for separating teams, environments (research vs production), or job types; a minimal queue definition is sketched after this list.
- Fair-share. Priorities are weighed against each queue's usage so high-volume teams do not starve smaller ones.
- Backfill. Like Slurm, Volcano slips short jobs past long-waiting reservations when there is room.
- Plugin architecture. Topology-awareness, SLA enforcement, NUMA alignment, and other policies plug in as scheduler plugins.
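To make the queue idea concrete, here is a minimal sketch of two Queue objects; the names, weights, and GPU cap are illustrative rather than recommended values:

```yaml
# Illustrative queue definitions (names, weights, and the GPU cap are assumptions).
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 4              # fair-share weight relative to other queues
  reclaimable: true      # capacity can be reclaimed when other queues are starved
  capability:
    nvidia.com/gpu: 64   # hard ceiling for this queue
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research
spec:
  weight: 1
  reclaimable: true
```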
The CRD that ties it together is Job (Volcano's, not Kubernetes' batch/v1 Job): one CRD describes the whole multi-rank workload, and Volcano materializes it as a PodGroup plus the right number of Pods.
How a Volcano job actually flows
A typical PyTorch DDP training job submitted to Volcano:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llama-pretrain-64gpu
spec:
  schedulerName: volcano
  minAvailable: 8
  queue: training
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: pytorch
              image: ml/pytorch:2.4
              resources:
                limits:
                  nvidia.com/gpu: 8
```

What happens at runtime:
1. The Volcano controller creates a PodGroup with minMember: 8 (mirroring the Job's minAvailable) and 8 child Pods, none of which is bound to a node yet.
2. The Volcano scheduler enters its session loop. Each session it walks the training queue, picks the next PodGroup, and asks "are 8 nodes simultaneously free that can each run one of these Pods?"
3. If yes, all 8 Pods are bound atomically. The Pods start, NCCL bootstraps, training begins.
4. If no, the PodGroup waits. Other smaller jobs in the queue may slip past via backfill.
5. If a Pod is later evicted (preemption, node failure), the RestartJob policy kicks in and the entire group restarts from a checkpoint.
The atomicity in step 3 is the entire point. Default kube-scheduler would have bound Pods 1-7 the moment they fit and left Pod 8 pending; Volcano refuses to bind any until all 8 fit.
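For orientation, the PodGroup the controller creates for this Job looks roughly like the sketch below; the real object name carries the Job's UID and the exact fields vary by Volcano version:

```yaml
# Approximately what the controller materializes for the Job above (sketch;
# the live object's name and annotations are generated by the controller).
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llama-pretrain-64gpu
spec:
  minMember: 8            # gang threshold: bind all 8 worker Pods or none
  queue: training
  minResources:
    nvidia.com/gpu: 64    # 8 Pods x 8 GPUs each
```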
Where Volcano competes with Kueue and KAI Scheduler
Volcano is one of three CNCF-adjacent gang schedulers commonly deployed in production:
- Volcano is the oldest, most feature-rich, and the one most teams reach for if they need everything in one project. It is also the heaviest: a full custom scheduler to run and operate.
- Kueue is newer, built by the Kubernetes SIG, and integrates more cleanly with the default scheduler (Kueue handles admission and queueing; kube-scheduler handles binding). Lighter weight.
- KAI Scheduler (NVIDIA, formerly Run:ai) bundles fair-share, gang, and GPU-aware fragmentation policies, and ships as part of NVIDIA's enterprise stack.
Pick on operational fit. If your shop needs hard gang semantics with queues and fair-share and you are willing to swap out the default scheduler, Volcano is the most mature. If you want gang admission without replacing kube-scheduler, Kueue is lighter. If you are already paying for NVIDIA enterprise software, KAI is integrated.
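To make the Kueue contrast concrete, a minimal sketch of its opt-in path: a plain batch/v1 Job created suspended and pointed at a LocalQueue (the queue name here is an assumption), with kube-scheduler still doing the binding once Kueue admits it:

```yaml
# Sketch of the Kueue approach: a standard batch/v1 Job, created suspended and
# labeled with a LocalQueue name; kube-scheduler binds the Pods after admission.
# The queue name "training" is an assumption.
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-pretrain-kueue
  labels:
    kueue.x-k8s.io/queue-name: training
spec:
  suspend: true          # Kueue flips this to false once quota admits the Job
  parallelism: 8
  completions: 8
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pytorch
          image: ml/pytorch:2.4
          resources:
            limits:
              nvidia.com/gpu: 8
```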
What still goes wrong with Volcano
Three patterns to watch for:
- PodGroup deadlock at fleet capacity. If two large PodGroups each request 64 Pods on a 100-Pod cluster, both wait forever (each is "almost ready"). Volcano needs preemption rules to break the tie. The Reclaim action in the scheduler config does this; without it, the cluster wedges (the config sketch after this list shows where the action goes).
- Eviction cascades. If RestartJob fires on a 4-hour-old training run because one Pod was preempted, the entire job restarts from the last checkpoint. Set RestartTask only if your framework supports rank recovery (PyTorch elastic, Ray Train); use RestartJob only when you cannot.
- Custom scheduler operational cost. Volcano replaces a battle-tested kube-scheduler with a less-tested one. Production deployments need to monitor scheduler latency, queue depth, and PodGroup admit time as first-class metrics.
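As a sketch of where Reclaim lives: the scheduler is driven by a volcano-scheduler.conf (shipped as a ConfigMap) whose actions line decides which phases run each session. The plugin tiers below mirror a common default layout and are not the only valid choice:

```yaml
# volcano-scheduler.conf (sketch): "reclaim" and "preempt" must appear in the
# actions list or the deadlock-breaking behavior described above never runs.
actions: "enqueue, allocate, preempt, reclaim, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
  - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```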
Practical guidance
- For multi-rank training in Kubernetes, install Volcano (or Kueue, or KAI) and stop trying to make default kube-scheduler do gang work it cannot.
- Set minAvailable equal to the rank count for hard gang scheduling; set it lower only if your framework supports elastic training (see the sketch after this list).
- Configure Reclaim to break PodGroup deadlocks. Without it, two competing large jobs can wedge the fleet.
- Scope queues per team or per workload class so fair-share has something to balance.
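A sketch of the elastic variant, assuming the training framework tolerates ranks joining and leaving (PyTorch elastic, Ray Train): the Job declares more replicas than its gang minimum, so it can start at 6 Pods and grow to 8. Names and numbers are illustrative:

```yaml
# Elastic-style gang (sketch): the job starts once 6 Pods fit and grows to 8.
# Only valid if the training framework handles ranks joining and leaving.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llama-pretrain-elastic
spec:
  schedulerName: volcano
  minAvailable: 6        # gang threshold below the total replica count
  queue: training
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: pytorch
              image: ml/pytorch:2.4
              resources:
                limits:
                  nvidia.com/gpu: 8
```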
The takeaway: Kubernetes is fine for inference and stateless services; for batch and training it needs gang semantics. Volcano is the most-shipped option and the safest default for AI shops on K8s.
Updated 2026-05-10