Scale Atlas · Chapter 6 · Updated 2026-05-10
Orchestration
Idle GPUs are the most expensive thing in the building. The scheduler is what keeps them busy and keeps tenants from stepping on each other. Kubernetes sees GPUs through device plugins; Slurm and Volcano coordinate gangs that must start as one; fair-share queues and preemption decide who waits and who gets evicted; multi-tenant isolation contains the blast when something breaks.
Fair-Share Queues
Slurm and Volcano scheduling policies that allocate GPU time across teams over a sliding window. Teams that burned heavily yesterday see their priority dampened today; light users get a boost.
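A toy sketch of the idea, not Slurm's or Volcano's actual implementation: recorded usage decays over the window, and a team's priority factor falls off as its decayed usage exceeds its share (the classic 2^(-usage/share) shape). The half-life, shares, and numbers below are assumptions for illustration.

```python
import math

HALF_LIFE_HOURS = 24.0  # assumed decay half-life for the sliding window

def decayed_usage(usage_gpu_hours: float, hours_since_use: float) -> float:
    """Exponentially decay recorded GPU-hours as they age."""
    return usage_gpu_hours * 0.5 ** (hours_since_use / HALF_LIFE_HOURS)

def fair_share_factor(decayed: float, share: float, cluster_usage: float) -> float:
    """2^(-U/S) shape: near 1.0 for under-served teams, near 0.0 far over share."""
    if cluster_usage == 0:
        return 1.0
    normalized_usage = decayed / cluster_usage  # this team's slice of recent usage
    return 2.0 ** (-normalized_usage / share)

# Team A burned 900 GPU-hours yesterday with a 50% share; Team B used 50.
total = decayed_usage(900, 24) + decayed_usage(50, 24)
print(fair_share_factor(decayed_usage(900, 24), 0.5, total))  # dampened (~0.27)
print(fair_share_factor(decayed_usage(50, 24), 0.5, total))   # boosted (~0.93)
```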
Kubernetes GPU Scheduling
Device plugins surface GPUs as schedulable resources. The NVIDIA GPU Operator wires up the driver, the device plugin, DCGM, and MIG so kube-scheduler can match Pods to silicon.
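A minimal sketch with the official kubernetes Python client: the Pod requests one nvidia.com/gpu, an extended resource that exists only because the device plugin (installed by the GPU Operator) advertises it on each node. The pod name and image are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

# Sketch only: the key line is the extended-resource limit "nvidia.com/gpu",
# which kube-scheduler uses to match this Pod to a node with a free GPU.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cuda-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # whole GPUs only, no fractions
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```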
Multi-Tenant Isolation
Stacked boundaries that keep one tenant's GPU faults from affecting another's: namespace and RBAC, network policy, resource quota, MIG or MPS partitioning. Each layer catches a different blast type.
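One layer of that stack, sketched with the kubernetes Python client: a per-namespace ResourceQuota that caps how many GPUs a tenant can request. The namespace name and the cap are illustrative.

```python
from kubernetes import client, config

config.load_kube_config()

# One isolation layer: a ResourceQuota capping a tenant namespace at 8
# requested GPUs. "requests.nvidia.com/gpu" is the quota key for the
# extended resource; namespace and limit are illustrative.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="gpu-quota", namespace="team-vision"),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.nvidia.com/gpu": "8"}
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(
    namespace="team-vision", body=quota
)
```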
Preemption Strategies
When and how to interrupt a running job for a higher-priority workload. Three modes: kill (state lost), checkpoint-and-evict (state saved), demote (stays running at lower priority).
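A hypothetical decision sketch, not any real scheduler's API, showing how the three modes might be chosen: demote when the priority gap is small, checkpoint-and-evict when the victim can save its state, kill as the last resort.

```python
from enum import Enum, auto
from dataclasses import dataclass

class Action(Enum):
    KILL = auto()                  # state lost
    CHECKPOINT_AND_EVICT = auto()  # state saved, job requeued
    DEMOTE = auto()                # keeps running at lower priority

@dataclass
class Job:
    priority: int
    checkpointable: bool

def preemption_action(victim: Job, incoming_priority: int) -> Action:
    if incoming_priority <= victim.priority:
        raise ValueError("only higher-priority work may preempt")
    # Small priority gap: let the victim keep running, just deprioritized.
    if incoming_priority - victim.priority <= 1:
        return Action.DEMOTE
    # Victim can save its state: evict without losing work.
    if victim.checkpointable:
        return Action.CHECKPOINT_AND_EVICT
    # Last resort: kill and restart from scratch.
    return Action.KILL

print(preemption_action(Job(priority=3, checkpointable=True), incoming_priority=9))
```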
Slurm Gang Scheduling
Reserve every node a multi-rank job needs and start all ranks at once. Without it, the ranks that launch first stall in MPI_Init waiting for peers still in the queue, burning GPU-hours on idle reservations.
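A sketch of a whole-node, gang-started submission: the #SBATCH directives reserve four nodes exclusively and srun launches all ranks together. The script body, node counts, and train.py are illustrative; it is submitted from Python via sbatch reading the script on stdin.

```python
import subprocess

# Illustrative job script: 4 nodes x 8 GPUs, reserved exclusively so all 32
# ranks start together instead of trickling in. train.py is a placeholder.
job_script = """#!/bin/bash
#SBATCH --job-name=gang-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --time=04:00:00

# srun launches every rank at once; MPI_Init sees all peers immediately.
srun python train.py
"""

# sbatch accepts a batch script on stdin when no file is given.
result = subprocess.run(["sbatch"], input=job_script, text=True,
                        capture_output=True, check=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```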
Volcano Scheduler
Batch-oriented, gang-aware Kubernetes scheduler widely used for AI training. Brings PodGroup, queue, and minAvailable semantics that the default kube-scheduler lacks.
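A sketch that creates a Volcano PodGroup through the kubernetes client's generic CustomObjectsApi: the PodGroup carries the gang constraint (spec.minMember) and the queue binding, while Volcano Jobs expose the related minAvailable field. Names, namespace, and the member count are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

# Sketch only: a Volcano PodGroup declaring that 16 pods must be schedulable
# before any of them start (the gang constraint), billed against a named queue.
pod_group = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "PodGroup",
    "metadata": {"name": "llm-pretrain", "namespace": "team-nlp"},
    "spec": {
        "minMember": 16,      # gang size: schedule all 16 or none
        "queue": "research",  # fair-share queue this group draws from
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="scheduling.volcano.sh",
    version="v1beta1",
    namespace="team-nlp",
    plural="podgroups",
    body=pod_group,
)
```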