
Scale Atlas · Chapter 6 of 86 terms · Updated 2026-05-10

Orchestration

Idle GPUs are the most expensive thing in the building. The scheduler is what stops them from sitting idle and stops tenants from stepping on each other. Kubernetes sees GPUs through device plugins; Slurm and Volcano coordinate gangs that must start as one; fair-share queues and preemption decide who waits and who gets evicted; multi-tenant isolation contains the blast when something breaks.

[Figure: gang scheduling. Waiting: 8 GPUs reserved, 1 still pending (47s idle). Running: all 9 ranks (r0 to r8) start together.]
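The all-or-nothing start that gang scheduling enforces can be sketched in a few lines. This is a minimal illustration, not how Slurm or Volcano implement it: `Job`, `try_admit`, and `schedule` are hypothetical names, and GPUs are modeled as a single counter with no topology, priorities, or reservations. The sketch assumes a gang that does not fit holds nothing, so smaller jobs can use the free GPUs in the meantime.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    ranks: int  # GPUs the gang needs, one per rank (hypothetical model)


def try_admit(job: Job, free_gpus: int) -> tuple[bool, int]:
    """All-or-nothing admission: start every rank or none.

    Returns (admitted, free_gpus_after).
    """
    if job.ranks <= free_gpus:
        return True, free_gpus - job.ranks
    # Not enough GPUs for the whole gang: hold nothing, so the
    # free GPUs remain available to jobs that do fit.
    return False, free_gpus


def schedule(queue: list[Job], free_gpus: int) -> list[str]:
    """Walk the queue in order, admitting each gang that fits whole."""
    started = []
    for job in queue:
        admitted, free_gpus = try_admit(job, free_gpus)
        if admitted:
            started.append(job.name)
    return started


queue = [Job("train-9", 9), Job("eval-4", 4)]
print(schedule(queue, free_gpus=8))  # 9-rank gang waits; the 4-rank job runs
print(schedule(queue, free_gpus=9))  # 9-rank gang starts; nothing left for eval
```

Letting smaller jobs run ahead keeps GPUs busy but can starve a large gang indefinitely. Real schedulers weigh that against reserving GPUs for the gang, which is where fair-share and preemption policies come in.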