8 terms

Cluster Management

GPU cluster orchestration — how jobs are scheduled, how GPUs are partitioned, and where workloads are placed — directly determines utilization rates and cost efficiency. A poorly configured scheduler can leave expensive GPUs idle while jobs queue, and topology-unaware placement can force collective communication traffic across slow PCIe links instead of fast NVLink meshes, degrading training throughput by 40% or more. This section covers the scheduling, partitioning, and resource management concepts essential for GPU infrastructure — from SLURM job scheduling and MIG partitioning for GPU sharing, to gang scheduling that ensures all ranks of a distributed job start simultaneously, to node draining procedures that safely evacuate workloads before maintenance. Each term includes operational context on how these mechanisms interact with GPU health monitoring and the automation that Factryze provides for topology-aware scheduling decisions.
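The gang-scheduling idea mentioned above — either every rank of a distributed job gets its GPUs at once, or the whole job waits — can be sketched in a few lines. This is a simplified illustration, not SLURM's actual algorithm; the function name `gang_schedule` and the queue representation are hypothetical:

```python
def gang_schedule(free_gpus, jobs):
    """Start jobs whose full GPU demand fits; hold the rest intact.

    free_gpus: number of currently idle GPUs in the cluster
    jobs: list of (job_id, gpus_needed) tuples, in queue order
    Returns (started, queued) lists of job ids.

    Key property of gang scheduling: a job that cannot get ALL of
    its GPUs takes none of them, so partially started jobs never
    hold GPUs idle while waiting for the remaining ranks. Smaller
    jobs later in the queue may backfill into the leftover GPUs.
    """
    started, queued = [], []
    for job_id, need in jobs:
        if need <= free_gpus:
            free_gpus -= need      # all ranks start together
            started.append(job_id)
        else:
            queued.append(job_id)  # the whole gang waits
    return started, queued
```

For example, with 8 free GPUs and a queue of jobs needing 4, 6, and 2 GPUs, the 4- and 2-GPU jobs start while the 6-GPU job waits intact; in a real scheduler, backfilling like this is typically bounded so large jobs are not starved indefinitely.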