Cluster Management
GPU cluster orchestration — how jobs are scheduled, how GPUs are partitioned, and where workloads are placed — directly determines utilization rates and cost efficiency. A poorly configured scheduler can leave expensive GPUs idle while jobs queue, and topology-unaware placement can force collective communication traffic across slow PCIe links instead of fast NVLink meshes, degrading training throughput by 40% or more. This section covers the scheduling, partitioning, and resource management concepts essential for GPU infrastructure — from Slurm job scheduling and MIG partitioning for GPU sharing, to gang scheduling that ensures all ranks of a distributed job start simultaneously, to node draining procedures that safely evacuate workloads before maintenance. Each term includes operational context on how these mechanisms interact with GPU health monitoring and the automation that Factryze provides for topology-aware scheduling decisions.
Gang Scheduling
Atomic co-scheduling of all GPUs for distributed training requiring synchronized start.
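A minimal sketch of the gang-admission idea in Python: a job asking for N GPUs is admitted only when all N are free at once, otherwise it stays queued. The job shape and cluster state here are illustrative, not any particular scheduler's API.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus_needed: int  # total GPUs across all ranks of the distributed job


def try_gang_admit(job: Job, free_gpus: int) -> bool:
    """Admit the job only if every GPU it needs is free right now.

    Gang semantics: either all ranks get their GPUs and start together,
    or none are placed and the job stays queued. Partial placement would
    leave the started ranks blocked in collective calls, wasting GPUs.
    """
    if job.gpus_needed <= free_gpus:
        print(f"admit {job.name}: all {job.gpus_needed} GPUs start together")
        return True
    print(f"queue {job.name}: only {free_gpus} GPUs free, need {job.gpus_needed}")
    return False


if __name__ == "__main__":
    try_gang_admit(Job("llm-pretrain", gpus_needed=16), free_gpus=12)  # queued
    try_gang_admit(Job("llm-pretrain", gpus_needed=16), free_gpus=16)  # admitted
```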
GPU Partitioning
Sharing a single GPU across workloads via MIG, MPS, or time-slicing mechanisms.
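A deliberately simplified decision helper for the three sharing mechanisms; the real trade-offs depend on GPU generation and driver features, so treat the branching logic as an illustration rather than a rule.

```python
def pick_sharing_mechanism(needs_strict_isolation: bool,
                           workloads_trusted: bool) -> str:
    """Illustrative choice among mechanisms for sharing one physical GPU.

    - MIG: hardware-partitioned slices of memory and SMs with fault isolation.
    - MPS: multiple processes submit kernels concurrently; weaker isolation than MIG.
    - Time-slicing: contexts alternate on the whole GPU; simplest, weakest isolation.
    """
    if needs_strict_isolation:
        return "MIG"
    if workloads_trusted:
        return "MPS"
    return "time-slicing"


print(pick_sharing_mechanism(needs_strict_isolation=True, workloads_trusted=False))   # MIG
print(pick_sharing_mechanism(needs_strict_isolation=False, workloads_trusted=True))   # MPS
```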
Job Scheduling
Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
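A small sketch of priority ordering with a fair-share adjustment: a job's base priority is discounted by its account's recent GPU usage before it is sorted into the queue. The scoring formula and weights are made up for the illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedJob:
    sort_key: float
    name: str = field(compare=False)


def fair_share_key(base_priority: int, recent_gpu_hours: float,
                   share_weight: float = 0.1) -> float:
    """Illustrative fair-share score: higher base priority runs sooner, but
    heavy recent GPU usage by the same account pushes a job back in line."""
    # heapq pops the smallest key, so negate: larger score = scheduled sooner.
    return -(base_priority - share_weight * recent_gpu_hours)


queue: list[QueuedJob] = []
heapq.heappush(queue, QueuedJob(fair_share_key(100, recent_gpu_hours=500), "team-a-train"))
heapq.heappush(queue, QueuedJob(fair_share_key(80, recent_gpu_hours=0), "team-b-eval"))
print(heapq.heappop(queue).name)  # team-b-eval runs first: team-a has used its share
```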
MIG (Multi-Instance GPU)
Hardware partitioning on A100/H100 GPUs creating up to seven isolated GPU instances.
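As an illustration rather than an official procedure, enabling MIG mode and carving a GPU into seven small instances can be scripted around nvidia-smi. The GPU index and profile ID below are assumptions for the sketch (19 is the 1g profile on A100); check `nvidia-smi mig -lgip` for the profiles your hardware supports, and note that enabling MIG mode generally requires the GPU to be idle.

```python
import subprocess


def run(cmd: list[str]) -> None:
    """Echo and run a command; a real script would also handle failures."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Enable MIG mode on GPU 0, then create seven of the smallest GPU instances;
# -C also creates the matching compute instance inside each GPU instance.
run(["sudo", "nvidia-smi", "-i", "0", "-mig", "1"])
run(["sudo", "nvidia-smi", "mig", "-i", "0", "-cgi", "19,19,19,19,19,19,19", "-C"])
run(["nvidia-smi", "mig", "-lgi"])  # list the instances that were created
```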
Node Draining
Gracefully removing a node from scheduling via kubectl drain or Slurm DRAIN state.
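A hedged sketch of both drain paths with a placeholder node name; in practice you would run whichever commands match your orchestrator and wait for workloads to finish or reschedule before starting maintenance.

```python
import subprocess

NODE = "gpu-node-07"  # placeholder node name


def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Kubernetes: cordon the node and evict its pods before maintenance.
run(["kubectl", "drain", NODE, "--ignore-daemonsets", "--delete-emptydir-data"])

# Slurm: stop new jobs from landing on the node; running jobs finish first.
run(["scontrol", "update", f"NodeName={NODE}", "State=DRAIN", "Reason=gpu-maintenance"])

# After maintenance, return the node to service.
run(["kubectl", "uncordon", NODE])
run(["scontrol", "update", f"NodeName={NODE}", "State=RESUME"])
```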
Preemption
Forcibly stopping lower-priority GPU jobs, with checkpoint/restart, to free resources for higher-priority work.
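A sketch of the checkpoint-then-stop half of preemption: signal the lower-priority job so its training loop can write a checkpoint, wait out a grace window, then force-kill it. The signal choice and grace period are illustrative; a real scheduler (for example Slurm with PreemptMode=REQUEUE) handles the requeue and restart for you.

```python
import os
import signal
import time

CHECKPOINT_GRACE_SECONDS = 120  # illustrative grace window


def preempt(pid: int) -> None:
    """Ask a lower-priority job to checkpoint and exit, then force-stop it."""
    os.kill(pid, signal.SIGTERM)          # ask the job to checkpoint and exit
    deadline = time.time() + CHECKPOINT_GRACE_SECONDS
    while time.time() < deadline:
        try:
            os.kill(pid, 0)               # probe: is the process still alive?
        except ProcessLookupError:
            return                        # exited cleanly after checkpointing
        time.sleep(5)
    os.kill(pid, signal.SIGKILL)          # grace period expired; force stop
```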
Slurm
Open-source HPC workload manager for GPU cluster jobs, operated through commands such as srun, sbatch, and squeue.
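A typical GPU batch submission, wrapped in Python for consistency with the other sketches; the partition name, resource counts, and training command are placeholders to adapt to your cluster.

```python
import subprocess
import tempfile

# Placeholder batch script: two nodes, four GPUs and four tasks per node.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=resnet-train
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00
srun python train.py
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(JOB_SCRIPT)
    script_path = f.name

subprocess.run(["sbatch", script_path], check=True)  # submit the job
subprocess.run(["squeue"], check=True)               # view the queue
```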
Topology-Aware Placement
Scheduling GPU jobs by NVLink domain, NUMA affinity, and network switch locality.
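A minimal sketch of how a topology-aware scheduler might score candidate placements: pairs of GPUs that share an NVLink domain, NUMA node, or leaf switch are rewarded, and the feasible candidate with the highest score wins. The topology model and weights are assumptions for the illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class GpuSlot:
    node: str
    nvlink_domain: int   # GPUs in the same domain communicate over NVLink
    numa_node: int
    leaf_switch: str     # network locality for cross-node traffic


def placement_score(slots: list[GpuSlot]) -> int:
    """Illustrative locality score for a candidate set of GPU slots."""
    score = 0
    for a, b in combinations(slots, 2):
        if a.node == b.node and a.nvlink_domain == b.nvlink_domain:
            score += 10          # NVLink beats PCIe and network hops
        if a.node == b.node and a.numa_node == b.numa_node:
            score += 3           # avoid cross-socket PCIe traffic
        if a.leaf_switch == b.leaf_switch:
            score += 1           # keep cross-node traffic under one switch
    return score


same_island = [GpuSlot("n1", 0, 0, "leaf-1")] * 4
scattered = [GpuSlot("n1", 0, 0, "leaf-1"), GpuSlot("n2", 1, 1, "leaf-3"),
             GpuSlot("n3", 0, 0, "leaf-2"), GpuSlot("n4", 1, 1, "leaf-4")]
print(placement_score(same_island), ">", placement_score(scattered))
```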