Preemption
Forcibly stopping lower-priority GPU jobs with checkpoint/restart to free resources.
What it is
Preemption is the act of forcibly stopping or suspending a lower-priority GPU job to free resources for a higher-priority job. Slurm supports multiple preemption modes -- CANCEL (kill immediately), CHECKPOINT (trigger a checkpoint, then cancel), REQUEUE (kill and requeue for a later restart), and SUSPEND (pause in place, resume when resources free up) -- configured via PreemptMode in slurm.conf, with a GraceTime parameter (typically 60-300 seconds) that gives the preempted job time to save state before the kill signal arrives. Kubernetes uses terminationGracePeriodSeconds to give checkpoint frameworks time to save state before pod eviction.
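As a minimal sketch, a partition-priority preemption policy in slurm.conf might look like the lines below; the partition names, node list, and the specific PriorityTier and GraceTime values are illustrative assumptions, not recommendations.

    # slurm.conf -- illustrative sketch; names and values are assumptions
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    PartitionName=prod  Nodes=gpu[001-032] PriorityTier=100 GraceTime=120
    PartitionName=batch Nodes=gpu[001-032] PriorityTier=10  GraceTime=120

With a policy like this, batch jobs are requeued when prod jobs need the GPUs, and GraceTime=120 gives each preempted job 120 seconds to react before it is killed.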
Why it matters
Preemption without checkpoint awareness wastes all GPU-hours consumed since the last saved checkpoint. For a 256-GPU training job running 4 hours since its last checkpoint, a blind preemption destroys 1,024 GPU-hours of compute. Checkpoint recency at preemption time is the key operational variable -- preempting a job that checkpointed 5 minutes ago is nearly free; preempting one that has run 3 hours uncheckpointed is catastrophically wasteful.
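The arithmetic is worth keeping as a rule of thumb. A minimal Python sketch using the figures from this section:

    # GPU-hours discarded if a job is preempted: every GPU-hour computed
    # since the last saved checkpoint is lost and must be redone.
    def wasted_gpu_hours(num_gpus, hours_since_checkpoint):
        return num_gpus * hours_since_checkpoint

    print(wasted_gpu_hours(256, 4.0))   # 1024.0 -- the blind preemption above
    print(wasted_gpu_hours(256, 5/60))  # ~21.3  -- checkpointed 5 minutes ago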
How to monitor
Track checkpoint recency for all running jobs, and compare each job's checkpoint write duration against GraceTime to verify that a checkpoint triggered at preemption can finish before the kill signal arrives. Monitor job requeue counts in squeue to detect preemption-heavy periods. Factryze integrates with checkpoint/restart workflows to track checkpoint recency across all running jobs and recommends preemption targets that minimize total compute loss.
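As a hypothetical sketch of that checkpoint-recency logic in Python -- the RunningJob fields, GRACE_TIME_SECS constant, and helper names are assumptions for illustration, not an existing API:

    import time
    from dataclasses import dataclass

    GRACE_TIME_SECS = 120  # assumed; should match GraceTime in slurm.conf

    @dataclass
    class RunningJob:
        job_id: str
        num_gpus: int
        last_checkpoint_ts: float     # epoch seconds of last completed checkpoint
        checkpoint_write_secs: float  # observed duration of one checkpoint write

    def preemption_cost(job, now):
        # GPU-hours discarded if this job is killed right now.
        return job.num_gpus * (now - job.last_checkpoint_ts) / 3600.0

    def fits_in_grace_period(job):
        # Can a checkpoint triggered at preemption finish before the kill signal?
        return job.checkpoint_write_secs < GRACE_TIME_SECS

    def cheapest_preemption_target(jobs):
        # Prefer the job whose loss-on-kill is smallest.
        now = time.time()
        return min(jobs, key=lambda j: preemption_cost(j, now))

Ranking candidates by preemption_cost encodes the point above: a job that checkpointed five minutes ago is a far cheaper target than one that has run for hours uncheckpointed.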
Related terms
Job scheduling: Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
Slurm: Open-source HPC workload manager scheduling GPU cluster jobs via srun, sbatch, and squeue.
Gang scheduling: Atomic co-scheduling of all GPUs for distributed training requiring synchronized start.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.