Preemption Strategies
A scheduler that cannot preempt is a scheduler stuck behind whatever is running. The flagship 4-day pretraining job submitted last Tuesday holds 256 GPUs; the urgent inference incident needs 32 GPUs right now. Without preemption, the inference job waits four days. Preemption is the policy that lets the scheduler take GPUs back from a running job to give to a more important one. The interesting question is how.
The three preemption modes
Production schedulers ship three flavors of preemption, each with very different SLA consequences:
Kill. The scheduler sends SIGKILL (or the cluster equivalent) to every Pod or rank in the preempted job. State in HBM and in process memory is gone. The job's restart logic, if any, kicks in: the framework reads the last checkpoint from disk and restarts. If checkpointing is hourly and the kill arrived 50 minutes into the hour, the job loses 50 minutes of work. If checkpointing is per-minute, it loses up to a minute. If there is no checkpointing, the job starts from scratch.
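What the restart path looks like after a kill: a minimal sketch, assuming checkpoints are complete, zero-padded files named step-N.pt in a shared directory (the path and naming scheme are illustrative, not any particular framework's).

```python
import glob
import os

CKPT_DIR = "/pfs/job-1234/checkpoints"   # hypothetical shared PFS path

def latest_checkpoint(ckpt_dir: str = CKPT_DIR) -> str | None:
    """Newest complete checkpoint, or None for a cold start."""
    # Zero-padded step numbers sort lexicographically; torn writes live in
    # *.tmp files that never match the pattern, so they are never resumed.
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "step-*.pt")))
    return ckpts[-1] if ckpts else None

def resume_step(ckpt_dir: str = CKPT_DIR) -> int:
    path = latest_checkpoint(ckpt_dir)
    if path is None:
        return 0   # no checkpoint: everything restarts from scratch
    # "step-000480.pt" -> 480; all work after this step is lost to the kill
    return int(os.path.basename(path)[len("step-"):-len(".pt")])
```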
Checkpoint and evict. Before sending the kill, the scheduler sends a coordinated signal (often SIGTERM with a configurable grace period, or a custom hook) telling the framework to checkpoint to a parallel filesystem. The framework writes optimizer state, model weights, dataloader cursor, RNG seeds, and any other recoverable state. Once the checkpoint completes, the scheduler kills the Pods and dispatches the higher-priority job. The preempted job is requeued and resumes from the just-written checkpoint when capacity returns.
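A minimal sketch of the framework side for a single-process loop (a real multi-rank job would trap the signal on every rank and synchronize the dump); run_one_step and save_checkpoint are placeholders for your own step and state-dump functions, not any framework's API:

```python
import signal

preempted = False

def _on_sigterm(signum, frame):
    # The scheduler's eviction signal: don't exit here, just raise a flag
    # so the training loop can checkpoint at a safe step boundary.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, _on_sigterm)

def train(state, max_steps, run_one_step, save_checkpoint):
    for step in range(max_steps):
        run_one_step(state)
        if preempted:
            # Full dump: weights, optimizer, dataloader cursor, RNG seeds.
            save_checkpoint(state, step)
            raise SystemExit(0)   # exit cleanly inside the grace period
```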
Demote. The scheduler does not kill the job at all; it moves the job to a lower queue or priority class so that it receives a smaller share in the next scheduling round. Useful when the higher-priority job needs more capacity than the cluster has at this priority but does not need the specific GPUs the running job holds. Effective only on heterogeneous clusters or when MIG / MPS sharing lets the new job land on the same hardware.
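A toy sketch of what demotion means scheduler-side, using a priority-indexed queue table (the Job fields and queue structure are assumptions for illustration, not any scheduler's actual API):

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int   # higher wins in the next scheduling round
    gpus: int

def demote(job: Job, new_priority: int, queues: dict[int, list[Job]]) -> None:
    # Nothing is killed or checkpointed: the job keeps its GPUs and keeps
    # running; only its standing in the next fair-share round changes.
    queues[job.priority].remove(job)
    job.priority = new_priority
    queues.setdefault(new_priority, []).append(job)
```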
When to choose each mode
The choice is driven by three factors: how cheap the preempted job's checkpoint is, how urgent the new job is, and how many GPU-hours you can afford to throw away.
Kill is right when the new job is genuinely urgent (production incident, customer-facing inference SLA) and the preempted job's checkpoint is cheap enough that losing the time-since-last-checkpoint is acceptable. Setup: short kill grace period (10-30 seconds), framework writes a quick "soft" checkpoint if it can, otherwise full restart.
Checkpoint and evict is right when the new job is important but not critical, and the preempted job's checkpoint is expensive (large model, many ranks, slow PFS). The scheduler grants a longer grace period (minutes) and the framework does a full state dump. The total cost is the new job's wait + the checkpoint write time + the preempted job's eventual resume read; for a 70B model that can be 5-10 minutes per side. Net: the new job waits 5-10 minutes, the preempted job loses no work, and total throughput is preserved. Most production training clusters default to this mode.
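The arithmetic behind that default is worth making explicit. A rough cost model with the section's numbers plugged in (all figures illustrative):

```python
def kill_cost_gpu_hours(gpus: int, minutes_since_ckpt: float) -> float:
    # Kill throws away everything since the last checkpoint.
    return gpus * minutes_since_ckpt / 60

def evict_cost_gpu_hours(gpus: int, write_min: float, read_min: float,
                         urgent_gpus: int) -> float:
    # Checkpoint-and-evict loses no work, but pays the checkpoint write and
    # the eventual resume read, and makes the urgent job wait out the write.
    return (gpus * (write_min + read_min) + urgent_gpus * write_min) / 60

# 256-GPU job, 50 min past an hourly checkpoint, vs an 8-minute 70B dump:
print(kill_cost_gpu_hours(256, 50))           # ~213 GPU-hours of lost work
print(evict_cost_gpu_hours(256, 8, 8, 32))    # ~73 GPU-hours of overhead
```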
Demote is right when the new job needs priority but not exclusivity. Useful for batch inference workloads, fine-tuning runs that can share with smaller training jobs, or when the new job fits on idle capacity that just was not visible to the priority calculation.
How preemption interacts with gang and fair-share
Preemption does not exist in a vacuum. Two interactions matter:
Gang preemption is harder than single-Pod preemption. Killing one Pod of a 64-rank gang takes the whole gang down (the remaining 63 will time out in their next collective). Schedulers with gang awareness (Slurm, Volcano, Kueue) preempt all-or-nothing: either the whole gang is killed/checkpointed/demoted, or none is. Schedulers without gang awareness can kill one Pod and trigger a cascade restart, which is worse than just killing the whole job in the first place.
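A sketch of the all-or-nothing rule, assuming send_sigterm and send_sigkill wrap whatever your cluster API exposes (a production scheduler would poll for Pod exit rather than sleep):

```python
import time

def preempt_gang(gang_pods, send_sigterm, send_sigkill,
                 checkpoint_first=True, grace_seconds=300):
    """All-or-nothing: every rank gets the same treatment, never a subset."""
    if checkpoint_first:
        for pod in gang_pods:
            send_sigterm(pod)        # open the coordinated checkpoint window
        time.sleep(grace_seconds)    # real schedulers poll for exit instead
    for pod in gang_pods:
        send_sigkill(pod)            # reclaim every rank's GPUs at once
```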
Fair-share preemption is what makes "this team has used too much, evict their oldest job" work. Without preemption, fair-share only affects future scheduling; with preemption, fair-share reclaims actively-running capacity. The combination is what most operators mean by "the cluster is fair."
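A sketch of the reclaim loop that phrase implies: pick the team furthest over its share, evict its oldest jobs first. The Team/Job shapes and the evict callable are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    gpus: int
    start_time: float

@dataclass
class Team:
    share: int                        # fair-share entitlement, in GPUs
    running: list[Job] = field(default_factory=list)

    @property
    def usage(self) -> int:
        return sum(j.gpus for j in self.running)

def reclaim(teams: list[Team], need_gpus: int, evict) -> int:
    """Evict oldest jobs from the most over-share team until enough frees up."""
    freed = 0
    for team in sorted(teams, key=lambda t: t.usage - t.share, reverse=True):
        for job in sorted(team.running, key=lambda j: j.start_time):
            if freed >= need_gpus:
                return freed
            evict(job)                # checkpoint-and-evict, then requeue
            team.running.remove(job)
            freed += job.gpus
    return freed
```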
What goes wrong in practice
Two common production issues:
- Checkpoint corruption under preemption. A SIGTERM that arrives mid-write can corrupt the checkpoint. The framework needs atomic checkpoint writes (write to a .tmp file, fsync, rename; see the sketch after this list), and the scheduler needs to grant a grace period long enough for the write to complete. PyTorch Lightning, DeepSpeed, and Megatron-LM all handle this; custom training loops often do not. When in doubt, shard checkpoints and write atomically.
- Preemption thrash. When two high-priority queues compete and each preempts the other in turn, the cluster does no useful work. Mitigate with MinRunTime policies (a job must run for N minutes before becoming preemptible) and per-team preemption budgets.
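The atomic-write pattern from the first bullet, as a minimal sketch (directory-level fsync after the rename is omitted for brevity):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write-to-.tmp, fsync, rename: a kill at any point leaves either the
    old checkpoint or the new one on disk, never a torn file."""
    dir_ = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # data is durable before the rename
        os.replace(tmp, path)        # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```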
Practical guidance
- Default to checkpoint-and-evict for training workloads. Set the grace period to twice your worst-case checkpoint time.
- Use kill for inference incidents where every second of new-job latency costs more than the GPU-hours of the preempted job.
- Use demote for cooperative-share scenarios; do not stretch it to cases where the new job actually needs the silicon.
- Pair preemption with gang awareness; preempting one Pod of a gang without coordinating breaks more than it fixes.
- Set MinRunTime to at least 5 minutes to avoid thrash, and higher for jobs with expensive startup; see the sketch below.
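A minimal sketch of the MinRunTime guard, combined with the per-team preemption budget mentioned under thrash above (names and thresholds are illustrative):

```python
import time

MIN_RUN_SECONDS = 5 * 60   # tune upward for jobs with expensive startup

def is_preemptible(job_start_time: float, team_budget_left: int) -> bool:
    # A job becomes preemptible only after MinRunTime, and only while the
    # preempting team still has budget; both guards damp preemption thrash.
    ran_long_enough = time.time() - job_start_time >= MIN_RUN_SECONDS
    return ran_long_enough and team_budget_left > 0
```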
The takeaway: preemption is what gives the scheduler the right to reclaim. Picking the right mode for each workload is the difference between a fleet that responds to urgency and one that thrashes itself into idle.