Kubernetes GPU Scheduling
Kubernetes does not natively understand GPUs. The kube-scheduler ships with knowledge of CPU, memory, ephemeral storage, and not much else. Every other resource (GPUs, RDMA NICs, Intel QAT, Habana accelerators) reaches the scheduler through a generic plug point called the device plugin API, and the gap between "I have GPUs in my cluster" and "the scheduler picks the right node for my Pod" is exactly that plug point and the operator stack around it.
How a GPU becomes nvidia.com/gpu: 4 in a Pod spec
The NVIDIA driver is a kernel module. Without it the GPU is invisible to userspace. The NVIDIA GPU Operator (an operator shipped as a Helm chart, which manages a set of node-level DaemonSets) is the standard way to install and version-pin the driver across every node in a cluster, along with the components that depend on it: the container toolkit, the device plugin, DCGM exporter for monitoring, and optionally the MIG manager.
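A values sketch for the chart, assuming its commonly documented top-level keys (the driver version pin is illustrative; check the chart's values reference for your release):

```yaml
# values.yaml for the NVIDIA gpu-operator chart
# install with: helm install gpu-operator nvidia/gpu-operator -n gpu-operator -f values.yaml
driver:
  enabled: true          # driver container, matched to the node's kernel
  version: "550.54.15"   # pin one driver version fleet-wide (illustrative)
toolkit:
  enabled: true          # NVIDIA container toolkit for containerd / CRI-O
devicePlugin:
  enabled: true          # registers nvidia.com/gpu with kubelet
dcgmExporter:
  enabled: true          # per-GPU telemetry on a Prometheus endpoint
migManager:
  enabled: false         # enable only on nodes you partition with MIG
```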
The device plugin is the piece that talks to kubelet over a Unix socket using the device-plugin gRPC protocol. On startup it enumerates the GPUs visible on the node (via NVML, the library nvidia-smi itself wraps) and registers each one as an instance of an extended resource named nvidia.com/gpu. Kubelet then publishes the count to the API server, which surfaces it as node.status.allocatable["nvidia.com/gpu"]. From that moment on, kube-scheduler treats GPUs like any other countable resource: a Pod requesting resources.limits["nvidia.com/gpu"]: 4 only fits on a node with at least 4 allocatable.
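The request side is plain Kubernetes; the only GPU-specific part is the extended resource name. A minimal sketch (the image is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.04-py3   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 4   # extended resource; kube-scheduler counts these
```

For extended resources the limit is the request: kubelet rejects overcommit, so there is no fractional nvidia.com/gpu without MIG or time-slicing (both covered below).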
When the scheduler picks a node and kubelet starts the container, kubelet calls the plugin's Allocate RPC; the plugin returns the assigned device IDs, and the container runtime (containerd or CRI-O through the NVIDIA container toolkit) mounts those device nodes into the container and whitelists them in its device cgroup. The container sees its assigned GPUs as /dev/nvidia0, /dev/nvidia1, and so on; the rest are not visible. Cgroup isolation is what enforces the count.
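A quick way to watch the isolation work, assuming any CUDA-capable image with nvidia-smi on its path (the image tag here is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
  - name: check
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
    command: ["nvidia-smi", "-L"]   # lists only the GPUs injected into this container
    resources:
      limits:
        nvidia.com/gpu: 1
```

On an 8-GPU node this prints exactly one GPU line.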
What the default scheduler does and does not do
The default kube-scheduler is a fitness check, not a placement strategy. It picks any node that has the requested count free. It does not consider topology (whether the 4 GPUs share an NVLink domain), it does not coordinate Pods that should start together (a 64-GPU training run that needs every Pod up before any can do work), it does not consider HBM bandwidth contention, and it sees MIG slice profiles only as opaque resource counts.
The patterns most teams reach for to fix these gaps:
- Topology hints via the topology manager. The kubelet topology manager can align CPU and GPU allocations to the same NUMA node (see the KubeletConfiguration fragment after this list). Useful for inference, less so for training that crosses NUMA anyway.
- A real gang scheduler. Volcano replaces the default scheduler for batch workloads and adds gang semantics, queues, and quotas; Kueue keeps the default scheduler and adds quota-aware, all-or-nothing job admission on top of it. One or the other is necessary for any multi-Pod training job spanning more than a few nodes (a Volcano sketch follows the list).
- MIG and MPS. MIG carves a GPU into hardware-isolated slices that the device plugin can expose as separate resources (nvidia.com/mig-3g.40gb). MPS lets multiple Pods share one GPU under a single CUDA context, which the device plugin can expose as fractional, time-sliced replicas (sketches below).
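For the topology-manager route, the knobs live in the kubelet config. A fragment, not a complete KubeletConfiguration (a full config also needs CPU reservations for the static policy to activate):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # prerequisite: pin containers to exclusive CPUs
topologyManagerPolicy: single-numa-node  # reject placements where CPU and GPU cross NUMA
topologyManagerScope: pod                # align all containers in the Pod together
```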
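For gang scheduling, a Volcano Job sketch for the 64-GPU run mentioned above: minAvailable is the gang, and no Pod starts until all eight fit at once (names and image are illustrative):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train
spec:
  schedulerName: volcano
  minAvailable: 8          # the gang: schedule all 8 Pods or none
  tasks:
  - name: worker
    replicas: 8
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: worker
          image: nvcr.io/nvidia/pytorch:24.04-py3   # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 8   # 8 Pods x 8 GPUs = the 64-GPU run
```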
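On the sharing side, a MIG slice is consumed like any other extended resource, and time-sliced replicas are a device-plugin config knob. Both sketches assume the operator-managed plugin and its documented config format; verify the profile name against your GPU model:

```yaml
# Pod consuming one hardware-isolated MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
  - name: server
    image: nvcr.io/nvidia/tritonserver:24.04-py3   # illustrative image
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1
---
# Device-plugin config (shipped to the plugin via a ConfigMap, not applied
# with kubectl directly): advertise 4 time-sliced replicas per physical GPU.
# Throughput sharing only — no memory isolation between the replicas.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```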
Operational realities
The GPU Operator is a hard dependency in production. Without it, every node needs manual driver installs, version-pinning becomes a per-node ticket, and a kernel upgrade silently breaks the device plugin until somebody notices. The Operator builds (or pulls precompiled) driver containers matched to the node's running kernel and reloads the module on upgrade.
DCGM exporter is the health-monitoring side of the same story. The Operator deploys it as a DaemonSet that scrapes per-GPU telemetry (temperature, ECC errors, utilization, XID events) and exposes it on a Prometheus endpoint. Without DCGM, the cluster knows a GPU exists; it does not know whether the GPU is healthy. Pair the device plugin's "this GPU is allocatable" with DCGM's "this GPU is degraded" through node taints or a separate health controller; otherwise kube-scheduler will happily place jobs on a GPU that is accumulating ECC errors (a PrometheusRule sketch follows).
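A minimal alerting rule, assuming the Prometheus Operator's PrometheusRule CRD and dcgm-exporter's default metric and label names (check both against your installed versions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-health
spec:
  groups:
  - name: gpu-health
    rules:
    - alert: GPUXidError
      # DCGM_FI_DEV_XID_ERRORS reports the most recent XID error on the GPU
      expr: DCGM_FI_DEV_XID_ERRORS > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "GPU on {{ $labels.Hostname }} reported an XID error"
```

The acting half, a controller that watches this alert and taints the node, is the part most teams build themselves; kubectl taint in a runbook is the minimum viable version.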
The other gotcha: kube-scheduler does not know about HBM. Two Pods each requesting nvidia.com/gpu: 1 can land on the same node, and if both have heavy HBM traffic on the same memory channels, they will throttle each other even though the count math is correct. For inference serving, this is rarely a problem; for training it is one of the reasons gang schedulers always win at fleet scale.
Practical guidance
- Install the GPU Operator. Do not roll your own driver installs at fleet scale.
- Run DCGM exporter and wire its alerts into your node-taint controller.
- Use the default kube-scheduler only for inference and single-node fine-tuning. Multi-Pod training needs Volcano or equivalent.
- For multi-tenant clusters, partition with MIG when isolation matters and MPS when only throughput does.
The takeaway: making Kubernetes see GPUs is a solved problem. Making Kubernetes schedule them well is a stack of conventions and add-ons that every fleet operator ends up rebuilding. The device plugin is the foundation; everything else in this chapter is a patch on top.
Updated 2026-05-10