
Multi-Tenant Isolation

Stacked boundaries that keep one tenant's GPU faults from affecting another's: namespace and RBAC, network policy, resource quota, MIG or MPS partitioning. Each layer catches a different blast type.
Outermost layer: namespace + RBAC
Hardware layer: MIG slice / MPS context
Failures absorbed: auth, network, resource, hardware

A multi-tenant GPU cluster is a shared liability. Tenant A's misbehaving CUDA kernel, tenant B's malicious image, and tenant C's runaway training script all want the same silicon, and any one of them can take the others down if the boundaries between them are weak. Multi-tenant isolation is the stack of layered fences that keeps a fault inside the tenant that caused it. Each layer catches a different class of blast.

The four-layer stack

Production multi-tenant clusters layer four kinds of boundary, from the outside in. Each one fails differently and catches a different class of fault.

Namespace and RBAC. The Kubernetes namespace is the auth boundary: API access, secrets, service accounts, role bindings. Done well, tenant A cannot see tenant B's secrets, list their Pods, exec into their containers, or read their logs. Done badly, a misconfigured ClusterRoleBinding gives one tenant cluster-wide read access. The annual security review is mostly about whether this layer holds.
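
What "done well" looks like in manifest form: a namespaced Role bound with a RoleBinding, never a ClusterRoleBinding. A minimal sketch; the names (tenant-a, tenant-a-devs) are illustrative.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: tenant-developer
      namespace: tenant-a            # rights stop at the namespace boundary
    rules:
      - apiGroups: [""]
        resources: ["pods", "pods/log", "secrets", "configmaps"]
        verbs: ["get", "list", "watch", "create", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: tenant-developer-binding
      namespace: tenant-a            # a RoleBinding, so the grant is namespaced too
    subjects:
      - kind: Group
        name: tenant-a-devs          # illustrative IdP group
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: tenant-developer
      apiGroup: rbac.authorization.k8s.io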

Network policy. Default Kubernetes networking lets every Pod talk to every other Pod. Network policies (via Calico, Cilium, or the CNI of choice) restrict ingress and egress. For GPU clusters this matters because cross-tenant traffic on the InfiniBand or RoCE fabric can exfiltrate model weights or cause packet drops that affect collective performance. Egress policy (no traffic to the open internet from training Pods) is the most-shipped pattern.
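
The egress pattern sketched as a manifest, assuming a CNI that enforces NetworkPolicy (Calico, Cilium): Pods in tenant-a may talk inside their own namespace and to cluster DNS, and nothing else leaves.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: deny-external-egress
      namespace: tenant-a
    spec:
      podSelector: {}                # every Pod in the namespace
      policyTypes: ["Egress"]        # unmatched egress is dropped
      egress:
        - to:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: tenant-a   # in-namespace traffic
        - to:
            - namespaceSelector: {}  # any namespace, but only port 53
          ports:
            - protocol: UDP
              port: 53               # cluster DNS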

Resource quota. ResourceQuota objects cap how much of a resource a namespace can request: CPU, memory, GPU count, ephemeral storage. Without quota, one tenant can submit jobs that consume the cluster's entire allocatable capacity, leaving kube-scheduler nothing to place for anyone else. With quota, even a misconfigured submission queue cannot consume more than the namespace's share. ResourceQuota alone is soft, admission-time enforcement; combined with fair-share scheduler queues it becomes a hard scheduling-time guarantee.
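
A hedged example of a namespace quota with a GPU cap. The numbers are placeholders, and the requests.nvidia.com/gpu key assumes the NVIDIA device plugin exposes GPUs as the extended resource nvidia.com/gpu.

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: tenant-a-quota
      namespace: tenant-a
    spec:
      hard:
        requests.cpu: "512"
        requests.memory: 2Ti
        requests.ephemeral-storage: 10Ti
        requests.nvidia.com/gpu: "32"   # hard cap on GPUs the namespace can request
        pods: "200"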

Hardware partitioning. MIG slices give each tenant a dedicated chunk of GPU silicon: separate SMs, separate HBM, separate L2, separate fault domain. A tenant's CUDA kernel that hangs the SM cannot affect another tenant's MIG instance. MPS gives shared SMs but not hardware isolation; appropriate for cooperative tenants in one trust domain, never for adversarial multi-tenancy.
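
What a tenant Pod on a MIG slice looks like, assuming the NVIDIA device plugin runs in its mixed MIG strategy so each profile is advertised as its own extended resource. Profile names vary by card (3g.40gb assumes an 80 GB A100), and the image is illustrative.

    apiVersion: v1
    kind: Pod
    metadata:
      name: trainer
      namespace: tenant-a
    spec:
      containers:
        - name: trainer
          image: registry.example.com/tenant-a/train:latest   # illustrative
          resources:
            limits:
              nvidia.com/mig-3g.40gb: 1   # one MIG slice, not the whole card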

Diagram: a tenant Pod inside nested boundaries: namespace + RBAC (auth boundary), network policy (egress / ingress), resource quota (GPU / CPU / mem cap), and MIG slice / MPS (hardware isolation); the fault stays contained at the innermost layer that catches it.

What each layer actually catches

The layers are not redundant; each one catches faults the others miss.

A naive misconfigured Pod (tenant A asks for 8 GPUs but writes their YAML wrong) bounces off the namespace's ResourceQuota and never reaches kube-scheduler. A malicious Pod (tenant A tries to read tenant B's logs through the API server) is stopped at RBAC. A noisy network neighbor (tenant A's data loader floods the rack switch with traffic) is throttled by network policy. A CUDA bug (tenant A's kernel hangs the GPU's SM) is contained by MIG: tenant B's MIG slice on the same physical card keeps running.

What no layer catches: a hardware failure that takes the whole GPU offline. An XID 79 (GPU has fallen off the bus) takes down every MIG slice on that card. A power supply failure takes down every Pod on the node. Hardware blast radius is bounded by the GPU itself, then by the node, then by the rack PDU; isolation layers are software. See fault domains for the hardware angle.

How isolation interacts with the scheduler

Two scheduler-side patterns make isolation usable:

Per-tenant queues. A scheduler queue per tenant (Volcano, Slurm partition, Kueue ClusterQueue) gives the scheduler a place to enforce quotas and fair-share before any Pod is created. Without per-tenant queues, all tenants compete in one priority pool and quota becomes a Pod-creation-time check rather than a scheduling-time decision.
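
A sketch of the Volcano flavor, with an illustrative weight and GPU ceiling; jobs then name the queue in their spec (spec.queue: tenant-a) so fair-share applies before any Pod exists.

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: tenant-a               # one Queue per tenant
    spec:
      weight: 4                    # fair-share weight relative to other queues
      capability:
        nvidia.com/gpu: "32"       # ceiling across everything in the queue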

Anti-affinity for sensitive workloads. Tenants with strict isolation requirements (HIPAA workloads, customer-sensitive training data) can use Pod anti-affinity to force their Pods onto nodes where no other tenant runs. Cost: less bin-packing efficiency, less utilization. Right call for a small fraction of tenants, wrong default for a multi-tenant cluster.
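
A Pod-spec fragment for the strict case, assuming the convention that every tenant Pod carries a tenant label; the Exists clause keeps unlabeled system DaemonSets from repelling the Pod off every node.

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname   # "no shared node"
            labelSelector:
              matchExpressions:
                - key: tenant
                  operator: Exists                # only consider labeled tenant Pods
                - key: tenant
                  operator: NotIn
                  values: ["tenant-a"]            # repel any other tenant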

The boundary that gets the most operational attention is the GPU partitioning layer. MIG slices are configured at boot and require a node drain to reconfigure; MPS daemons share fault domains. Production multi-tenant offerings (cloud GPU services, internal platforms) typically pick one strategy per node pool and route tenants accordingly.

What goes wrong

Three failure patterns to watch for:

  1. Network policy gaps on the high-speed fabric. Network policies typically apply only to the K8s overlay network (eth0). If your tenants do RDMA over a separate IB or RoCE NIC, the policies do not apply unless you have a CNI that supports multi-NIC policy enforcement (e.g., Multus plus a second-NIC policy plugin; a hedged sketch follows this list). Tenant traffic on the IB fabric is often un-policed by default.

  2. DCGM metrics leakage. Per-Pod GPU metrics from DCGM exporter expose runtime characteristics (HBM usage patterns, kernel timing) that can fingerprint a tenant's model. Restrict the metrics endpoint with RBAC if your tenants are adversarial.

  3. Driver / firmware upgrade as a fault domain. A driver update that misbehaves takes down every tenant on every node it touches simultaneously. Roll updates one node pool at a time, not fleet-wide, so a bad driver is caught before it reaches every tenant.
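
For failure pattern 1, one hedged option is the multi-networkpolicy CRD from the Kubernetes Network Plumbing Working Group, which applies NetworkPolicy-shaped rules to Multus secondary interfaces. The sketch below assumes a NetworkAttachmentDefinition named rdma-net and an installed enforcement daemon (e.g., multi-networkpolicy-iptables); verify that your RDMA CNI supports enforcement before relying on it.

    apiVersion: k8s.cni.cncf.io/v1beta1
    kind: MultiNetworkPolicy
    metadata:
      name: isolate-rdma
      namespace: tenant-a
      annotations:
        k8s.v1.cni.cncf.io/policy-for: tenant-a/rdma-net   # the secondary NIC's attachment
    spec:
      podSelector: {}
      policyTypes: ["Ingress"]
      ingress:
        - from:
            - podSelector: {}      # only Pods in this namespace may send on the fabric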

Practical guidance

  • Use namespaces + RBAC + ResourceQuota + NetworkPolicy as the baseline. None of these layers alone is sufficient; together they catch most non-hardware faults.
  • Pick MIG over MPS for adversarial multi-tenancy. MPS is for cooperative tenants in one trust domain.
  • Add per-tenant Volcano queues so scheduler decisions respect tenancy from the priority calculation onward.
  • Audit DCGM endpoints and IB fabric policies; default configs leak more than most operators realize.

The takeaway: isolation is a stack, not a single fence. Each layer catches a different fault class, and a single missing layer turns the cluster into a single shared blast domain. Build the stack early; retrofitting isolation onto a running multi-tenant fleet is expensive.


Updated 2026-05-10