MPS (Multi-Process Service)
When two CUDA processes both want the same GPU, the default behavior is time-slicing: each process gets exclusive use for a brief window, then the driver context-switches to the other. The context switch flushes pipeline state, invalidates caches, and costs roughly 100 microseconds per switch. For workloads doing many small kernels, that overhead can eat 20% of throughput. Multi-Process Service (MPS) is NVIDIA's fix: route all the processes through one shared CUDA context with concurrent kernel execution and no switching.
What MPS actually does
MPS runs as a control daemon (nvidia-cuda-mps-control) on the host, which spawns the actual server process (nvidia-cuda-mps-server) that owns the shared context. CUDA processes that connect to it appear to the GPU as a single context with multiple streams; their kernels can execute concurrently on separate SMs. The hardware scheduler arbitrates which SM runs which stream's kernel; from the application's perspective, each process still calls cudaLaunchKernel and gets results back, but under the hood the work has been blended.
Activating it looks like this:
# Start the MPS control daemon
nvidia-cuda-mps-control -d
# In each application process
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
python my_inference_server.py

Once the daemon is running, every CUDA process on the host that sets these environment variables routes through MPS automatically. Stop the daemon with echo quit | nvidia-cuda-mps-control.
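To confirm that clients are actually attaching, the control daemon accepts interactive queries. A quick check might look like this (the server PID comes from the first command):

# List running MPS server instances
echo get_server_list | nvidia-cuda-mps-control
# List the client processes attached to a given server
echo "get_client_list $SERVER_PID" | nvidia-cuda-mps-control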
Where MPS wins
Two main cases. First, multiple small inference workers on one GPU. A model that uses 30% of an H100's SMs but holds the GPU for the duration of a request will, under default sharing, force every other request to context-switch. With MPS, requests run concurrently and the GPU stays busy. vLLM and TensorRT-LLM both support MPS deployment for higher per-GPU throughput when one model does not saturate the silicon.
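A sketch of that first case; the worker script and its port flag are placeholders, not a real server:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# Four copies of a small inference worker, all attaching to the same MPS server
for i in 0 1 2 3; do
  python my_inference_server.py --port $((8000 + i)) &
done
wait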
Second, MPI training jobs where many ranks share a GPU. The classic case is HPC simulations where 8 or 16 ranks per node want to launch CUDA work; default sharing makes them serialize at the GPU. MPS lets them issue concurrently. The overhead of the daemon is minimal (one process, low resource use); the throughput win can be 2x or more on launch-bound workloads.
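A sketch of the MPI case, assuming Open MPI (whose mpirun exports environment variables with -x) and a placeholder binary ./my_sim:

# One MPS daemon per node, then eight ranks share the GPU
nvidia-cuda-mps-control -d
mpirun -np 8 -x CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
       -x CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log ./my_sim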
Where MPS does not work
MPS shares one CUDA context. That means:
- No isolation. A bug in one client (a kernel that hangs, an out-of-memory error, a cudaErrorIllegalAddress) takes down all clients. There is no fault containment.
- No memory partitioning. All clients see the same HBM pool. If client A allocates 60 GB on an 80 GB H100, client B sees an OOM error on its next allocation. You can use CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to limit each client's SM share, but memory remains shared.
- No security boundary. Clients can read each other's memory if they have the addresses. MPS is for cooperative workloads inside one trust domain, not for multi-tenant clusters.
This is the contrast with MIG: MIG carves the silicon at the hardware level (separate SMs, separate HBM, separate fault domains) at the cost of inflexibility (profiles fixed at boot). MPS shares the silicon at the software level (one context, all SMs available to all clients) at the cost of zero isolation.
What goes wrong in practice
The MPS daemon does not survive driver reload or GPU reset. If nvidia-smi resets a GPU (because of an XID error or operator action), every MPS client crashes simultaneously. Production deployments need MPS health checks in their orchestration layer; Kubernetes operators handle this via the GPU Operator's MPS strategy, but Slurm setups have to script it.
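A minimal health-check sketch for the scripted case; the five-second timeout and restart-in-place policy are assumptions, not recommendations:

#!/bin/bash
# If the control daemon hangs or is gone, tear it down and restart it
if ! timeout 5 bash -c 'echo get_server_list | nvidia-cuda-mps-control' > /dev/null 2>&1; then
  echo "MPS daemon unresponsive, restarting" >&2
  echo quit | nvidia-cuda-mps-control 2>/dev/null
  nvidia-cuda-mps-control -d
fi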
The other gotcha: MPS adds latency to the first kernel launch from each client (roughly 1-2 ms while the daemon registers the client). For latency-sensitive inference serving, that first-launch cost matters and warming up the connection at startup is a common pattern.
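One way to do that warmup from the launch script, assuming the server exposes an HTTP endpoint that exercises the model (the port and path here are invented for illustration):

# Hit the server once before it takes real traffic, so the first-launch
# MPS registration cost is paid outside the request path
python my_inference_server.py &
until curl -sf http://localhost:8000/warmup > /dev/null; do sleep 0.5; done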
Practical guidance
- Use MPS for cooperative workloads from one team or one trust domain: ML inference servers, MPI training, multi-rank-per-GPU HPC jobs.
- Do not use MPS for multi-tenant clusters; the lack of isolation is a security and reliability hole.
- Pair MPS with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to cap each client's SM share when you need predictable per-client throughput (a sketch follows this list).
- If you need both isolation and sharing, use MIG; if you need throughput from cooperative workloads, use MPS; if you need both isolation and a single CUDA context, you are out of luck on current hardware.
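A sketch of the per-client cap, with illustrative percentages and placeholder worker scripts; the variable must be set before each client initializes CUDA:

# Give each of two workers roughly 40% of the SMs (memory is still shared)
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=40 python worker_a.py &
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=40 python worker_b.py &
wait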
The takeaway: MPS is the cheapest way to fix per-process context-switch overhead on a shared GPU. It is not partitioning; it is removal of switching cost. Pick the right tool: MIG for isolation, MPS for throughput, default sharing for neither.