GPU Partitioning
Sharing a single GPU across workloads via MIG, MPS, or time-slicing.
What it is
GPU partitioning divides a single physical GPU's compute and memory across multiple concurrent workloads using one of three mechanisms. MIG (Multi-Instance GPU) provides hardware-level isolation, with dedicated SMs, L2 cache, and HBM per partition. MPS (Multi-Process Service) provides fine-grained compute sharing across CUDA contexts, with per-client compute limits configurable via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. Time-slicing, exposed through the Kubernetes device plugin, round-robins contexts on the GPU with no memory or fault isolation.
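To make the MPS knob concrete, here is a minimal Python sketch that launches a CUDA workload as an MPS client capped at roughly 40% of the GPU's SMs. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is the documented MPS environment variable; the worker.py script, the 40% figure, and an already-running MPS control daemon are illustrative assumptions.

```python
import os
import subprocess

# Assumes the MPS control daemon (nvidia-cuda-mps-control) is already
# running on this node. The percentage below is a per-client compute
# hint, not a hard hardware partition like MIG.
env = os.environ.copy()
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "40"  # illustrative 40% cap

# "worker.py" is a hypothetical CUDA workload; substitute a real script.
subprocess.run(["python", "worker.py"], env=env, check=True)
```

Because the cap is applied per CUDA context rather than enforced in hardware, two clients set to 40% can still contend for memory bandwidth; only MIG carves those resources off physically.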
Why it matters
Choosing the wrong partitioning mode for a workload creates either wasted capacity or an outsized fault blast radius. MPS offers finer granularity than MIG but lacks memory fault isolation: a fatal CUDA error in one MPS client can bring down all co-located processes. Time-slicing adds context-switch overhead and provides no isolation at all, making it unsuitable for latency-sensitive inference. An incorrect partitioning choice silently degrades both throughput and reliability.
How to monitor
Track utilization and memory usage at the partition level using per-MIG-instance DCGM fields or per-PID nvidia-smi output for MPS clients. Monitor for unexpected process crashes that may indicate cross-client fault propagation in MPS setups. Factryze tracks utilization and error rates across all three partitioning modes and identifies underprovisioned and overprovisioned partitions based on observed workload profiles.
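As a sketch of the per-PID view, the snippet below uses the NVML Python bindings (the nvidia-ml-py package) to read whole-device utilization and per-process memory on device 0. The device index is an assumption, and depending on driver version MPS clients may be attributed to the MPS server process rather than listed individually; DCGM's per-MIG-instance fields remain the production path for MIG partitions.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: monitor GPU 0

# Whole-device utilization: percent of time SMs were active, and
# percent of time the memory controller was busy.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"sm_util={util.gpu}% mem_util={util.memory}%")

# Per-PID memory usage. Several PIDs vanishing from this list at once
# can signal cross-client fault propagation under MPS.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    mem = proc.usedGpuMemory  # may be None without sufficient privileges
    mem_mib = (mem or 0) // (1024 * 1024)
    print(f"pid={proc.pid} used_mem_mib={mem_mib}")

pynvml.nvmlShutdown()
```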
Related terms
Multi-Instance GPU (MIG): Hardware partitioning on A100/H100 GPUs creating up to seven isolated GPU instances.
SM Utilization: Percentage of time GPU streaming multiprocessors are actively executing kernels.
GPU Scheduling: Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.