GPU Partitioning
Sharing a single GPU across workloads via MIG, MPS, or time-slicing.
What it is
GPU partitioning divides a single physical GPU's compute and memory across multiple concurrent workloads using one of three mechanisms. MIG (Multi-Instance GPU) provides hardware-level isolation, with dedicated SMs, L2 cache, and HBM per partition. MPS (Multi-Process Service) provides fine-grained compute sharing across CUDA contexts, with per-client compute limits configurable via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. Time-slicing, exposed through the Kubernetes device plugin, round-robins contexts on the GPU with no memory or fault isolation.
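To make the MPS knob concrete, here is a minimal Python sketch that launches a CUDA workload as an MPS client capped at roughly 40% of the GPU's SMs. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is the documented MPS environment variable; the worker.py script, the 40% figure, and an already-running MPS control daemon are illustrative assumptions.

```python
import os
import subprocess

# Assumes the MPS control daemon (nvidia-cuda-mps-control) is already
# running on this node. The percentage below is a per-client compute
# hint, not a hard hardware partition like MIG.
env = os.environ.copy()
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "40"  # illustrative 40% cap

# "worker.py" is a hypothetical CUDA workload; substitute a real script.
subprocess.run(["python", "worker.py"], env=env, check=True)
```

Because the cap is applied per CUDA context rather than enforced in hardware, two clients set to 40% can still contend for memory bandwidth; only MIG carves those resources off physically.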
Why it matters
Choosing the wrong partitioning mode for a workload creates either wasted capacity or an outsized fault blast radius. MPS offers finer granularity than MIG but lacks memory fault isolation: a fatal CUDA error in one MPS client can bring down all co-located processes. Time-slicing adds context-switch overhead and provides no isolation at all, making it unsuitable for latency-sensitive inference. An incorrect partitioning choice silently degrades both throughput and reliability.
How to monitor
Track utilization and memory usage at the partition level using per-MIG-instance DCGM fields or per-PID nvidia-smi output for MPS clients. Monitor for unexpected process crashes that may indicate cross-client fault propagation in MPS setups. Factryze tracks utilization and error rates across all three partitioning modes and identifies underprovisioned and overprovisioned partitions based on observed workload profiles.
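As a sketch of the per-PID view, the snippet below uses the NVML Python bindings (the nvidia-ml-py package) to read whole-device utilization and per-process memory on device 0. The device index is an assumption, and depending on driver version MPS clients may be attributed to the MPS server process rather than listed individually; DCGM's per-MIG-instance fields remain the production path for MIG partitions.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: monitor GPU 0

# Whole-device utilization: percent of time SMs were active, and
# percent of time the memory controller was busy.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"sm_util={util.gpu}% mem_util={util.memory}%")

# Per-PID memory usage. Several PIDs vanishing from this list at once
# can signal cross-client fault propagation under MPS.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    mem = proc.usedGpuMemory  # may be None without sufficient privileges
    mem_mib = (mem or 0) // (1024 * 1024)
    print(f"pid={proc.pid} used_mem_mib={mem_mib}")

pynvml.nvmlShutdown()
```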
Related terms
Multi-Instance GPU (MIG): Hardware partitioning on A100/H100 GPUs creating up to seven isolated GPU instances.
SM Utilization: Percentage of time GPU streaming multiprocessors are actively executing kernels.
GPU Scheduling: Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.