
GPU Partitioning

Sharing a single GPU across workloads via MIG, MPS, or time-slicing mechanisms.

What it is

GPU partitioning divides a single physical GPU's compute and memory across multiple concurrent workloads. Three mechanisms exist: MIG (Multi-Instance GPU), which provides hardware-level isolation with dedicated SMs, L2 cache, and HBM per partition; MPS (Multi-Process Service), which shares compute at fine granularity across CUDA contexts, with per-client limits configurable via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE; and time-slicing via the Kubernetes device plugin, which round-robins contexts with no memory or fault isolation.
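As a rough sketch, the three modes are set up as follows. The MIG profile name (3g.40gb) and the 50% MPS limit are example values only; valid profiles depend on the GPU model, and the time-slicing replica count is workload-dependent:

```shell
# --- MIG: carve the GPU into hardware-isolated instances ---
# Enable MIG mode on GPU 0 (takes effect after a GPU reset).
sudo nvidia-smi -i 0 -mig 1
# Create two GPU instances with default compute instances;
# the "3g.40gb" profile is an example and varies by GPU model.
sudo nvidia-smi mig -i 0 -cgi 3g.40gb,3g.40gb -C

# --- MPS: share one GPU's SMs across CUDA contexts ---
# Cap each MPS client at ~50% of the SMs (example value).
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
nvidia-cuda-mps-control -d   # start the MPS control daemon

# --- Time-slicing: NVIDIA Kubernetes device plugin config (example) ---
# Advertises each physical GPU as 4 schedulable replicas; the scheduler
# sees 4 "GPUs", but contexts share the device with no isolation.
cat <<'EOF' > time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
EOF
```

Note that the three modes sit at different layers: MIG partitions the hardware itself, MPS multiplexes within one CUDA context's worth of hardware, and time-slicing is purely a scheduler-level overcommit.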

Why it matters

Choosing the wrong partitioning mode for the workload creates either wasted capacity or an oversized fault blast radius. MPS offers finer granularity than MIG but lacks memory fault isolation: a CUDA error in one MPS client can crash all co-located processes. Time-slicing introduces context-switch overhead and provides no isolation, making it unsuitable for latency-sensitive inference. Incorrect partitioning silently degrades both throughput and reliability.

How to monitor

Track utilization and memory usage at the partition level using per-MIG-instance DCGM fields or per-PID nvidia-smi output for MPS clients. Monitor for unexpected process crashes that may indicate cross-client fault propagation in MPS setups. Factryze tracks utilization and error rates across all three partitioning modes and identifies underprovisioned and overprovisioned partitions based on observed workload profiles.
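As a sketch of the kind of commands involved (DCGM profiling field IDs 1001 and 1005 correspond to graphics-engine and DRAM activity; exact field availability depends on the GPU and DCGM version):

```shell
# Per-entity utilization via DCGM: field 1001 = graphics engine
# activity, 1005 = DRAM activity; dcgmi reports MIG entities separately.
dcgmi dmon -e 1001,1005

# Per-process GPU memory, covering MPS clients sharing one GPU.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv

# List MIG GPU instances to map partitions to workloads.
nvidia-smi mig -lgi
```

Comparing per-partition utilization against the partition's allocated share is what reveals under- and overprovisioning: a 3g.40gb instance that never exceeds 20% activity is a candidate for a smaller profile.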

[Figure: GPU Partitioning - Sharing a Single GPU]
