MIG (Multi-Instance GPU)
Hardware partitioning on A100/H100 GPUs creating up to seven isolated GPU instances.
What it is
MIG (Multi-Instance GPU) is a hardware partitioning feature on NVIDIA A100, H100, and later GPUs that divides a single physical GPU into up to seven isolated instances, each with dedicated SMs, memory bandwidth, L2 cache, and HBM capacity. Available profiles define partition geometry -- on A100 80GB these range from 7g.80gb (full GPU) down to 1g.10gb. MIG instances are created via nvidia-smi mig commands and exposed to Kubernetes via the NVIDIA device plugin or to Slurm via GRES resource definitions.
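As a sketch of the provisioning flow described above, creating a MIG layout on an A100 80GB might look like the following. This is illustrative only: profile names, profile availability, and whether a GPU reset is needed vary by GPU model and driver version.

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports
sudo nvidia-smi mig -lgip

# Create two 3g.40gb GPU instances and their default compute instances (-C)
sudo nvidia-smi mig -cgi 3g.40gb,3g.40gb -C

# Verify the resulting GPU instance layout
sudo nvidia-smi mig -lgi
```

Once instances exist, the NVIDIA device plugin (Kubernetes) or GRES definitions (Slurm) expose them as schedulable resources.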
Why it matters
MIG provides true hardware-level fault isolation: an ECC error in one instance does not affect the others, unlike MPS or time-slicing, where workloads share a single fault domain. It enables up to 7x higher model density for small inference workloads compared to whole-GPU allocation. However, changing MIG profiles requires destroying the existing instances, which terminates every workload running on that GPU -- profile reconfiguration must therefore be coordinated with the scheduler.
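The drain-before-reprofile constraint can be sketched as a minimal scheduler-side check. All names here (`MigGpu`, `can_reprofile`) are hypothetical illustrations, not a real scheduler API:

```python
from dataclasses import dataclass, field

@dataclass
class MigGpu:
    """Hypothetical scheduler-side view of one physical GPU in MIG mode."""
    gpu_id: int
    profile: str                              # e.g. "7x 1g.10gb"
    running_jobs: set = field(default_factory=set)

def can_reprofile(gpu: MigGpu) -> bool:
    # Changing MIG geometry destroys every instance on the GPU, so the
    # scheduler must drain it first: reconfigure only when no jobs remain.
    return not gpu.running_jobs

gpu = MigGpu(gpu_id=0, profile="7x 1g.10gb", running_jobs={"job-42"})
assert not can_reprofile(gpu)   # job-42 would be killed by reprofiling
gpu.running_jobs.discard("job-42")
assert can_reprofile(gpu)       # drained: safe to destroy and recreate instances
```

A real scheduler would additionally cordon the GPU so no new jobs land on it while the reconfiguration is in flight.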
How to monitor
Track ECC errors, utilization, and memory usage (DCGM_FI_DEV_FB_USED) at the per-instance level using DCGM, which enumerates each MIG instance as its own entity with a MIG UUID. Note that the device-level DCGM_FI_DEV_GPU_UTIL metric is not reported per MIG instance; use the profiling metrics (e.g. DCGM_FI_PROF_GR_ENGINE_ACTIVE) for per-instance utilization instead. Confirm the MIG profile configuration matches the intended layout via nvidia-smi mig -lgi. Factryze monitors MIG instances at both the physical-GPU and per-instance level and can automatically reconfigure profiles between jobs based on workload requirements.
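Verifying that the live layout matches the intended one can be automated by parsing the `nvidia-smi mig -lgi` output. A minimal sketch, assuming only that each listed instance contains a `MIG <profile>` token (the table layout itself varies across driver versions, and the sample below is illustrative, not verbatim output):

```python
import re
from collections import Counter

def parse_lgi(output: str) -> Counter:
    """Count MIG profiles in `nvidia-smi mig -lgi` text output.

    We rely only on the 'MIG <profile>' token naming each GPU instance,
    not on the surrounding table formatting.
    """
    return Counter(re.findall(r"MIG\s+(\d+g\.\d+gb)", output))

# Trimmed sample in the style of an A100 80GB listing (illustrative)
sample = """
|   0  MIG 3g.40gb          9        1          4:4     |
|   0  MIG 3g.40gb          9        2          0:4     |
|   0  MIG 1g.10gb         19       13          8:1     |
"""

intended = Counter({"3g.40gb": 2, "1g.10gb": 1})
assert parse_lgi(sample) == intended  # live layout matches the intended plan
```

In practice this check would run after every reconfiguration and periodically thereafter, alerting when the counts drift from the scheduler's intended plan.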
Related terms
Sharing a single GPU across workloads via MIG, MPS, or time-slicing mechanisms.
Allocating GPU cluster resources using FIFO, fair-share, or priority-based policies.
Percentage of time GPU streaming multiprocessors are actively executing kernels.
Monitor this automatically
Factryze correlates GPU signals in real time: errors, clocks, and fabric health.