Cluster Management

MIG (Multi-Instance GPU)

Hardware partitioning on A100/H100 GPUs that creates up to seven isolated GPU instances.

What it is

MIG (Multi-Instance GPU) is a hardware partitioning feature on NVIDIA A100, H100, and later GPUs that divides a single physical GPU into up to seven isolated instances, each with dedicated SMs, memory bandwidth, L2 cache, and HBM capacity. Available profiles define partition geometry -- on A100 80GB these range from 7g.80gb (full GPU) down to 1g.10gb. MIG instances are created via nvidia-smi mig commands and exposed to Kubernetes via the NVIDIA device plugin or to Slurm via GRES resource definitions.
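A sketch of the creation workflow with nvidia-smi (GPU index 0 and the 3g.40gb / 1g.10gb profile choices are illustrative; enabling MIG mode resets the GPU and typically needs root):

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports
nvidia-smi mig -lgip

# Create two GPU instances by profile name, with default compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 3g.40gb,1g.10gb -C

# Verify the resulting layout; MIG devices get their own UUIDs
nvidia-smi mig -lgi
nvidia-smi -L
```

The UUIDs printed by nvidia-smi -L are what the Kubernetes device plugin and Slurm GRES definitions ultimately expose to workloads.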

Why it matters

MIG provides true hardware-level fault isolation: an ECC error in one instance does not affect others, unlike MPS or time-slicing. It enables 7x higher model density for small inference workloads compared to whole-GPU allocation. However, changing MIG profiles requires destroying existing instances, which terminates all running workloads on that GPU -- profile reconfiguration must be carefully coordinated with the scheduler.
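Because reconfiguration destroys existing instances, it has to follow a drain-and-rebuild sequence. A hedged sketch (profile names illustrative; assumes the scheduler has already drained all jobs from GPU 0):

```shell
# Destroy compute instances first, then GPU instances (order matters)
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi

# Recreate with a new geometry, e.g. seven 1g.10gb slices for dense inference
sudo nvidia-smi mig -i 0 -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
```

Running the destroy commands while workloads are still active terminates them, which is why this sequence belongs behind a scheduler drain step.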

How to monitor

Track ECC errors, utilization (DCGM_FI_DEV_GPU_UTIL), and memory usage (DCGM_FI_DEV_FB_USED) at the per-instance level using DCGM with MIG instance UUIDs. Confirm MIG profile configuration matches the intended layout via nvidia-smi mig -lgi. Factryze monitors MIG instances at both the physical GPU and per-instance level and can automatically reconfigure profiles between jobs based on workload requirements.
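A minimal monitoring sketch, assuming DCGM is installed (the field IDs 203 for DCGM_FI_DEV_GPU_UTIL and 252 for DCGM_FI_DEV_FB_USED are assumptions to verify against your DCGM version):

```shell
# Confirm the MIG layout matches the intended geometry
nvidia-smi mig -lgi

# List the GPUs and MIG entities known to the DCGM host engine
dcgmi discovery -l

# Stream per-entity utilization and framebuffer usage
# (with MIG enabled, dcgmi enumerates MIG instances as separate entities)
dcgmi dmon -e 203,252
```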

[Figure: MIG - Multi-Instance GPU Partitioning]
