MIG Partitioning

Multi-Instance GPU divides one A100 or H100 into up to 7 fully isolated GPU slices, each with its own SMs and HBM partition. The right answer when one job cannot fill a whole GPU.
Hardware: A100, H100, H200
Max slices: 7 per GPU
Profiles (H100): 1g.10gb, 2g.20gb, 3g.40gb, 4g.40gb, 7g.80gb

Multi-Instance GPU is NVIDIA's hardware partitioning feature: one A100 or H100 is carved into up to seven independent GPU instances, each with its own dedicated SMs, HBM, L2 cache, copy engines, and fault domain. Two jobs running on two MIG slices of the same physical GPU cannot starve each other of memory bandwidth, cannot crash each other through a kernel error, and cannot see each other's address space. It is the single best tool you have when one workload cannot fill a whole H100.

What MIG actually partitions

The H100 exposes seven GPCs (Graphics Processing Clusters) to MIG, each containing roughly 18 SMs, and MIG divides at the GPC boundary. A "1g" slice is one GPC with its share of HBM (10 GB on the 80 GB SKU). A "3g" slice is three GPCs with 40 GB. A "7g" slice is the whole GPU. The supported profiles on H100 80 GB are:

  • 1g.10gb (7 max instances)
  • 1g.20gb (4 max)
  • 2g.20gb (3 max)
  • 3g.40gb (2 max)
  • 4g.40gb (1 max; the remaining 3 GPCs can still host a 3g.40gb or smaller slices)
  • 7g.80gb (1 max, full GPU)
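
What a given card and driver actually support is easy to check directly; a quick look, assuming the GPU of interest is index 0:

  # Show which GPU instance profiles this GPU/driver supports, and how many
  # instances of each profile are still free to be created
  nvidia-smi mig -i 0 -lgip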

You configure profiles with nvidia-smi mig -cgi (create GPU instances) and -cci (create compute instances inside them); CUDA jobs then see each instance as a separate device with its own MIG UUID.
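
As a minimal sketch of the setup flow, assuming a single H100 80 GB at index 0 and the 3g.40gb + 3g.40gb + 1g.10gb layout shown below (profile names can be swapped for the numeric IDs that -lgip prints):

  # Enable MIG mode on GPU 0 (takes effect once the GPU is idle; some
  # platforms need a GPU reset or reboot first)
  nvidia-smi -i 0 -mig 1

  # Create two 3g.40gb instances and one 1g.10gb instance; -C also creates
  # a default compute instance inside each GPU instance
  nvidia-smi mig -i 0 -cgi 3g.40gb,3g.40gb,1g.10gb -C

  # Each instance now appears as its own device with a MIG UUID
  nvidia-smi -L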

[Diagram: example layouts — 7 × 1g.10gb; 3g.40gb + 3g.40gb + 1g.10gb; 1 × 7g.80gb (full GPU). Each slice has dedicated SMs and HBM; isolation is enforced by hardware.]

Where MIG is the right answer

The classic case: inference serving where each model copy needs 10 GB or so and a 1/7 share of the SMs. Without MIG, you run one model per GPU and waste roughly 80% of the card. With MIG, you serve seven copies on one GPU, one per 1g.10gb slice, and saturate it. Triton Inference Server, vLLM, and TensorRT-LLM all support MIG-aware deployment via Kubernetes device plugins (the NVIDIA k8s device plugin reports each MIG instance as a separately schedulable resource).
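
On the Kubernetes side, a sketch of what a slice request looks like, assuming the device plugin runs with its "mixed" MIG strategy (which exposes each profile as a distinct extended resource); the pod name and image tag are placeholders:

  # mig-pod.yaml -- request one 1g.10gb slice rather than a whole GPU
  apiVersion: v1
  kind: Pod
  metadata:
    name: small-model-server
  spec:
    containers:
    - name: server
      image: nvcr.io/nvidia/tritonserver:<tag>
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1

  # Apply with: kubectl apply -f mig-pod.yaml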

The other case: multi-tenant clusters where small experimental jobs would otherwise take whole GPUs. A research user who wants to fine-tune a 7B model on one H100 for an afternoon does not need 80 GB; a 3g.40gb slice gets the job done and leaves room for someone else on the same physical GPU. Slurm and Kubernetes both support MIG with the right plugin configuration.
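
Outside a scheduler, a job can be pinned to one slice by UUID; a sketch, with the UUID and training script as placeholders:

  # List the MIG devices on this node; output looks roughly like the
  # commented lines below (UUIDs elided)
  nvidia-smi -L
  #   GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
  #     MIG 3g.40gb Device 0: (UUID: MIG-...)
  #     MIG 3g.40gb Device 1: (UUID: MIG-...)

  # Run the afternoon fine-tune on one 3g.40gb slice only
  CUDA_VISIBLE_DEVICES=MIG-<uuid> python finetune.py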

Where MIG is the wrong answer

Three cases where you do not want MIG:

  1. Training that needs the full HBM. A 13B model in BF16 with optimizer state needs roughly 78 GB (see the arithmetic sketch after this list). A 3g.40gb slice cannot hold it. The slice ceiling is the constraint, not the SM count.

  2. Workloads that benefit from peer-to-peer. MIG slices on the same GPU cannot communicate over NVLink, and CUDA peer-to-peer between instances is not supported; traffic between them stages through host memory. If you wanted "two GPUs on one card," MIG is not it.

  3. Workloads where the throughput-per-dollar of running one job on the whole GPU beats sharing. If a 70B model serves 1000 tokens/sec on the full H100 but only 200 tokens/sec on a 4g.40gb slice (because of HBM bandwidth, not just SM count), MIG cuts your throughput per GPU even though it raises utilization. Always profile.
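
Back-of-envelope arithmetic for case 1, under one plausible accounting (BF16 weights plus two BF16 Adam moments at 2 bytes each, gradients and activations not counted); the exact total shifts with optimizer and precision choices:

  # 13e9 params x (2 + 2 + 2) bytes/param = 78e9 bytes, well past the 40 GB
  # ceiling of a 3g.40gb slice before activations are even counted
  echo "$((13 * (2 + 2 + 2))) GB"   # prints: 78 GB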

How MIG interacts with the rest of the stack

MIG profiles must be configured before any CUDA process starts; you cannot reconfigure mid-job. This means clusters running mixed MIG and non-MIG workloads need to drain a node, reconfigure, and bring it back. Kubernetes operators handle this automatically via the GPU Operator; manual setups need to script it.
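
With the GPU Operator, the reconfiguration is a node-label change picked up by its MIG manager; a sketch of the flow, with the node name as a placeholder and the label value drawn from the operator's mig-parted configuration:

  # Drain so no CUDA process is left on the node
  kubectl cordon gpu-node-01
  kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data

  # Ask the MIG manager to apply a new layout (the value must be a config
  # defined in the operator's mig-parted profiles, e.g. all-1g.10gb)
  kubectl label node gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite

  # When the MIG manager reports the new layout as applied, return the node
  kubectl uncordon gpu-node-01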

MIG slices are isolated, but the GPU still appears as one device to the kernel driver, so an nvidia-smi reset of the parent GPU resets all slices. SDC (silent data corruption) on the parent GPU can affect any slice. The fault domain is "the slice" only for compute and memory; for hardware-level events the fault domain is "the whole GPU." See silent data corruption for the operational angle.

The other partitioning option, MPS, is software-only and does not give hardware isolation. Pick MIG when isolation matters (multi-tenant, fault containment) and MPS when it does not (cooperative jobs from one team).

Practical guidance

  • Use MIG for inference serving, multi-tenant fine-tuning, and any workload where one job cannot fill the GPU.
  • Decide profile per workload before deploying; reconfiguration is a node-drain event.
  • Watch HBM bandwidth per slice; bandwidth scales with the slice's memory allocation (in 1/8 increments on an 80 GB part), so a 1g.10gb slice gets roughly 1/8 of total HBM bandwidth.
  • Combine with Kubernetes device plugins for fleet-scale deployment; do not configure MIG manually on more than a handful of nodes.

The takeaway: MIG is the right tool when one job cannot fill a GPU. It is not a magic doubling of capacity, but it is the difference between a 30% utilized fleet and an 80% utilized fleet for inference workloads. The unglamorous gain you get for one config command per node.

Updated 2026-05-10