GPU Partitioning Strategies

You can enable GPU support in your Kubernetes clusters to run AI/ML, data science, and media processing workloads. GPU support also lets you partition physical GPUs efficiently, maximizing resource utilization and reducing costs.

Learn more about how to create a virtualized cluster with GPU support by configuring GPU partitioning strategies and monitoring GPU resources.

GPU Partitioning Strategies

Before creating your GPU cluster, understand the three available partitioning strategies. Each strategy serves different workload requirements and resource efficiency needs.

Passthrough

Passthrough assigns an entire physical GPU directly to a single workload, bypassing any virtualization layer. This strategy delivers near-native performance because the workload has exclusive access to all GPU cores, memory, and processing power.

Use Passthrough when:

  • You need maximum GPU performance for intensive workloads
  • Your applications require exclusive GPU access
  • You run large-scale training jobs or high-performance computing tasks
  • Resource sharing isn't a priority

Limitations:

  • One workload per GPU, which can lead to underutilization
  • Higher cost per workload due to dedicated resource allocation
  • No ability to run multiple smaller workloads simultaneously
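
For illustration, here is a minimal pod spec that claims a whole GPU through the standard nvidia.com/gpu resource advertised by the NVIDIA device plugin; the pod name, image, and command are placeholders, not values your cluster requires:

    apiVersion: v1
    kind: Pod
    metadata:
      name: training-job                            # placeholder name
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3   # example CUDA-enabled image
          command: ["python", "train.py"]           # placeholder command
          resources:
            limits:
              nvidia.com/gpu: 1                     # one whole physical GPU, exclusive to this pod

Because the limit is expressed in whole GPUs, no other pod can be scheduled onto that device until this one finishes.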

MIG (Multi-Instance GPU)

MIG provides hardware-level partitioning that divides a single GPU into multiple isolated GPU instances. Each instance has dedicated streaming multiprocessors (SMs), memory, cache, and copy engines, ensuring complete isolation between workloads.

Use MIG when:

  • You need guaranteed resource isolation between workloads
  • Multiple teams or applications share the same physical GPU
  • You want to maximize GPU utilization while maintaining performance boundaries
  • Security and tenant isolation are critical requirements

Key features:

  • Each MIG instance appears as a separate GPU to applications
  • Memory and compute resources are physically partitioned
  • Profiles determine the size of each instance (1g, 2g, 3g, 4g, 7g configurations)
  • Available only on modern GPUs (Ampere architecture or later)

Example: An H100 (94 GB) GPU can be partitioned into one 4g.47gb instance and two 1g.24gb instances, using six of its seven compute slices and the full 94 GB of available memory.
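
As a sketch of how a workload then targets one of those instances, assume the cluster exposes MIG devices under the mixed strategy, where each profile appears as its own extended resource; the pod name and image below are placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: small-inference                           # placeholder name
    spec:
      containers:
        - name: app
          image: nvcr.io/nvidia/tritonserver:24.01-py3  # example image
          resources:
            limits:
              nvidia.com/mig-1g.24gb: 1               # one isolated 1g.24gb MIG instance

The pod sees only its own instance's SMs and memory, so a noisy neighbor on the same physical GPU cannot affect it.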

MIG is supported only on GPUs from the NVIDIA Ampere generation onward. Learn more about MIG-supported GPUs.

Time Slicing

Time Slicing multiplexes multiple workloads on a single GPU by granting each exclusive access for short time periods. The GPU's scheduler runs the workloads in turn, allowing multiple pods to share the same physical GPU resources.

Use Time Slicing when:

  • You have bursty or intermittent GPU workloads
  • Applications don't require continuous GPU access
  • You want to increase GPU utilization for development and testing
  • Cost optimization is more important than guaranteed performance

Important considerations:

  • No memory isolation between workloads
  • Performance depends on workload scheduling and resource contention
  • Best suited for inference workloads rather than training
  • You can configure 2 to 16 replicas per GPU (see the configuration sketch after the example below)

Example: A GPU configured with 4 time slices allows 4 different pods to run GPU workloads sequentially, with each pod getting shared access during its allocated time window.
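
One common way to set this up is the NVIDIA device plugin's time-slicing configuration, supplied as a ConfigMap. The sketch below assumes a GPU Operator install; the ConfigMap name, namespace, and data key are placeholders that depend on how your cluster references the config:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config     # placeholder name
      namespace: gpu-operator       # assumes the GPU Operator's namespace
    data:
      any: |-                       # config key the operator is pointed at
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4         # advertise 4 schedulable slots per physical GPU

With replicas: 4, each physical GPU is advertised as four nvidia.com/gpu resources, so up to four pods can bind to it; they share the GPU's memory and take turns on its compute engines, which is why the isolation caveats above still apply.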
