GPU Partitioning Strategies

You can enable GPU support in your Kubernetes clusters to run AI/ML, data science, and media processing workloads. GPU support also lets you partition physical GPUs efficiently, maximizing resource utilization and reducing costs.

Learn more about how to create a virtualized cluster with GPU support by configuring GPU partitioning strategies and monitoring GPU resources.

GPU Partitioning Strategies

Before creating your GPU cluster, understand the three available partitioning strategies. Each strategy serves different workload requirements and resource efficiency needs.

Passthrough

Passthrough assigns an entire physical GPU directly to a single workload, bypassing any virtualization layer. This strategy delivers near-native performance because the workload has exclusive access to all GPU cores, memory, and processing power.

Use Passthrough when:

  • You need maximum GPU performance for intensive workloads
  • Your applications require exclusive GPU access
  • You run large-scale training jobs or high-performance computing tasks
  • Resource sharing isn't a priority

Limitations:

  • One workload per GPU, which can lead to underutilization
  • Higher cost per workload due to dedicated resource allocation
  • No ability to run multiple smaller workloads simultaneously
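
For illustration, here is a minimal pod spec that claims a whole GPU through the standard nvidia.com/gpu resource advertised by the NVIDIA device plugin; the pod name, image, and command are placeholders, not values your cluster requires:

    apiVersion: v1
    kind: Pod
    metadata:
      name: training-job                            # placeholder name
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3   # example CUDA-enabled image
          command: ["python", "train.py"]           # placeholder command
          resources:
            limits:
              nvidia.com/gpu: 1                     # one whole physical GPU, exclusive to this pod

Because the limit is expressed in whole GPUs, no other pod can be scheduled onto that device until this one finishes.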

MIG (Multi-Instance GPU)

MIG provides hardware-level partitioning that divides a single GPU into multiple isolated GPU instances. Each instance has dedicated streaming multiprocessors (SMs), memory, cache, and copy engines, ensuring complete isolation between workloads.

Use MIG when:

  • You need guaranteed resource isolation between workloads
  • Multiple teams or applications share the same physical GPU
  • You want to maximize GPU utilization while maintaining performance boundaries
  • Security and tenant isolation are critical requirements

Key features:

  • Each MIG instance appears as a separate GPU to applications
  • Memory and compute resources are physically partitioned
  • Profiles determine the size of each instance (1g, 2g, 3g, 4g, 7g configurations)
  • Available only on modern GPUs (Ampere architecture or later)

Example: An H100 (94 GB) GPU can be partitioned into one 4g.47gb instance and two 1g.24gb instances, using six of its seven compute slices and the full 94 GB of available memory.
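
As a sketch of how a workload then targets one of those instances, assume the cluster exposes MIG devices under the mixed strategy, where each profile appears as its own extended resource; the pod name and image below are placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: small-inference                           # placeholder name
    spec:
      containers:
        - name: app
          image: nvcr.io/nvidia/tritonserver:24.01-py3  # example image
          resources:
            limits:
              nvidia.com/mig-1g.24gb: 1               # one isolated 1g.24gb MIG instance

The pod sees only its own instance's SMs and memory, so a noisy neighbor on the same physical GPU cannot affect it.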

MIG is supported only on GPUs from the NVIDIA Ampere generation onward. Learn more about MIG-supported GPUs.

Time Slicing

Time Slicing multiplexes multiple workloads on a single GPU by granting each exclusive access for short time periods. The GPU's scheduler runs the workloads in turn, allowing multiple pods to share the same physical GPU resources.

Use Time Slicing when:

  • You have bursty or intermittent GPU workloads
  • Applications don't require continuous GPU access
  • You want to increase GPU utilization for development and testing
  • Cost optimization is more important than guaranteed performance

Important considerations:

  • No memory isolation between workloads
  • Performance depends on workload scheduling and resource contention
  • Best suited for inference workloads rather than training
  • You can configure 2 to 16 replicas per GPU (see the configuration sketch after the example below)

Example: A GPU configured with 4 time slices allows 4 different pods to run GPU workloads sequentially, with each pod getting shared access during its allocated time window.
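
One common way to set this up is the NVIDIA device plugin's time-slicing configuration, supplied as a ConfigMap. The sketch below assumes a GPU Operator install; the ConfigMap name, namespace, and data key are placeholders that depend on how your cluster references the config:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config     # placeholder name
      namespace: gpu-operator       # assumes the GPU Operator's namespace
    data:
      any: |-                       # config key the operator is pointed at
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4         # advertise 4 schedulable slots per physical GPU

With replicas: 4, each physical GPU is advertised as four nvidia.com/gpu resources, so up to four pods can bind to it; they share the GPU's memory and take turns on its compute engines, which is why the isolation caveats above still apply.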
