GPU VM Cold Migration / Resize Fails: Insufficient Compute Resources

Problem

A GPU Passthrough VM transitions to the ERROR state during cold migration or resize. The operation fails even when the destination host has sufficient physical GPU resources available.

Environment

  • Self-Hosted Private Cloud Director Virtualization — v2025.10-180 and above

  • Private Cloud Director Virtualization — v2025.10-180 and above

  • Components: GPU and Compute Service

Diagnostics

  • Check the PCI configuration and verify whether the alias for the GPU passthrough device (type-PF) sets live_migratable:

    /opt/pf9/etc/nova/nova.conf
    [pci]
    alias = {"vendor_id": "[VENDOR_ID]", "product_id": "[PRODUCT_ID]", "device_type": "type-PF", "name": "[DEVICE_NAME]", "live_migratable": "yes"}

  • Check ostackhost.log on the destination compute host for the following log entries:

    /var/log/pf9/ostackhost.log
    INFO  nova.compute.claims   [instance: [INSTANCE_UUID]] Failed to claim: Claim pci failed
    ERROR nova.compute.manager  [instance: [INSTANCE_UUID]] Setting instance vm_state to ERROR: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed.

  • Enable debug logging and re-check ostackhost.log for the following entries:

    /var/log/pf9/ostackhost.log
    DEBUG nova.pci.stats  PCI claim: Starting with 1 devices, request count: 1
    DEBUG nova.pci.stats  PCI claim: Request spec: [{'vendor_id': '[VENDOR_ID]', 'product_id': '[PRODUCT_ID]','dev_type': 'type-PF', 'live_migratable': 'true'}]
    DEBUG nova.pci.utils  PCI attribute check: key=live_migratable, spec_value=true, device_value=None
    DEBUG nova.pci.utils  PCI attribute mismatch: live_migratable spec="true" != device="None"
    DEBUG nova.pci.stats  Dropped 1 device(s) due to mismatched PCI attribute(s)
    DEBUG nova.pci.stats  Not enough PCI devices left to satisfy request after spec filtering
    DEBUG nova.pci.stats  PCI claim: Available: 0, Required: 1

Key Indicator: The live_migratable attribute is present in the PCI request spec (true) but absent (None) on the devices in the available pool, so Nova drops every matching device and the claim fails.
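
The key indicator can also be checked mechanically. The following is a minimal sketch, assuming the debug lines have the same shape as the sample entries above; the function names are hypothetical and the log format on a given release may differ.

```python
import re

# Pattern modeled on the sample debug line in this article; the exact
# log format is an assumption and may vary between releases.
MISMATCH_RE = re.compile(
    r'PCI attribute mismatch: (?P<key>\w+) '
    r'spec="(?P<spec>[^"]*)" != device="(?P<device>[^"]*)"'
)


def find_pci_mismatches(log_text):
    """Return (key, spec_value, device_value) for each mismatch line found."""
    return [
        (m.group("key"), m.group("spec"), m.group("device"))
        for m in MISMATCH_RE.finditer(log_text)
    ]


def has_live_migratable_indicator(log_text):
    """True when live_migratable is requested but missing on the device."""
    return any(
        key == "live_migratable" and device == "None"
        for key, _spec, device in find_pci_mismatches(log_text)
    )


sample = '''DEBUG nova.pci.utils  PCI attribute mismatch: live_migratable spec="true" != device="None"
DEBUG nova.pci.stats  Dropped 1 device(s) due to mismatched PCI attribute(s)'''

print(has_live_migratable_indicator(sample))
```

To use it on a host, read the contents of /var/log/pf9/ostackhost.log into a string and pass it to has_live_migratable_indicator.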

Cause

This is a regression in v2025.10-180, which added the live_migratable=yes flag to the PCI alias configuration (nova.conf) for GPU passthrough devices (type-PF). The flag is only valid for type-VF (vGPU / SR-IOV Virtual Function) devices that support live migration via kernel variant drivers.

For type-PF (PCI Passthrough / Physical Function) devices, live migration is NOT supported. When live_migratable=yes is set in the alias, Nova's PCI filter includes this attribute in the request spec. Because the devices in the physical device pool do not carry the live_migratable attribute, the comparison fails and all available GPU devices are dropped, resulting in Available: 0.
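
The effect of the extra attribute can be illustrated with a simplified model. This sketch is not Nova's actual implementation; it only assumes that a device is kept when every key in the request spec has an equal value on the device. The function names and placeholder IDs are hypothetical.

```python
def matches_spec(device, spec):
    """A device matches only when it carries every spec key with an equal value."""
    return all(device.get(key) == value for key, value in spec.items())


def filter_pool(pool, spec):
    """Keep the devices that satisfy the whole request spec."""
    return [device for device in pool if matches_spec(device, spec)]


# Hypothetical pool entry: placeholder IDs, and no live_migratable key at all.
pool = [{"vendor_id": "VENDOR_ID", "product_id": "PRODUCT_ID", "dev_type": "type-PF"}]

base_spec = {"vendor_id": "VENDOR_ID", "product_id": "PRODUCT_ID", "dev_type": "type-PF"}
spec_with_flag = dict(base_spec, live_migratable="true")

print(len(filter_pool(pool, spec_with_flag)))  # 0: device.get(...) is None, not "true"
print(len(filter_pool(pool, base_spec)))       # 1: the same device is claimable
```

In this model a single unmatched key is enough to empty the pool, which mirrors the "Available: 0, Required: 1" line in the debug log.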

Affected Configuration

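The problematic configuration is the [pci] alias entry in /opt/pf9/etc/nova/nova.conf on GPU compute hosts in v2025.10-180. It combines "device_type": "type-PF" with "live_migratable": "yes", as shown in the Diagnostics section.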

Attribute Mismatch Flow

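During the failure, Nova builds the PCI request spec from the alias, so the spec includes live_migratable (see the "Request spec" debug line in the Diagnostics section).

However, the entries in the device pool do not carry a live_migratable attribute; the debug log reports device_value=None for it.

As a result, the attribute comparison fails, the device is dropped from the candidate list, and the claim ends with Available: 0, Required: 1.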

Resolution

  • Live migration for GPU passthrough (type-PF) is not yet supported. The live_migratable attribute is reserved for future type-VF (vGPU) support, which is tracked in PCD-6249.

Workaround

1. Remove device_spec from nova.conf under the [pci] section.

   Only remove the device_spec line; do not delete the entire [pci] section.

2. Add the corrected PCI configuration to nova-override.conf (without live_migratable).

   Do NOT include live_migratable in the alias.

3. Remove any existing pci.conf from conf.d, if present.

4. Restart the Compute Service.

5. Verify the Hostagent convergence state.

6. Validate cold migration and resize.

   After the service restarts successfully, test the following:

  • Cold migration of a GPU VM between hosts

  • GPU flavor resize (e.g., 8 GPU → 4 GPU)
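
For the corrected alias in the nova-override.conf step, the value can be derived from the existing alias rather than retyped. The following is a minimal sketch, assuming the alias value is a single JSON object as in the Diagnostics section; the function name is hypothetical and the bracketed values are the article's placeholders, not real IDs.

```python
import json


def strip_live_migratable(alias_value):
    """Return the alias JSON with the live_migratable key removed, if present."""
    alias = json.loads(alias_value)
    alias.pop("live_migratable", None)  # no error when the key is already absent
    return json.dumps(alias)


# Placeholder alias copied from the Diagnostics section of this article.
current = (
    '{"vendor_id": "[VENDOR_ID]", "product_id": "[PRODUCT_ID]", '
    '"device_type": "type-PF", "name": "[DEVICE_NAME]", "live_migratable": "yes"}'
)

corrected = strip_live_migratable(current)
print("alias = " + corrected)
```

The printed line keeps every other key unchanged, so it is only the live_migratable attribute that differs from the original alias.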

Validation

To confirm the fix is applied, enable debug logging and re-check ostackhost.log to verify that the PCI claim no longer includes live_migratable in the request spec. A successful claim should show no live_migratable attribute mismatch and no dropped devices.

Compare with the failure pattern in the Diagnostics section.

Additional Information

  • This issue affects all GPU PCI passthrough configurations using type-PF in v2025.10-180 — not just NVIDIA H200.

