GPU VM Resize and Cold Migration Fail with ReshapeFailed

Problem

GPU Passthrough VM resize and cold migration operations fail intermittently even when the destination compute host has sufficient physical GPU capacity. The Nova scheduler filters out all available GPU hosts, and the VM may enter ERROR state or remain stuck in a migration state.

Environment

  • Self-Hosted Private Cloud Director Virtualization: v2025.10-180 and later.

  • Private Cloud Director Virtualization: v2025.10-180 and later.

  • Components: GPU and Compute Service.

Diagnostics

1

Check for ReshapeFailed in ostackhost.log on the source compute host:

/var/log/pf9/ostackhost.log
CRITICAL: nova.compute.manager  Resource provider data migration failed fatally during startup
for node [HOSTNAME].: nova.exception.ReshapeFailed: Resource provider inventory and
allocation data migration failed: {"errors": [{"status": 409, "title": "Conflict",
"detail": "Unable to create allocation for 'CUSTOM_PCI_[PRODUCT_ID]' on resource provider
'[RP_UUID]'. The requested amount would exceed the capacity."}]}

ERROR: oslo_service.backend.eventlet.service  Error starting thread: nova.exception.ReshapeFailed
2

Check whether the PF device is still reported as attached to a guest:

/var/log/pf9/ostackhost.log
DEBUG nova.virt.libvirt.host  Cannot get MAC address of the PF [PCI_ADDRESS_1].It is probably attached to a guest already
DEBUG nova.virt.libvirt.host  Cannot get MAC address of the PF [PCI_ADDRESS_2].It is probably attached to a guest already
3

Check for a Placement generation conflict:

/var/log/pf9/ostackhost.log
resource provider generation conflict for provider [RP_UUID]: actual: [ACTUAL_GEN], given: [GIVEN_GEN]
4

Check whether the scheduler is filtering out all GPU hosts:

/var/log/pf9/ostackhost.log
Filter PciPassthroughFilter returned 0 hosts
5

Check for an instance stuck in the error state with a pending delete (vm_state='error', task_state='deleting'):

/var/log/pf9/ostackhost.log
stats={'failed_builds': '[VALUE]', 'num_instances': '[VALUE]', 'num_vm_stopped': '[VALUE]',
       'num_vm_active': '[VALUE]', 'num_vm_error': '[VALUE]'}

vm_state='error', task_state='deleting'
6

Check for stale allocations in the compute service log:

/var/log/pf9/ostackhost.log
Instance [INSTANCE_UUID] has allocations against this compute host but is not found in the database.
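
The six log checks above can be run in one pass. The following is a minimal sketch, assuming a POSIX shell and the log path shown in this article. The function name scan_ostackhost_log is hypothetical, and the search strings are copied from the log excerpts above.

```shell
# scan_ostackhost_log [LOGFILE]
# Prints how many lines in the log match each failure signature.
# Hypothetical helper; defaults to the log path used in this article.
scan_ostackhost_log() {
  log="${1:-/var/log/pf9/ostackhost.log}"
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 1
  fi
  for pattern in \
    "nova.exception.ReshapeFailed" \
    "It is probably attached to a guest already" \
    "resource provider generation conflict" \
    "Filter PciPassthroughFilter returned 0 hosts" \
    "task_state='deleting'" \
    "has allocations against this compute host but is not found in the database"
  do
    # grep -c exits non-zero when the count is 0, so ignore its status.
    count=$(grep -cF -- "$pattern" "$log" || true)
    printf '%6s  %s\n' "$count" "$pattern"
  done
}

# Usage on the source compute host:
#   scan_ostackhost_log
#   scan_ostackhost_log /path/to/a/copied/ostackhost.log
```

A non-zero count for any signature points to the matching diagnostic step; the log lines themselves still need to be read to collect the resource provider and instance UUIDs.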

Cause

When a GPU VM resize or cold migration fails mid-workflow, Nova does not always clean up the Placement allocations tied to the failed operation. This results in:

  • Stale consumer allocations remaining on PCI child resource providers (e.g., CUSTOM_PCI_[VENDOR_ID]_[PRODUCT_ID]) for instance UUIDs that no longer exist in the Nova database.

  • Stale PCI child resource providers that are no longer valid but still registered in Placement, holding consumed inventory.

  • On the next nova-compute startup, Nova attempts a Placement reshape (inventory + allocation migration). If stale allocations exceed the declared inventory capacity, Placement returns HTTP 409 Conflict, causing ReshapeFailed and preventing the compute service from starting.

  • A stuck instance in vm_state='error', task_state='deleting' can continue to hold GPU/PCI allocations in Placement indefinitely, blocking all subsequent scheduling for the affected resource class.

Contributing Factors:

  • Failed resize/migration workflows (especially multi-GPU flavor changes) do not roll back Placement allocations atomically.

  • Placement generation conflicts during concurrent updates can leave allocations in an inconsistent state.

  • The nova-manage placement audit tool reliably detects orphans only for VCPU and memory resources, so it may miss PCI child RP orphans.

  • Compute service restart after a failed reshape loop can compound the stale state.
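
To illustrate why stale allocations produce the 409 described above: Placement computes capacity as (total - reserved) * allocation_ratio and rejects any allocation that would push usage beyond it. The sketch below assumes an allocation ratio of 1, which is typical for whole PCI devices; the function name exceeds_capacity and the numbers in the usage comment are hypothetical.

```shell
# exceeds_capacity TOTAL RESERVED USED REQUESTED
# Succeeds (exit 0) when USED + REQUESTED would exceed capacity.
# Assumes allocation_ratio = 1, so capacity = TOTAL - RESERVED.
exceeds_capacity() {
  capacity=$(( $1 - $2 ))
  [ $(( $3 + $4 )) -gt "$capacity" ]
}

# Usage: a host exposes 2 GPUs, a stale consumer still holds 1,
# and the reshape tries to re-create a live instance's allocation of 2.
#   exceeds_capacity 2 0 1 2 && echo "Placement would return 409 Conflict"
```

With the stale allocation removed (USED = 0), the same request fits within capacity and the reshape can proceed.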

Resolution

Option A — Restart compute service to re-register PCI RPs cleanly

After stale allocations have been cleaned up (see Option B or Option C), restart the compute service so that Nova re-registers the PCI child resource providers with correct inventory.

1

Restart the compute service

2

Verify no 409 errors on startup
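
The two steps above can be sketched as follows. This assumes the compute service runs as the systemd unit pf9-ostackhost (the service name used later in this article) and that startup errors are written to /var/log/pf9/ostackhost.log; the function name and the default wait time are hypothetical.

```shell
# restart_compute_and_check [LOGFILE] [WAIT_SECONDS]
# Restarts the compute service, then inspects only the log lines written
# after the restart for ReshapeFailed or 409 errors.
restart_compute_and_check() {
  log="${1:-/var/log/pf9/ostackhost.log}"
  wait_seconds="${2:-60}"
  # Remember the current log length so older errors are ignored.
  before=$(wc -l < "$log")
  sudo systemctl restart pf9-ostackhost
  sleep "$wait_seconds"
  if tail -n "+$((before + 1))" "$log" | grep -E 'ReshapeFailed|"status": 409'; then
    echo "Startup still reports reshape/409 errors" >&2
    return 1
  fi
  echo "No ReshapeFailed or 409 errors logged since restart"
}

# Usage on the affected compute host:
#   restart_compute_and_check
```

If the log is rotated during the wait, the line-count comparison is no longer valid, so re-check the log manually in that case.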

Option B — Use nova-manage to clean orphaned allocations

Note: nova-manage placement audit may not detect all PCI child RP orphans. If the issue persists after this step, proceed to Option C (manual Placement API cleanup for stale PCI child resource providers).
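
A minimal sketch of Option B, using the upstream nova-manage placement audit options (--verbose, --delete, --resource_provider). Where nova-manage is available in a given deployment is not stated in this article, so the NOVA_MANAGE variable is a hypothetical hook for supplying the full invocation; the function names are also hypothetical.

```shell
# Command used to invoke nova-manage; override if it must be run
# through a wrapper in your environment (assumption).
NOVA_MANAGE="${NOVA_MANAGE:-nova-manage}"

# audit_orphaned_allocations [RP_UUID]
# Lists orphaned allocations without changing anything.
audit_orphaned_allocations() {
  $NOVA_MANAGE placement audit --verbose ${1:+--resource_provider "$1"}
}

# delete_orphaned_allocations [RP_UUID]
# Deletes the orphaned allocations that the audit finds.
delete_orphaned_allocations() {
  $NOVA_MANAGE placement audit --verbose --delete ${1:+--resource_provider "$1"}
}

# Usage: review the audit output first, then delete.
#   audit_orphaned_allocations
#   delete_orphaned_allocations
```

Running the read-only audit before the delete makes it possible to confirm that every listed consumer is really gone from Nova.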

Option C — Manual Placement API cleanup for stale PCI child resource providers

Use this method when nova-manage does not resolve the issue, or when the stale entry is a PCI child resource provider (not just a consumer allocation).

1

Get the Placement endpoint

2

Get an auth token

3

Inspect allocations on the stale PCI resource provider and note each consumer UUID (CONSUMER_UUID)

4

For each consumer UUID in the output, verify it is stale

Expected output for a stale consumer: No server with a name or ID...

5

Delete the stale consumer allocation

6

Verify usage is now zero on the RP

7

Delete the stale child resource provider

After the cleanup, Nova will re-register the PCI child resource provider cleanly on the next pf9-ostackhost restart.
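
The seven steps above can be sketched with the openstack CLI and curl against the standard Placement REST endpoints. This is an illustration under stated assumptions, not the exact commands from the original procedure: admin credentials are assumed to be loaded in the shell, the Placement endpoint is assumed to be published on the public interface, and all function names plus the RP_UUID and CONSUMER_UUID variables are hypothetical placeholders.

```shell
# Steps 1 and 2: look up the Placement endpoint and an auth token.
placement_url() {
  openstack endpoint list --service placement --interface public -f value -c URL
}
auth_token() {
  openstack token issue -f value -c id
}

# placement METHOD PATH: call the Placement API using PLACEMENT_URL and TOKEN.
placement() {
  curl -sS -X "$1" -H "X-Auth-Token: ${TOKEN}" \
       -H "Accept: application/json" "${PLACEMENT_URL}$2"
}

# Step 3: list allocations (keyed by consumer UUID) on a resource provider.
list_rp_allocations() { placement GET "/resource_providers/$1/allocations"; }

# Step 4: a consumer is treated as stale when Nova reports no such server.
consumer_is_stale() {
  openstack server show "$1" 2>&1 | grep -q "No server with a name or ID"
}

# Step 5: remove every allocation held by one consumer.
delete_consumer_allocations() { placement DELETE "/allocations/$1"; }

# Step 6: show current usage on the resource provider.
show_rp_usages() { placement GET "/resource_providers/$1/usages"; }

# Step 7: delete the stale child resource provider itself.
delete_rp() { placement DELETE "/resource_providers/$1"; }

# Usage:
#   PLACEMENT_URL=$(placement_url); TOKEN=$(auth_token)
#   list_rp_allocations "$RP_UUID"
#   consumer_is_stale "$CONSUMER_UUID" && delete_consumer_allocations "$CONSUMER_UUID"
#   show_rp_usages "$RP_UUID"      # every usage value should be 0
#   delete_rp "$RP_UUID"
```

Before deleting, also confirm that the consumer UUID does not belong to a resize or migration that is still in progress, because Nova holds the source-host allocations of an in-flight move under the migration record's UUID rather than the instance UUID.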

Validation

After performing the cleanup steps, validate in the following order:

1

Verify compute services are up on all GPU hosts

2

Verify PCI inventory is correctly reported

3

Check allocation candidates for a GPU flavor

4

Test normal VM cold migration (non-GPU) first.

5

Test GPU VM cold migration.

6

Test GPU VM flavor resize.
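
The validation sequence above might look like the following. This sketch assumes the osc-placement plugin is installed (it provides the resource provider and allocation candidate commands) and a python-openstackclient version that supports server resize confirm; the function names and all arguments are hypothetical placeholders.

```shell
# gpu_validation_checks RP_UUID RESOURCE_CLASS
# Steps 1-3: service state, PCI inventory, and allocation candidates.
gpu_validation_checks() {
  openstack compute service list --service nova-compute &&
  openstack resource provider inventory list "$1" &&
  openstack --os-placement-api-version 1.10 \
    allocation candidate list --resource "$2=1"
}

# cold_migrate_and_confirm SERVER
# Steps 4-5: cold migrate a server, then confirm the move.
cold_migrate_and_confirm() {
  openstack server migrate --wait "$1" &&
  openstack server resize confirm "$1"
}

# resize_and_confirm SERVER FLAVOR
# Step 6: resize a server to another flavor, then confirm.
resize_and_confirm() {
  openstack server resize --flavor "$2" --wait "$1" &&
  openstack server resize confirm "$1"
}

# Usage, in the order given above:
#   gpu_validation_checks "$RP_UUID" "$GPU_RESOURCE_CLASS"
#   cold_migrate_and_confirm "$NON_GPU_SERVER"
#   cold_migrate_and_confirm "$GPU_SERVER"
#   resize_and_confirm "$GPU_SERVER" "$NEW_GPU_FLAVOR"
```

An empty allocation candidate list after cleanup suggests that inventory or stale allocations still need attention before the migration tests are attempted.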

Additional Information

  • This is a known product gap tracked in PCD-6116. Currently, there is no automated reconciliation for stale PCI child resource providers. The manual cleanup steps described in this KB are the supported recovery method until the product fix is available.

  • The product fix will enable automated or controller-level cleanup of stale Placement allocations and PCI child resource providers.
