GPU VM Resize and Cold Migration Fail with ReshapeFailed
Problem
Environment
Diagnostics
1
CRITICAL: nova.compute.manager Resource provider data migration failed fatally during startup
for node [HOSTNAME].: nova.exception.ReshapeFailed: Resource provider inventory and
allocation data migration failed: {"errors": [{"status": 409, "title": "Conflict",
"detail": "Unable to create allocation for 'CUSTOM_PCI_[PRODUCT_ID]' on resource provider
'[RP_UUID]'. The requested amount would exceed the capacity."}]}
ERROR: oslo_service.backend.eventlet.service Error starting thread: nova.exception.ReshapeFailed
2
DEBUG nova.virt.libvirt.host Cannot get MAC address of the PF [PCI_ADDRESS_1]. It is probably attached to a guest already
DEBUG nova.virt.libvirt.host Cannot get MAC address of the PF [PCI_ADDRESS_2]. It is probably attached to a guest already
3
resource provider generation conflict for provider [RP_UUID]: actual: [ACTUAL_GEN], given: [GIVEN_GEN]
4
Filter PciPassthroughFilter returned 0 hosts
5
stats={'failed_builds': '[VALUE]', 'num_instances': '[VALUE]', 'num_vm_stopped': '[VALUE]',
'num_vm_active': '[VALUE]', 'num_vm_error': '[VALUE]'}
vm_state='error', task_state='deleting'
6
Instance [INSTANCE_UUID] has allocations against this compute host but is not found in the database.
Cause
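The 409 in the first diagnostic says the reshape asked Placement for more CUSTOM_PCI_[PRODUCT_ID] than the provider can hold. As a rough illustration of that check, the sketch below applies the Placement capacity rule, (total - reserved) * allocation_ratio, to one inventory record. The function names and the sample numbers are hypothetical; only the field names follow the Placement inventory format.

```python
from typing import Dict


def capacity(inventory: Dict[str, float]) -> int:
    """Usable capacity of one resource class on one provider."""
    return int(
        (inventory["total"] - inventory["reserved"]) * inventory["allocation_ratio"]
    )


def would_exceed(inventory: Dict[str, float], used: int, requested: int) -> bool:
    """True when adding `requested` units on top of `used` overflows capacity."""
    return used + requested > capacity(inventory)


# Hypothetical inventory record, shaped like one entry of the `inventories`
# object returned by GET /resource_providers/{uuid}/inventories.
pci_inventory = {"total": 4, "reserved": 0, "allocation_ratio": 1.0}

print(capacity(pci_inventory))                           # 4
print(would_exceed(pci_inventory, used=4, requested=1))  # True
print(would_exceed(pci_inventory, used=3, requested=1))  # False
```

If leftover allocations from deleted or errored instances already fill the provider, any further allocation fails this check, which would be consistent with the orphaned-allocation warning in step 6.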
Resolution
Option A — Restart compute service to re-register PCI RPs cleanly
Option B — Use nova-manage to clean orphaned allocations
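The original steps for this option are not preserved here. Assuming the heading refers to `nova-manage placement audit` (which reports orphaned allocations with `--verbose` and removes them with `--delete`), the sketch below shows the idea behind such an audit: any consumer holding allocations on the provider that is neither a known instance nor an in-progress migration is treated as orphaned. The function name and sample identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def find_orphaned_consumers(
    allocations: Dict[str, dict], known_consumers: Iterable[str]
) -> List[str]:
    """Return consumer UUIDs that hold allocations but are not known to Nova."""
    known = set(known_consumers)
    return sorted(consumer for consumer in allocations if consumer not in known)


# Hypothetical data shaped like the `allocations` object returned by
# GET /resource_providers/{uuid}/allocations (keys are consumer UUIDs).
allocations = {
    "instance-aaa": {"resources": {"CUSTOM_PCI_EXAMPLE": 1}},
    "instance-bbb": {"resources": {"CUSTOM_PCI_EXAMPLE": 1}},
    "instance-gone": {"resources": {"CUSTOM_PCI_EXAMPLE": 2}},
}
# Hypothetical list of instance and migration UUIDs still present in the database.
known = ["instance-aaa", "instance-bbb"]

print(find_orphaned_consumers(allocations, known))  # ['instance-gone']
```

Review the reported consumers before deleting anything; an instance that is mid-migration can look orphaned if the migration UUID is left out of the known set.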
Option C — Manual Placement API cleanup for stale PCI child resource providers
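The individual cleanup steps are not preserved here, so the following is only a planning sketch under stated assumptions. Placement rejects deleting a resource provider that still has allocations or child providers, so it helps to work out beforehand which stale providers can be removed and in what order. The function name, the `stale_uuids` input (chosen by the operator) and the sample identifiers are hypothetical; the `uuid` and `parent_provider_uuid` fields follow the Placement provider listing.

```python
from typing import Dict, List, Set, Tuple


def plan_deletions(
    providers: List[dict], stale_uuids: Set[str], allocations_by_rp: Dict[str, dict]
) -> Tuple[List[str], Dict[str, str]]:
    """Return (providers safe to delete, deepest first; blocked providers with reason)."""
    parent_of = {p["uuid"]: p.get("parent_provider_uuid") for p in providers}

    def depth(uuid: str) -> int:
        d = 0
        while parent_of.get(uuid):
            uuid = parent_of[uuid]
            d += 1
        return d

    # A provider with remaining allocations cannot be deleted yet.
    blocked = {u: "still has allocations" for u in stale_uuids if allocations_by_rp.get(u)}

    # Walk deepest-first so a blocked or kept child also blocks its stale parent.
    for uuid in sorted(sorted(parent_of), key=depth, reverse=True):
        parent = parent_of[uuid]
        if parent in stale_uuids and (uuid not in stale_uuids or uuid in blocked):
            blocked.setdefault(parent, "has a child that is not being deleted")

    candidates = sorted(u for u in stale_uuids if u not in blocked)
    return sorted(candidates, key=depth, reverse=True), blocked


# Hypothetical tree shaped like GET /resource_providers?in_tree={root_uuid}.
providers = [
    {"uuid": "compute-rp", "parent_provider_uuid": None},
    {"uuid": "pci-a", "parent_provider_uuid": "compute-rp"},
    {"uuid": "pci-b", "parent_provider_uuid": "compute-rp"},
]
allocations_by_rp = {"pci-b": {"consumer-1": {"resources": {"CUSTOM_PCI_EXAMPLE": 1}}}}

print(plan_deletions(providers, {"pci-a", "pci-b"}, allocations_by_rp))
# (['pci-a'], {'pci-b': 'still has allocations'})
```

Blocked providers need their allocations resolved first, for example by the orphan cleanup in Option B, before they can be deleted.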
Validation
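One way to validate the fix is to confirm that the failure signatures listed under Diagnostics no longer appear in fresh nova-compute and nova-scheduler logs. The sketch below counts those signatures in a list of log lines; the signature names and the `scan` helper are hypothetical, and the patterns are taken from the log excerpts in this article. Where the lines come from depends on your deployment.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns copied from the diagnostic excerpts in this article.
SIGNATURES = {
    "reshape_failed": re.compile(r"nova\.exception\.ReshapeFailed"),
    "generation_conflict": re.compile(r"resource provider generation conflict"),
    "pci_filter_no_hosts": re.compile(r"Filter PciPassthroughFilter returned 0 hosts"),
    "orphaned_allocation": re.compile(
        r"has allocations against this compute host but is not found in the database"
    ),
}


def scan(lines: Iterable[str]) -> Counter:
    """Count how many lines match each known failure signature."""
    hits = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name] += 1
    return hits


# Hypothetical sample; in practice, pass lines written after the fix was applied.
sample = [
    "ERROR: oslo_service.backend.eventlet.service Error starting thread: nova.exception.ReshapeFailed",
    "Filter PciPassthroughFilter returned 0 hosts",
    "INFO nova.compute.manager unrelated line",
]

print(dict(scan(sample)))  # {'reshape_failed': 1, 'pci_filter_no_hosts': 1}
```

An empty result on logs written after the service restart suggests the reshape completed and the scheduler can see PCI capacity again; it does not replace retrying the resize or cold migration itself.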
Additional Information
