Migrated VMs are Showing File System Errors
Problem
Multiple virtual machines (VMs) failed to boot due to file system corruption on their boot volumes. Both Windows and Linux VMs exhibited boot errors that pointed to file system inconsistencies; a typical console from an affected VM is shown below.
```
BdsDxe: loading Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x5,0x0)
BdsDxe: starting Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x5,0x0)
```
Environment
- Private Cloud Director Virtualization - v2025.7 & v2025.8
- Self-Hosted Private Cloud Director Virtualization - v2025.7 & v2025.8
- Component: VMHA
Workaround
- Identify Impacted VMs:
- Review VM boot console output for file system errors or stuck boot sequences.
- Cross-check recent migrations or evacuations triggered by VMHA in the ostackhost logs (/var/log/pf9/ostackhost.logs) on the source host. Example entries, followed by a grep sketch, are shown below:
```
INFO nova.compute.manager [req-UUID masakari services] [instance: [INSTANCE_ID]] Evacuating instance
INFO nova.compute.claims [req-UUID masakari services] [instance: [INSTANCE_ID]] Claim successful on node <COMPUTE_HOST>
INFO nova.compute.resource_tracker [req-UUID masakari services] [instance: [INSTANCE_ID]] Updating resource usage from migration <MIGRATION_ID>
ERROR nova.scheduler.client.report [req-UUID masakari services] [req-UUID] Failed to update inventory to [{'MEMORY_MB': {'total': 2063836, 'min_unit': 1, 'max_unit': 2063836, 'step_size': 1, 'allocation_ratio': 1.5, 'reserved': 512}, 'VCPU': {'total': 128, 'min_unit': 1, 'max_unit': 128, 'step_size': 1, 'allocation_ratio': 16.0, 'reserved': 0}, 'DISK_GB': {'total': 86809, 'min_unit': 1, 'max_unit': 86809, 'step_size': 1, 'allocation_ratio': 9999.0, 'reserved': 0}}] for resource provider with UUID <UUID>. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n resource provider generation conflict ", "code": "placement.concurrent_update", "request_id": "req-UUID"}]}
ERROR nova.scheduler.client.report [req-UUID masakari services] [req-UUID] Failed to update inventory to [{'MEMORY_MB': {'total': 2063836, 'min_unit': 1, 'max_unit': 2063836, 'step_size': 1, 'allocation_ratio': 1.5, 'reserved': 512}, 'VCPU': {'total': 128, 'min_unit': 1, 'max_unit': 128, 'step_size': 1, 'allocation_ratio': 16.0, 'reserved': 0}, 'DISK_GB': {'total': 86808, 'min_unit': 1, 'max_unit': 86808, 'step_size': 1, 'allocation_ratio': 9999.0, 'reserved': 0}}] for resource provider with UUID <UUID>. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n resource provider generation conflict ", "code": "placement.concurrent_update", "request_id": "req-UUID"}]}
```
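If console access is limited, a quick way to surface these events is to search the source host's ostackhost logs for the evacuation and placement-conflict messages shown above. This is a minimal sketch; the log glob and grep patterns are assumptions based on the excerpt and may need adjusting for your deployment:

```bash
# On the source host: list evacuation, claim, migration, and placement-conflict
# entries recorded by nova-compute (ostackhost). Adjust the glob if your
# deployment names or rotates these logs differently.
sudo grep -hE "Evacuating instance|Claim successful|Updating resource usage from migration|placement.concurrent_update" \
  /var/log/pf9/ostackhost.log*

# Narrow the results to a single VM once you have its UUID (placeholder below).
sudo grep -h "<INSTANCE_ID>" /var/log/pf9/ostackhost.log* | grep -E "Evacuating|migration"
```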
- Temporarily Disable VMHA:
- Disable VM High Availability (VMHA) across the affected clusters to prevent automatic evacuations.
- Confirm there are no ongoing evacuation events in the HA agent logs (/var/log/pf9/ha/ha-agent.log) on the source host.
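A simple way to confirm the agent has gone quiet is to watch the tail of the HA agent log on the source host. The grep pattern below is an assumption about the message wording, not a documented format:

```bash
# Follow the HA agent log and highlight anything that looks like an
# evacuation or host-down event; stop (Ctrl+C) once no new matches appear.
sudo tail -n 200 -f /var/log/pf9/ha/ha-agent.log | grep -iE "evacuat|host.*down"
```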
- Recover the VMs:
- For Linux VMs: Boot into recovery mode and run fsck -y /dev/<BOOT-VOLUME> (see the sketch after this list).
- For Windows VMs: Mount the boot volume to a helper instance, run a file system checker tool (such as chkdsk) and restore critical registry files as needed, then reattach the volume to the original VM.
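For the Linux path, a minimal recovery sketch is shown below. It assumes a recovery/rescue shell on the affected VM; <BOOT-VOLUME> is a placeholder for the actual boot device reported by lsblk:

```bash
# From a recovery/rescue shell attached to the affected VM:
lsblk -f                                        # identify the boot volume and its file system
umount /dev/<BOOT-VOLUME> 2>/dev/null || true   # fsck must run against an unmounted volume
fsck -y /dev/<BOOT-VOLUME>                      # repair, answering "yes" to all prompts
reboot                                          # boot normally and verify the VM comes up clean
```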
- Validate Cinder and NFS Mounts:
- Check all Cinder hosts for correct NFS mount configurations as per the cluster blueprints.
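A hedged spot-check for each Cinder host is sketched below; <NFS_SERVER> and <EXPORT_PATH> are placeholders whose values should come from the cluster blueprint:

```bash
# List all active NFS mounts on the Cinder host and compare them with
# the cluster blueprint.
findmnt -t nfs,nfs4

# Confirm the expected export is mounted with the expected options.
grep "<NFS_SERVER>:<EXPORT_PATH>" /proc/mounts

# Optionally (requires nfs-utils), confirm the server still exposes the export.
showmount -e <NFS_SERVER>
```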
Cause
VMHA (Virtual Machine High Availability) evacuation events triggered by network connectivity issues resulted in multiple VM instances simultaneously accessing the same storage volumes, causing data corruption.
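One way to confirm this failure mode during an incident, sketched under the assumption that libvirt/QEMU is the hypervisor layer, is to check whether the same instance is left running on both the source and destination hosts after an evacuation; <INSTANCE_ID> is a placeholder:

```bash
# Run on both the source and the destination host.
# If the same instance UUID is listed as running on both, two QEMU
# processes have the same backing volume open.
sudo virsh list --uuid | grep -i "<INSTANCE_ID>"
ps aux | grep "[q]emu" | grep -i "<INSTANCE_ID>"
```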
Resolution
The long-term fix is available in the PCD October release (tracked under PCD-4211 and PCD-4212).