Migrated VMs are Showing File System Errors
Problem
Multiple virtual machines (VMs) failed to boot due to file system corruption on their boot volumes. Both Windows and Linux VMs exhibited boot errors that pointed to file system inconsistencies; a typical console from an affected VM is shown below.
```
BdsDxe: loading Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x5,0x0)
BdsDxe: starting Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x5,0x0)
```
Environment
- Private Cloud Director Virtualization - v2025.7 & v2025.8
- Self-Hosted Private Cloud Director Virtualization - v2025.7 & v2025.8
- Component: VMHA
Workaround
- Identify Impacted VMs:
- Review VM boot console output for file system errors or stuck boot sequences.
- Cross-check recent migrations or evacuations triggered by VMHA in the ostackhost logs (/var/log/pf9/ostackhost.logs) on the source host. Example entries, followed by a grep sketch, are shown below:
```
INFO nova.compute.manager [req-UUID masakari services] [instance: [INSTANCE_ID]] Evacuating instance
INFO nova.compute.claims [req-UUID masakari services] [instance: [INSTANCE_ID]] Claim successful on node <COMPUTE_HOST>
INFO nova.compute.resource_tracker [req-UUID masakari services] [instance: [INSTANCE_ID]] Updating resource usage from migration <MIGRATION_ID>
ERROR nova.scheduler.client.report [req-UUID masakari services] [req-UUID] Failed to update inventory to [{'MEMORY_MB': {'total': 2063836, 'min_unit': 1, 'max_unit': 2063836, 'step_size': 1, 'allocation_ratio': 1.5, 'reserved': 512}, 'VCPU': {'total': 128, 'min_unit': 1, 'max_unit': 128, 'step_size': 1, 'allocation_ratio': 16.0, 'reserved': 0}, 'DISK_GB': {'total': 86809, 'min_unit': 1, 'max_unit': 86809, 'step_size': 1, 'allocation_ratio': 9999.0, 'reserved': 0}}] for resource provider with UUID <UUID>. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n resource provider generation conflict ", "code": "placement.concurrent_update", "request_id": "req-UUID"}]}
ERROR nova.scheduler.client.report [req-UUID masakari services] [req-UUID] Failed to update inventory to [{'MEMORY_MB': {'total': 2063836, 'min_unit': 1, 'max_unit': 2063836, 'step_size': 1, 'allocation_ratio': 1.5, 'reserved': 512}, 'VCPU': {'total': 128, 'min_unit': 1, 'max_unit': 128, 'step_size': 1, 'allocation_ratio': 16.0, 'reserved': 0}, 'DISK_GB': {'total': 86808, 'min_unit': 1, 'max_unit': 86808, 'step_size': 1, 'allocation_ratio': 9999.0, 'reserved': 0}}] for resource provider with UUID <UUID>. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n resource provider generation conflict ", "code": "placement.concurrent_update", "request_id": "req-UUID"}]}
```
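If console access is limited, a quick way to surface these events is to search the source host's ostackhost logs for the evacuation and placement-conflict messages shown above. This is a minimal sketch; the log glob and grep patterns are assumptions based on the excerpt and may need adjusting for your deployment:

```bash
# On the source host: list evacuation, claim, migration, and placement-conflict
# entries recorded by nova-compute (ostackhost). Adjust the glob if your
# deployment names or rotates these logs differently.
sudo grep -hE "Evacuating instance|Claim successful|Updating resource usage from migration|placement.concurrent_update" \
  /var/log/pf9/ostackhost.log*

# Narrow the results to a single VM once you have its UUID (placeholder below).
sudo grep -h "<INSTANCE_ID>" /var/log/pf9/ostackhost.log* | grep -E "Evacuating|migration"
```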
- Temporarily Disable VMHA:
- Disable VM High Availability (VMHA) across the affected clusters to prevent automatic evacuations.
- Confirm there are no ongoing evacuation events in the HA agent logs (/var/log/pf9/ha/ha-agent.log) on the source host.
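A simple way to confirm the agent has gone quiet is to watch the tail of the HA agent log on the source host. The grep pattern below is an assumption about the message wording, not a documented format:

```bash
# Follow the HA agent log and highlight anything that looks like an
# evacuation or host-down event; stop (Ctrl+C) once no new matches appear.
sudo tail -n 200 -f /var/log/pf9/ha/ha-agent.log | grep -iE "evacuat|host.*down"
```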
- Recover the VMs:
- For Linux VMs: Boot into recovery mode and run fsck -y /dev/<BOOT-VOLUME> (see the sketch after this list).
- For Windows VMs: Mount the boot volume to a helper instance, run a file system checker tool (such as chkdsk) and restore critical registry files as needed, then reattach the volume to the original VM.
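For the Linux path, a minimal recovery sketch is shown below. It assumes a recovery/rescue shell on the affected VM; <BOOT-VOLUME> is a placeholder for the actual boot device reported by lsblk:

```bash
# From a recovery/rescue shell attached to the affected VM:
lsblk -f                                        # identify the boot volume and its file system
umount /dev/<BOOT-VOLUME> 2>/dev/null || true   # fsck must run against an unmounted volume
fsck -y /dev/<BOOT-VOLUME>                      # repair, answering "yes" to all prompts
reboot                                          # boot normally and verify the VM comes up clean
```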
- Validate Cinder and NFS Mounts:
- Check all Cinder hosts for correct NFS mount configurations as per the cluster blueprints.
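A hedged spot-check for each Cinder host is sketched below; <NFS_SERVER> and <EXPORT_PATH> are placeholders whose values should come from the cluster blueprint:

```bash
# List all active NFS mounts on the Cinder host and compare them with
# the cluster blueprint.
findmnt -t nfs,nfs4

# Confirm the expected export is mounted with the expected options.
grep "<NFS_SERVER>:<EXPORT_PATH>" /proc/mounts

# Optionally (requires nfs-utils), confirm the server still exposes the export.
showmount -e <NFS_SERVER>
```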
Cause
VMHA (Virtual Machine High Availability) evacuation events triggered by network connectivity issues resulted in multiple VM instances simultaneously accessing the same storage volumes, causing data corruption.
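One way to confirm this failure mode during an incident, sketched under the assumption that libvirt/QEMU is the hypervisor layer, is to check whether the same instance is left running on both the source and destination hosts after an evacuation; <INSTANCE_ID> is a placeholder:

```bash
# Run on both the source and the destination host.
# If the same instance UUID is listed as running on both, two QEMU
# processes have the same backing volume open.
sudo virsh list --uuid | grep -i "<INSTANCE_ID>"
ps aux | grep "[q]emu" | grep -i "<INSTANCE_ID>"
```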
Resolution
The long-term fix is available in the PCD October release (tracked under PCD-4211 and PCD-4212).