Volume Migration Fails with Cannot Acquire State Change Lock Error

Problem

Block Storage volume migration via openstack volume migrate fails, and the volume transitions to the error state. The failure occurs during the final swap phase, after the data copy has already completed. The Compute Service log on the source compute host shows:

/var/log/pf9/ostackhost.log
Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)

The migration appears to complete the data copy successfully but then fails at the point where the Compute Service signals the volume attachment swap.

Environment

  • Private Cloud Director Virtualization - All versions

  • Self-Hosted Private Cloud Director Virtualization - All versions

  • Component: Block Storage Service, Compute Service

Cause

When a live volume migration runs, the Block Storage Service issues a swap_volume call to the Compute Service. The Compute Service responds by executing a _swap_volume operation in the Compute Service libvirt driver, which calls blockJobInfo() periodically to check whether the block copy is complete.

libvirt holds the per-domain state change lock while remoteDispatchConnectGetAllDomainStats is executing. This routine iterates through every running VM on the hypervisor to collect statistics. It is triggered by the Compute Service get_diagnostics API call and by other monitoring clients that poll the hypervisor for VM state.

If blockJobInfo() attempts to acquire this same lock while remoteDispatchConnectGetAllDomainStats is still running, the call waits up to 30 seconds. If the stats collection does not finish within 30 seconds, blockJobInfo() raises a libvirtError. The unpatched Compute Service code treats this as a fatal error and puts the migration into the error state, even though the underlying block copy completed successfully.

The issue is more likely to occur when:

  • The hypervisor is running a large number of VMs (more VMs mean a longer stats collection cycle and therefore a longer lock hold time)

  • Monitoring tooling or automation is polling get_diagnostics at high frequency

  • The storage backend is under elevated I/O load, which slows the stats collection cycle

Diagnostics

Step 1 — Check the failed volume state and migration status

Retrieve the volume details, for example with openstack volume show [VOLUME_ID], to confirm the migration failed and identify the source compute host.

Note the server_id from the attachments field — this is the VM UUID whose compute host will have the relevant logs.

Step 2 — Identify the source compute host

Retrieve the hypervisor where the attached VM is running, for example with openstack server show [VM_UUID].

SSH into [COMPUTE_HOST] to check the Hostagent log in Step 3.

Step 3 — Confirm the lock contention error in the Hostagent log

Search the Hostagent log on the source compute host for the UUID of the affected VM, and confirm that the cannot acquire state change lock error shown in the Problem section is present.

The presence of remoteDispatchConnectGetAllDomainStats in the error confirms libvirt lock contention as the root cause. The block copy itself completed — the failure occurred only during the status check.
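The log check in this step can be sketched in Python as follows. This is a minimal illustration rather than a supported tool: the function names are hypothetical, and the optional UUID filter assumes the VM UUID appears on the same log line as the error, which may not hold for every log format.

```python
# Strings taken from the error message quoted in the Problem section.
LOCK_ERROR = "cannot acquire state change lock"
STATS_CALL = "remoteDispatchConnectGetAllDomainStats"


def find_lock_errors(lines, vm_uuid=None):
    """Return log lines that contain the lock timeout error.

    If vm_uuid is given, keep only lines that also mention that UUID
    (assumption: the UUID and the error share a line).
    """
    hits = []
    for line in lines:
        if LOCK_ERROR not in line:
            continue
        if vm_uuid is not None and vm_uuid not in line:
            continue
        hits.append(line.rstrip("\n"))
    return hits


def held_by_stats(line):
    """True if the lock holder named in the line is the stats routine."""
    return STATS_CALL in line


# Example usage on the compute host (path as quoted in the Problem section):
#   with open("/var/log/pf9/ostackhost.log", errors="replace") as fh:
#       for hit in find_lock_errors(fh, "[VM_UUID]"):
#           print(held_by_stats(hit), hit)
```

If the UUID filter returns nothing, running the search without it shows whether the error occurs at all on that host.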

Step 4 — Verify the volume can be retried after resetting its migration state

Before applying the workaround, reset the failed volume to available or in-use so it can be migrated again. A volume whose migration status is stuck in error must be reset before it accepts a new migration request.

For full details on resetting volumes from various stuck states, refer to How To Reset Volume States.

Workaround

The workaround patches the Compute Service libvirt driver on each compute host to handle lock contention gracefully. Instead of treating the 30-second timeout as a fatal error, the patched code catches the libvirtError and retries the status check after a short wait. Once remoteDispatchConnectGetAllDomainStats finishes and releases the lock, the retry succeeds and the migration completes normally.
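The retry behavior described above can be illustrated with a short Python sketch. This is not the actual guest.py patch: LockTimeoutError is a stand-in for libvirtError so the example runs without libvirt installed, and job_complete_with_retry, get_job_info, and the retry and delay values are hypothetical names and defaults chosen for illustration.

```python
import time


class LockTimeoutError(Exception):
    """Stand-in for libvirt.libvirtError (assumption, keeps the sketch standalone)."""


def job_complete_with_retry(get_job_info, retries=5, delay=2.0, sleep=time.sleep):
    """Run one block job status check, retrying on lock timeouts.

    get_job_info is a hypothetical callable wrapping blockJobInfo(); it is
    assumed to return a dict with 'cur' and 'end' progress counters.
    Returns True when the copy is complete, False when it is still running.
    """
    for attempt in range(retries + 1):
        try:
            info = get_job_info()
        except LockTimeoutError:
            if attempt == retries:
                raise  # lock was never released: surface the original error
            sleep(delay)  # give the stats collection time to release the lock
            continue
        return info["end"] > 0 and info["cur"] == info["end"]
```

The key design point is that a lock timeout is treated as "status unknown, ask again" rather than "copy failed", while the retry budget still bounds how long the check can stall.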

Patch guest.py on Compute Hosts

Step 1 — Back up the original guest.py on the compute host

SSH into the compute host and create a timestamped backup of the original file before replacing it.

Confirm the backup exists before continuing:
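The backup and its verification can be sketched in Python as follows. This is an illustrative sketch only: backup_file is a hypothetical helper, the .bak-TIMESTAMP suffix is an assumed naming scheme, and the location of guest.py must be supplied by the operator because it is not stated in this article.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, now=None):
    """Copy path to a timestamped sibling file and verify the copy.

    now is an optional epoch timestamp (defaults to the current time).
    Returns the Path of the backup.
    """
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    dest = src.with_name(f"{src.name}.bak-{stamp}")  # assumed naming scheme
    shutil.copy2(src, dest)  # copy2 also preserves mode and timestamps
    # Confirm the backup exists and has the same size before continuing.
    if not dest.is_file() or dest.stat().st_size != src.stat().st_size:
        raise RuntimeError(f"backup verification failed: {dest}")
    return dest
```

Calling backup_file with the full path to guest.py on the compute host returns the path of the backup, which can be recorded for a later rollback.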

Step 2 — Patch guest.py on the compute host

Apply the patch code highlighted below to the guest.py file on the compute host:

Step 3 — Restart the Hostagent and libvirt services

Restart both services to load the patched file.

Step 4 — Verify both services are running

Repeat Steps 1–4 on every compute host in the affected region.

Step 5 — Test a volume migration to confirm the fix

After patching all compute hosts, retry a migration for one of the previously failing volumes.

Monitor the migration status until it reaches success and the volume migstat field clears:

A successful migration with no VolumeRebaseFailed error in the Hostagent log confirms the patch is working.
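The monitoring loop in this step can be sketched in Python as follows. It is a hedged illustration: wait_for_migration and fetch_status are hypothetical names, fetch_status stands for whatever the operator uses to read the volume's migration status (for example, a wrapper around the CLI), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_migration(fetch_status, timeout=600.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the migration succeeds, fails, or times out.

    fetch_status is a hypothetical callable returning the volume's current
    migration status as a string such as "migrating", "success", or "error".
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == "success":
            return status
        if status == "error":
            raise RuntimeError("volume migration reported error")
        if clock() >= deadline:
            raise TimeoutError("migration did not finish within the timeout")
        sleep(interval)  # wait before polling again
```

Keeping the poll interval at several seconds avoids adding unnecessary API load while the migration is in progress.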

Resolution

This is a known issue, reported as PCD-7824. The guest.py patch described in this article is an interim workaround that has been validated in production environments. Contact Platform9 Support for the latest status of the official fix or for assistance applying the patch across a large number of compute hosts.

Additional Information

Reapply the changes after each PCD upgrade, because the upgrade process reverts them. This remains necessary until a permanent fix is included in a future release.
