
# Volume Migration Fails with Cannot Acquire State Change Lock Error

## Problem

Block Storage volume migration via `openstack volume migrate` fails and the volume's migration status (`migstat`) transitions to `error`. The failure occurs during the final swap phase, after the data copy has already completed. The Compute Service log on the source compute host shows:

{% code title="/var/log/pf9/ostackhost.log" %}

```
Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)
```

{% endcode %}

The migration appears to complete the data copy successfully but then fails at the point where the Compute Service signals the volume attachment swap.

## Environment

* Private Cloud Director Virtualization - **All versions**
* Self-Hosted Private Cloud Director Virtualization - **All versions**
* Component: Block Storage Service, Compute Service

## Cause

When a live volume migration runs, the Block Storage Service issues a `swap_volume` call to the Compute Service. The Compute Service responds by executing a `_swap_volume` operation in the Compute Service libvirt driver, which calls `blockJobInfo()` periodically to check whether the block copy is complete.

libvirt holds the **per-domain state change lock** during the execution of `remoteDispatchConnectGetAllDomainStats`, a routine that iterates through every running VM on the hypervisor to collect statistics. This routine is triggered by the Compute Service `get_diagnostics` API call and by other monitoring clients that poll the hypervisor for VM state.

If `blockJobInfo()` attempts to acquire this same lock while `remoteDispatchConnectGetAllDomainStats` is still running, the call waits up to 30 seconds. If the stats collection does not finish within those 30 seconds, `blockJobInfo()` raises a `libvirtError`. The unpatched Compute Service (Nova) code treats this as a fatal error and sets the migration status to `error`, **even though the underlying block copy completed successfully**.

The issue is more likely to occur when:

* The hypervisor is running a large number of VMs (more VMs = longer stats collection cycle = longer lock hold time)
* Monitoring tooling or automation is polling `get_diagnostics` at high frequency
* The storage backend is under elevated I/O load, which slows the stats collection cycle
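
The timing relationship described above can be illustrated with a small, self-contained model. This is only a sketch: it uses a plain `threading.Lock` in place of libvirt's lock, the names `collect_stats`, `query_block_job`, `run_scenario`, and `LockTimeout` are hypothetical, and the durations are scaled down from the real 30-second wait.

```python
import threading
import time


class LockTimeout(Exception):
    """Stand-in for the libvirtError raised when the lock wait expires."""


def collect_stats(lock, hold_seconds, holding):
    """Simulate a stats collection cycle that holds the lock while it runs."""
    with lock:
        holding.set()  # tell the caller that the lock is now held
        time.sleep(hold_seconds)


def query_block_job(lock, timeout_seconds):
    """Simulate a status query that needs the same lock, with a bounded wait."""
    if not lock.acquire(timeout=timeout_seconds):
        raise LockTimeout("cannot acquire state change lock")
    try:
        return "job status"
    finally:
        lock.release()


def run_scenario(hold_seconds, timeout_seconds):
    """Start the stats collection, then attempt the status query."""
    lock = threading.Lock()
    holding = threading.Event()
    holder = threading.Thread(
        target=collect_stats, args=(lock, hold_seconds, holding)
    )
    holder.start()
    holding.wait()  # make sure the collector owns the lock first
    try:
        return query_block_job(lock, timeout_seconds)
    except LockTimeout as exc:
        return f"error: {exc}"
    finally:
        holder.join()


if __name__ == "__main__":
    # Short collection cycle: the query waits briefly, then succeeds.
    print(run_scenario(hold_seconds=0.05, timeout_seconds=2.0))
    # Long collection cycle: the query gives up before the lock is released.
    print(run_scenario(hold_seconds=0.4, timeout_seconds=0.05))
```

In this model the query fails only because the lock holder outlasts the wait, not because the queried work failed, which mirrors why the block copy can be complete while the migration is still marked as failed.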

## Diagnostics

{% stepper %}
{% step %}

#### Step 1 — Check the failed volume state and migration status

Retrieve the volume details to confirm the migration failed and identify the source compute host.

{% code title="OpenStack CLI" %}

```bash
$ openstack volume show <VOLUME_UUID>
```

{% endcode %}

{% code title="Sample Output" %}

```bash
+--------------------------------------+----------------------------------------------------------+
| Field                                | Value                                                    |
+--------------------------------------+----------------------------------------------------------+
| os-vol-mig-status-attr:migstat       | error                                                    |
| os-vol-host-attr:host                | [CINDER_HOST]@[BACKEND]#[NFS_PATH]                       |
| status                               | in-use                                                   |
| attachments                          | [{'server_id': '[VM_UUID]', ...}]                        |
+--------------------------------------+----------------------------------------------------------+
```

{% endcode %}

Note the `server_id` from the `attachments` field — this is the VM UUID whose compute host will have the relevant logs.
{% endstep %}

{% step %}

#### Step 2 — Identify the source compute host

Retrieve the hypervisor where the attached VM is running.

{% code title="OpenStack CLI" %}

```bash
$ openstack server show <VM_UUID> | grep "OS-EXT-SRV-ATTR:host"
```

{% endcode %}

{% code title="Sample Output" %}

```bash
| OS-EXT-SRV-ATTR:host | [COMPUTE_HOST] |
```

{% endcode %}

SSH into `[COMPUTE_HOST]` to check the Hostagent log in [Step 3](#step-3--confirm-the-lock-contention-error-in-the-hostagent-log).
{% endstep %}

{% step %}

#### Step 3 — Confirm the lock contention error in the Hostagent log

Search the Hostagent log on the source compute host for the failed VM UUID. Confirm the `cannot acquire state change lock` error is present.

{% code title="Compute Host" %}

```bash
$ sudo grep -i "<VM_UUID>" /var/log/pf9/ostackhost.log | grep -i "state change lock\|VolumeRebaseFailed"
$ sudo zcat /var/log/pf9/ostackhost.log.*.gz | grep -i "<VM_UUID>" | grep -i "state change lock\|VolumeRebaseFailed"
```

{% endcode %}

{% code title="Sample Output" %}

```bash
[TIMESTAMP] ERROR nova.compute.manager [REQ_UUID] [VM_UUID] Failed to swap volume [OLD_VOLUME_UUID] for [NEW_VOLUME_UUID]: nova.exception.VolumeRebaseFailed: Volume rebase failed: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)
```

{% endcode %}

The presence of `remoteDispatchConnectGetAllDomainStats` in the error confirms libvirt lock contention as the root cause. The block copy itself completed — the failure occurred only during the status check.
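
Where a script is more convenient than `grep`, the same filter can be sketched in Python. The function names `open_log` and `find_lock_contention` are hypothetical helpers written for this example; the matching rules simply mirror the case-insensitive patterns in the commands above.

```python
import gzip

# Case-insensitive signatures, matching the grep patterns used in this step.
SIGNATURES = ("state change lock", "volumerebasefailed")


def open_log(path):
    """Open a plain or gzip-rotated log file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def find_lock_contention(lines, vm_uuid):
    """Return lines that mention the VM UUID and one of the signatures."""
    needle = vm_uuid.lower()
    matches = []
    for line in lines:
        lowered = line.lower()
        if needle in lowered and any(sig in lowered for sig in SIGNATURES):
            matches.append(line.rstrip("\n"))
    return matches
```

A typical call would be `with open_log("/var/log/pf9/ostackhost.log") as fh: hits = find_lock_contention(fh, "<VM_UUID>")`, run with sufficient privileges to read the log.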
{% endstep %}

{% step %}

#### Step 4 — Verify the volume can be retried after resetting its migration state

Before applying the workaround, reset the failed volume to `available` or `in-use` so it can be migrated again. A volume stuck in `error` migration state must be reset before it accepts a new migration request.

{% hint style="info" %}
For full details on resetting volumes from various stuck states, refer to [How To Reset Volume States](https://platform9.com/kb/pcd/storage/how-to-reset-volume-states).
{% endhint %}

{% code title="OpenStack CLI" %}

```bash
$ openstack volume set --state in-use <VOLUME_UUID>
$ openstack volume show <VOLUME_UUID> | grep "migstat\|status"
```

{% endcode %}

{% code title="Sample Output" %}

```bash
| os-vol-mig-status-attr:migstat | None   |
| status                         | in-use |
```

{% endcode %}
{% endstep %}
{% endstepper %}

## Workaround

The workaround patches the Compute Service libvirt driver on each compute host to handle lock contention gracefully. Instead of treating the 30-second timeout as a fatal error, the patched code catches the `libvirtError` and retries the status check after a short wait. Once `remoteDispatchConnectGetAllDomainStats` finishes and releases the lock, the retry succeeds and the migration completes normally.
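
The retry behaviour can be sketched in isolation as follows. This is an illustrative model, not the Nova code: `FakeLibvirtError` stands in for `libvirt.libvirtError`, `wait_for_job` is a hypothetical polling loop, and the completion rule (no job record, or `cur == end`) is a simplification chosen for the example.

```python
import time

LOCK_MESSAGE = "cannot acquire state change lock"


class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError so the sketch runs without libvirt."""


def is_job_complete(get_job_info):
    """Return True when the job is done, False when it should be polled again."""
    try:
        status = get_job_info()
    except FakeLibvirtError as exc:
        if LOCK_MESSAGE in str(exc):
            # Transient contention: the status is unknown, so poll again later.
            return False
        raise  # any other error is still treated as fatal
    # Simplified rule for this sketch: a missing job record counts as complete.
    return status is None or status["cur"] == status["end"]


def wait_for_job(get_job_info, interval=0.5, max_polls=100):
    """Poll until the job completes; return the number of polls it took."""
    for attempt in range(1, max_polls + 1):
        if is_job_complete(get_job_info):
            return attempt
        time.sleep(interval)
    raise TimeoutError("block job did not complete within the polling budget")
```

The key design choice is that only the specific lock-contention message is swallowed. Every other error still propagates, so genuine block job failures are not hidden by the retry.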

{% hint style="warning" %}
Apply this patch to **every compute host** in the region where volume migration failures are occurring — not just the compute host where the last failure was observed. Any hypervisor running VMs with attached volumes can be the source host for a future migration.
{% endhint %}

### Apply the guest.py Patch to Compute Hosts

{% stepper %}
{% step %}

#### Step 1 — Back up the original guest.py on the compute host

SSH into the compute host and create a timestamped backup of the original file before replacing it.

{% code title="Compute Host" %}

```bash
$ sudo cp /opt/pf9/venv/lib/python3.9/site-packages/nova/virt/libvirt/guest.py \
    /opt/pf9/venv/lib/python3.9/site-packages/nova/virt/libvirt/guest.py.bak.$(date +%Y%m%d)
```

{% endcode %}

Confirm the backup exists before continuing:

{% code title="Compute Host" %}

```bash
$ ls -lh /opt/pf9/venv/lib/python3.9/site-packages/nova/virt/libvirt/guest.py*
```

{% endcode %}
{% endstep %}

{% step %}

#### Step 2 — Patch guest.py on the compute host

Edit the `is_job_complete` method in `guest.py` on the compute host so that the existing `status = self.get_job_info()` call is wrapped in the `try`/`except` block highlighted below:

<pre class="language-py" data-title="/opt/pf9/venv/lib/python3.9/site-packages/nova/virt/libvirt/guest.py" data-overflow="wrap"><code class="lang-py">    def is_job_complete(self):
        """Return True if the job is complete, False otherwise

        :returns: True if the job is complete, False otherwise
        :raises: libvirt.libvirtError on error fetching block job info
        """
        # NOTE(mdbooth): This method polls for block job completion. It returns
        # true if either we get a status which indicates completion, or there
        # is no longer a record of the job. Ideally this method and its
        # callers would be rewritten to consume libvirt events from the job.
        # This would provide a couple of advantages. Firstly, as it would no
        # longer be polling it would notice completion immediately rather than
        # at the next 0.5s check, and would also consume fewer resources.
        # Secondly, with the current method we only know that 'no job'
        # indicates completion. It does not necessarily indicate successful
        # completion: the job could have failed, or been cancelled. When
        # polling for block job info we have no way to detect this, so we
        # assume success.

<strong>        try:
</strong><strong>            status = self.get_job_info()
</strong><strong>        except libvirt.libvirtError as e:
</strong><strong>            if 'cannot acquire state change lock' in str(e):
</strong><strong>                # The domain lock is temporarily held by another libvirt
</strong><strong>                # operation (e.g. virConnectGetAllDomainStats from
</strong><strong>                # get_diagnostics). The block copy job is still running;
</strong><strong>                # we just cannot query its status right now. Return False
</strong><strong>                # so the caller retries after the next 0.5s sleep.
</strong><strong>                LOG.warning('Transient libvirt lock contention polling block '
</strong><strong>                            'job status for %(disk)s, will retry: %(err)s',
</strong><strong>                            {'disk': self._disk, 'err': e})
</strong><strong>                return False
</strong><strong>            raise
</strong>
        # If the job no longer exists, it is because it has completed

</code></pre>
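
To confirm that the edit is in place on a host, a small check can look for the warning text that the patch introduces. This is a sketch under stated assumptions: `is_patched` is a hypothetical helper, and it assumes the log message `Transient libvirt lock contention` appears only in the patched file.

```python
from pathlib import Path

# Path used throughout this article; adjust it if the host layout differs.
GUEST_PY = "/opt/pf9/venv/lib/python3.9/site-packages/nova/virt/libvirt/guest.py"

# Text from the warning message added by the patch (assumed unique to it).
MARKER = "Transient libvirt lock contention"


def is_patched(guest_py_path=GUEST_PY):
    """Return True if the file contains the warning text added by the patch."""
    text = Path(guest_py_path).read_text(encoding="utf-8")
    return MARKER in text


if __name__ == "__main__":
    print("patched" if is_patched() else "NOT patched")
```

A text match only shows that the lines are present. It does not validate indentation or syntax, so the service status check in Step 4 is still needed.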

{% endstep %}

{% step %}

#### Step 3 — Restart the Hostagent and libvirt services

Restart both services to load the patched file.

{% hint style="warning" %}
Restarting `pf9-ostackhost` briefly interrupts the Compute Service worker on this host. Confirm no critical VM operations are in flight before proceeding. Restarting `libvirtd` briefly disconnects libvirt from running VMs but does not affect VM uptime.
{% endhint %}

{% code title="Compute Host" %}

```bash
$ sudo systemctl restart libvirtd
$ sudo systemctl restart pf9-ostackhost
```

{% endcode %}
{% endstep %}

{% step %}

#### Step 4 — Verify both services are running

{% code title="Compute Host" %}

```bash
$ sudo systemctl status libvirtd pf9-ostackhost
```

{% endcode %}

{% code title="Sample Output" %}

```bash
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled)
   Active: active (running) since [TIMESTAMP]; [UPTIME] ago

● pf9-ostackhost.service - Platform9 OpenStack Host Agent
   Loaded: loaded (/lib/systemd/system/pf9-ostackhost.service; enabled)
   Active: active (running) since [TIMESTAMP]; [UPTIME] ago
```

{% endcode %}

Repeat Steps 1–4 on every compute host in the affected region.
{% endstep %}

{% step %}

#### Step 5 — Test a volume migration to confirm the fix

After patching all compute hosts, retry a migration for one of the previously failing volumes.

{% code title="OpenStack CLI" %}

```bash
$ openstack volume migrate <VOLUME_UUID> --host <TARGET_BACKEND_HOST>
```

{% endcode %}

Monitor the migration status until it reaches `success` and the volume `migstat` field clears:

{% code title="OpenStack CLI" %}

```bash
$ openstack volume show <VOLUME_UUID> | grep "migstat\|status"
```

{% endcode %}

{% code title="Sample Output — migration in progress" %}

```bash
| os-vol-mig-status-attr:migstat | migrating |
| status                         | in-use    |
```

{% endcode %}

{% code title="Sample Output — migration complete" %}

```bash
| os-vol-mig-status-attr:migstat | None          |
| os-vol-host-attr:host          | [NEW_BACKEND] |
| status                         | in-use        |
```

{% endcode %}

A successful migration with no `VolumeRebaseFailed` error in the Hostagent log confirms the patch is working.
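
The monitoring loop can also be scripted. The sketch below is deliberately decoupled from any particular client: `fetch_migstat` is a hypothetical, caller-supplied function that returns the current `migstat` value (for example by wrapping the `openstack volume show` command above), and the status values it checks are the ones shown in this article.

```python
import time

# States treated as finished. The CLI prints a cleared field as "None".
DONE_STATES = (None, "None", "success")


def wait_for_migration(fetch_migstat, interval=10.0, max_polls=60,
                       sleep=time.sleep):
    """Poll the migration status until it finishes, fails, or times out."""
    for _ in range(max_polls):
        migstat = fetch_migstat()
        if migstat == "error":
            raise RuntimeError("volume migration failed (migstat=error)")
        if migstat in DONE_STATES:
            return migstat
        # Any other value (for example "migrating") means still in progress.
        sleep(interval)
    raise TimeoutError("migration still in progress after the polling budget")
```

Because a cleared `migstat` is treated as finished, start polling only after the migration request has been accepted and the status shows it as in progress.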
{% endstep %}
{% endstepper %}

## Resolution

This is a known issue tracked as **PCD-7824**. The `guest.py` patch described in this article is an interim workaround validated in production environments. Contact [Platform9 Support](https://support.platform9.com/) for the latest status of the official fix or for assistance applying the patch across a large number of compute hosts.

## Additional Information

Reapply the patch after each PCD upgrade, because the upgrade process reverts the change. This remains necessary until a permanent fix is included in a future release.

