# Virtual Machine Migration Issues

## Problem

This guide provides step-by-step instructions for troubleshooting and resolving issues that arise when a virtual machine (VM) migration fails.

## Environment

* Private Cloud Director Virtualization
* Self-Hosted Private Cloud Director Virtualization

## Various VM Migration Methods

* **Live Migration** (Shared Storage): Move a running instance seamlessly between compute nodes that share the same underlying storage backend (like Ceph or NFS).
* **Block Live Migration**: Move a running instance, including its local disk, to another compute node over the network without requiring shared storage.
* **Cold Migration**: Shut down the instance, move its data to a new host, and power it back on. Often used for hardware maintenance or resizing.
* **Evacuation**: Rescue instances from a failed compute node by rebuilding them on an available, healthy host.
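Each method above maps roughly to an OpenStack CLI call. The sketch below assumes a recent python-openstackclient; exact flag names (notably `--live-migration`) vary slightly between client versions, and older clients use `nova evacuate` instead of `openstack server evacuate`.

```bash
# Cold migration: instance is stopped, moved, and restarted
openstack server migrate <VM_UUID>

# Live migration over shared storage (scheduler picks the destination
# unless a host is specified)
openstack server migrate --live-migration <VM_UUID>

# Block live migration: copies the local disk over the network as well
openstack server migrate --live-migration --block-migration <VM_UUID>

# Evacuation: rebuild instances from a failed compute node on a healthy host
openstack server evacuate <VM_UUID>

# Confirm a completed cold migration/resize
openstack server resize confirm <VM_UUID>
```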

## Deep Dive

The Private Cloud Director VM migration process is orchestrated primarily by the Compute service (Nova). This flow involves a complex interaction between the source host, the destination host, and the management plane to ensure data consistency and minimal network interruption.

{% hint style="info" %}
NOTE

The logs below can only be reviewed in Self-Hosted Private Cloud Director.
{% endhint %}

#### Step 1: User Request & API Validation

This is the initial stage where the migration request is received and validated.

* User Request: A user submits a request to migrate a VM. The `nova-api-osapi` pod validates the authentication token with Keystone.
* State Check: Nova API checks if the VM is in a valid state for the requested migration type (e.g., `ACTIVE` for live migration). The output below shows the migration action was accepted with a `202` status.

  <pre class="language-bash" data-title="Sample Logs"><code class="lang-bash">$ kubectl logs deployment/nova-api-osapi -n &#x3C;WORKLOAD_REGION> | grep "POST /v2.1" 
  INFO nova.osapi_compute.wsgi.server [None [REQ_ID] [USER_ID] [TENANT_ID] - - default default] [IP] "POST /v2.1/[tenant_id]/servers/[VM_UUID]/action HTTP/1.1" status: 202 len: [.] time: [.]
  </code></pre>

{% hint style="info" %}
A unique **`REQ_ID`** is generated here; it is used to track the request across the other component logs.
{% endhint %}

* Database Update: The `nova-conductor` service updates the database, setting the VM task state to `migrating`.
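While the request is in flight, the transitional state set by `nova-conductor` can be observed from the CLI (the column names below are the standard Nova extended server attributes):

```bash
# Expect OS-EXT-STS:task_state to read "migrating" while the operation runs
openstack server show <VM_UUID> -c status -c OS-EXT-STS:task_state -c OS-EXT-SRV-ATTR:host
```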

#### Step 2: Scheduling & Destination Selection

Unless a specific target host is forced by the user, the request goes to the Nova Scheduler.

* Host Filtering: The scheduler evaluates available compute nodes, filtering out the source host and checking for sufficient RAM, CPU, and compatible CPU architectures.

  <pre class="language-bash" data-title="Sample Logs"><code class="lang-bash">$ kubectl logs deployment/nova-scheduler -n &#x3C;WORKLOAD_REGION> | grep &#x3C;REQ_ID>
  INFO nova.scheduler.filter_scheduler [None [REQ_ID] [USER_ID] [TENANT_ID] - - default default] [instance: [VM_UUID]] Filter ComputeFilter returned 2 host(s)
  </code></pre>
* Placement Claim: The scheduler queries the Placement API to reserve resources on the selected destination host.

  <pre class="language-bash" data-title="Sample Logs"><code class="lang-bash">$ kubectl logs deployment/placement-api -n &#x3C;WORKLOAD_REGION> | grep &#x3C;REQ_ID>
  INFO placement.requestlog [[REQ_ID] [REQ_ID] [USER_ID] [TENANT_ID] - - default default] [IP] "PUT /allocations/[VM_UUID]" status: 204 len: 0 microversion: 1.36
  </code></pre>
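If the osc-placement client plugin is installed (it is not part of the base openstackclient, so this is an assumption about your tooling), the allocation claimed in Placement can be inspected directly:

```bash
# Show which resource provider (host) now holds the VM's VCPU/RAM/disk claim
openstack resource provider allocation show <VM_UUID>
```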

#### Step 3: Pre-Migration & Resource Setup

The source and destination `pf9-ostackhost` services begin communicating to prepare the environment.

* Destination Setup: The destination `pf9-ostackhost` creates the necessary XML definitions for the hypervisor (Libvirt) and prepares virtual network interfaces (VIFs) by communicating with Neutron (Networking Service).

  <pre class="language-bash" data-title="Sample Log"><code class="lang-bash">$ less /var/log/pf9/ostackhost.log | grep &#x3C;REQ_ID>
  INFO nova.compute.manager [[REQ_ID] [USERNAME] service] [instance: [VM_UUID]] pre_live_migration data is {'graphics_listen_addr_vnc': '[DEST_IP]', 'graphics_listen_addr_spice': '[DEST_IP]', 'serial_listen_addr': '[DEST_IP]', 'target_connect_addr': '[DEST_IP]', 'macs': ['[MAC_ADDRESS]']}
  </code></pre>
* Storage Setup: If volumes are attached, Cinder is called to ensure the destination host has the necessary storage multipath connections or iSCSI sessions established.

{% hint style="warning" %}
For Live Migration, Libvirt on the source host must be able to securely communicate with Libvirt on the destination host via TLS. Certificate validation failures (expired certs, untrusted CA, hostname mismatches) are a leading cause of migration aborted errors.
{% endhint %}
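A quick way to rule out certificate problems is to inspect the Libvirt TLS material on both hosts and test the secure port end to end. The `/etc/pki/libvirt` and `/etc/pki/CA` paths below are Libvirt's defaults and may differ in your deployment:

```bash
# Check expiry and subject of the Libvirt server certificate on each host
openssl x509 -in /etc/pki/libvirt/servercert.pem -noout -enddate -subject

# From the source host, perform a TLS handshake against the destination's
# migration port; "Verify return code: 0 (ok)" indicates a trusted chain
openssl s_client -connect <DEST_HOST>:16514 -CAfile /etc/pki/CA/cacert.pem </dev/null
```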

#### Step 4: Memory/Disk Transfer & Cutover

The actual movement of data occurs at the hypervisor level.

* Data Transfer (Live): Libvirt continuously copies the VM's memory pages from the source to the destination over the TLS-secured connection. If Block Live Migration is used, the disk is also copied over the network.
* Switchover: Once the memory is fully synchronized, the VM is paused on the source for a fraction of a second, final changes are copied, and the VM resumes on the destination.

  <pre class="language-bash" data-title="Sample Logs"><code class="lang-bash">$ less /var/log/pf9/ostackhost.log | grep &#x3C;REQ_ID>
  INFO nova.compute.manager [[REQ_ID] [USERNAME] service] [instance: [VM_UUID]] _post_live_migration() is started..
  INFO nova.compute.manager [[REQ_ID] [USERNAME] service] [instance: [VM_UUID]] Live migration succeeded.
  </code></pre>
* Network Update: Neutron updates the network port bindings to the new host and sends a Gratuitous ARP (GARP) to update physical network switches.
* Cleanup: The source `nova-compute` destroys the old Libvirt domain and cleans up local storage. `nova-conductor` updates the VM state back to `ACTIVE` on the new host.
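While the transfer is running, progress can be followed from either the API or the source hypervisor. For `virsh`, the argument is the Libvirt domain name of the instance (e.g. `instance-000000xx`), which is an assumption about your naming:

```bash
# API-side view of the in-flight migration
openstack server migration list --server <VM_UUID>

# On the source host: bytes remaining, memory iterations, and downtime target
sudo virsh domjobinfo <LIBVIRT_DOMAIN_NAME>
```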

## Procedure

#### 1. Get the VM and Migration Status

Use the CLI to check the error message and the specific migration record.

{% hint style="info" %}
Look for the `fault` field in the server show output, and check the `status` in the migration list.
{% endhint %}

```bash
$ openstack server show <VM_UUID>
$ openstack server migration list --server <VM_UUID>
```

#### 2. Validate Compute & Storage Service Status

Ensure the compute services on both the source and target hosts are up and enabled.

```bash
$ openstack compute service list
$ openstack volume service list 
```

#### 3. Trace the VM Events

Retrieve the Request ID (`REQ_ID`) from the server event list to track the specific migration failure across different component logs.

```bash
$ openstack server event list <VM_UUID>
$ openstack server event show <VM_UUID> <REQ_ID>
```

#### 4. Review the Pods and Their Logs on the Management Plane

{% hint style="info" %}
This step is applicable only for the Self-Hosted Private Cloud Director.
{% endhint %}

Check the management plane pods to see if the Scheduler failed to find a host or if the Conductor hit an error. Review pod logs using the `REQ_ID`.

```bash
$ kubectl get pods -o wide -n <WORKLOAD_REGION> | grep -i "nova"
$ kubectl logs -n <WORKLOAD_REGION> <NOVA_SCHEDULER_POD> | grep <REQ_ID>
$ kubectl logs -n <WORKLOAD_REGION> <NOVA_CONDUCTOR_POD> | grep <REQ_ID>
```

#### 5. Validate Target Host Capacity & Requirements

Check if the destination hypervisor actually has the resources and matches the CPU architecture of the source host.

```bash
$ openstack hypervisor stats show
$ openstack hypervisor show <TARGET_HYPERVISOR_NAME>
```
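CPU compatibility can also be checked directly on the hypervisors. The flag diff below is a quick sketch, not the authoritative check Nova performs, and assumes SSH access to both nodes:

```bash
# Compare CPU model and feature flags between the two hosts; any diff output
# flags a potential live-migration incompatibility
diff <(ssh <SOURCE_HOST> "grep -E '^(model name|flags)' /proc/cpuinfo | sort -u") \
     <(ssh <TARGET_HOST> "grep -E '^(model name|flags)' /proc/cpuinfo | sort -u")

# Or ask Libvirt on the target whether the source host's CPU definition is
# usable there (source-cpu.xml holds the source's <cpu> XML, copied over first)
sudo virsh cpu-compare source-cpu.xml
```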

#### 6. Validate Availability Zones & Host Aggregates

Ensure the destination host belongs to the correct Availability Zone and possesses the required aggregate metadata matching the VM's flavor.

```bash
# 1. Check the VM's current Availability Zone
$ openstack server show <VM_UUID> 

# 2. Check the Target Host's assigned aggregates and AZs
$ openstack availability zone list --long
$ openstack aggregate list --long

# 3. (Optional) If the VM uses a specific flavor, verify if it requires specific aggregate metadata
$ openstack flavor show <FLAVOR_ID> 
```

#### 7. Validate the Service Status on the Source and Target Hypervisors

Validate that the compute and virtualization daemons are running on both compute nodes involved in the migration. If the VM has attached volumes, also verify the Cinder volume service on the node holding the persistent storage role.

```bash
# On Source and Target Compute Nodes:
$ sudo systemctl status pf9-hostagent
$ sudo systemctl status pf9-ostackhost 
$ sudo systemctl status libvirtd
$ sudo systemctl status ovn-controller
$ sudo systemctl status pf9-neutron-ovn-metadata-agent

# On the designated Storage Node (if applicable):
$ sudo systemctl status pf9-cindervolume-base
```

#### 8. Check the Logs on the Hypervisors

Review the compute logs on the Source and Target compute nodes. Search for the `REQ_ID` or `VM_UUID`.

If the VM is volume-backed, locate the storage host to check the Cinder logs.

Find the Storage Node (If VM has volumes):

```bash
# 1. Get the attached volume ID
$ openstack server show <VM_UUID> 

# 2. Identify the host managing the volume
$ openstack volume show <VOLUME_ID> 
```

Check Logs on respective nodes:

```bash
# On Source and Target Compute Nodes:
$ less /var/log/pf9/ostackhost.log
$ less /var/log/libvirt/libvirtd.log
$ less /var/log/libvirt/qemu/<VM_INSTANCE_NAME>.log

# On the identified Storage Node:
$ less /var/log/pf9/cindervolume-base.log
```

## Most Common Causes

* **Libvirt TLS Communication Failure**: Firewalls blocking the secure migration port (TCP 16514) or issues with the TLS certificates (e.g., expired certs, mismatched hostnames, or untrusted CA) preventing the nodes from authenticating each other.
* **Availability Zone / Host Aggregate Mismatch**: The destination compute node belongs to a different Availability Zone than the source, or it lacks the specific Host Aggregate metadata required by the VM's flavor or image properties, causing the Nova scheduler to reject the placement.
* **CPU Flag Incompatibility**: The destination host has a different or older CPU architecture than the source host, preventing live migration.
* **Insufficient Resources**: The target host lacks the required RAM or disk space to accept the incoming VM.
* **Storage/Volume Issues**: The target host cannot access the shared storage backend, or `cinder-volume` (running on the storage node) cannot successfully map the volume via iSCSI/multipath to the destination host.
* **Network Port Binding Failures**: Neutron fails to bind the VM's network interface to the OVS/OVN bridge on the new host.
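For the TLS and firewall causes above, a fast first check is plain port connectivity from the source compute node. This assumes `nc` is available; 16514 is Libvirt's TLS port, and 49152 is the start of QEMU's default migration port range (49152-49215), both of which may be changed by local configuration:

```bash
# From the source compute node, verify the destination's migration ports are reachable
nc -zv <DEST_HOST> 16514
nc -zv <DEST_HOST> 49152
```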
