Virtual Machine Migration Issues

Problem

This guide provides step-by-step instructions for troubleshooting and resolving issues that arise when a virtual machine (VM) migration fails.

Environment

  • Private Cloud Director Virtualization

  • Self-Hosted Private Cloud Director Virtualization

VM Migration Methods

  • Live Migration (Shared Storage): Move a running instance seamlessly between compute nodes that share the same underlying storage backend (like Ceph or NFS).

  • Block Live Migration: Move a running instance, including its local disk, to another compute node over the network without requiring shared storage.

  • Cold Migration: Shut down the instance, move its data to a new host, and power it back on. Often used for hardware maintenance or resizing.

  • Evacuation: Rescue instances from a failed compute node by rebuilding them on an available, healthy host.

Deep Dive

The Private Cloud Director VM migration process is orchestrated primarily by the Compute service (Nova). This flow involves a complex interaction between the source host, the destination host, and the management plane to ensure data consistency and minimal network interruption.

NOTE

The management plane logs referenced in this section can only be reviewed in Self-Hosted Private Cloud Director.

Step 1: User Request & API Validation

This is the initial stage where the migration request is received and validated.

  • User Request: A user submits a request to migrate a VM. The nova-api-osapi pod validates the authentication token with Keystone.

  • State Check: Nova API checks whether the VM is in a valid state for the requested migration type (e.g., ACTIVE for live migration). When the request is accepted, the nova-api-osapi log records the migration action with a 202 status.

A unique request ID (REQ_ID) is generated at this point. Use it to track the request in the logs of the other components.

  • Database Update: The nova-conductor service updates the database, setting the VM task state to migrating.

Step 2: Scheduling & Destination Selection

Unless a specific target host is forced by the user, the request goes to the Nova Scheduler.

  • Host Filtering: The scheduler evaluates available compute nodes, filtering out the source host and checking for sufficient RAM, CPU, and compatible CPU architectures.

  • Placement Claim: The scheduler queries the Placement API to reserve resources on the selected destination host.

Step 3: Pre-Migration & Resource Setup

The source and destination pf9-ostackhost services begin communicating to prepare the environment.

  • Destination Setup: The destination pf9-ostackhost creates the necessary XML definitions for the hypervisor (Libvirt) and prepares virtual network interfaces (VIFs) by communicating with Neutron (Networking Service).

  • Storage Setup: If volumes are attached, Cinder is called to ensure the destination host has the necessary storage multipath connections or iSCSI sessions established.

Step 4: Memory/Disk Transfer & Cutover

The actual movement of data occurs at the hypervisor level.

  • Data Transfer (Live): Libvirt iteratively copies the VM's memory pages from the source to the destination over a TLS-secured connection, re-sending pages that the guest modifies during the copy. If Block Live Migration is used, the disk is also copied over the network.

  • Switchover: Once the remaining modified memory is small enough to transfer quickly, the VM is paused on the source (typically for a fraction of a second), the final changes are copied, and the VM resumes on the destination.

  • Network Update: Neutron updates the network port bindings to the new host and sends a Gratuitous ARP (GARP) to update physical network switches.

  • Cleanup: The source compute service (pf9-ostackhost) destroys the old Libvirt domain and cleans up local storage. nova-conductor sets the VM state back to ACTIVE on the new host.

Procedure

1. Get the VM and Migration Status

Use the CLI to check the error message and the specific migration record.

Look for the fault field in the server show output, and check the status in the migration list.
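A minimal sketch of this check, assuming the OpenStack CLI is installed and admin credentials are sourced. The function name vm_migration_status and the VM UUID argument are placeholders for illustration, not part of the product.

```shell
# Sketch: show a VM's status/fault and its migration records.
# Usage: vm_migration_status <VM_UUID>
vm_migration_status() {
  local vm_uuid="$1"
  # "fault" is populated only when the last operation on the VM failed.
  openstack server show "$vm_uuid" -c status -c fault -f json
  # One row per migration attempt; review the Status and Type columns.
  openstack server migration list --server "$vm_uuid"
}
```

A migration record that stays in an error or failed status, together with a non-empty fault, usually indicates which later step deserves attention first.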

2. Validate Compute & Storage Service Status

Ensure the compute services on both the source and target hosts are up and enabled. If the VM has attached volumes, also confirm that the volume services are up.
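One way to run this check is sketched below, assuming the OpenStack CLI with admin credentials. The function name and the two host arguments are hypothetical placeholders.

```shell
# Sketch: list compute service state for both hosts, then all volume services.
# Usage: check_migration_services <SOURCE_HOST> <TARGET_HOST>
check_migration_services() {
  local src_host="$1" dst_host="$2" host
  for host in "$src_host" "$dst_host"; do
    # Expect Status "enabled" and State "up" for each host.
    openstack compute service list --service nova-compute --host "$host"
  done
  # Only relevant when the VM has attached volumes.
  openstack volume service list
}
```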

3. Trace the VM Events

Retrieve the Request ID (REQ_ID) from the server event list to track the specific migration failure across different component logs.
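The lookup can be sketched as follows, assuming the OpenStack CLI. The function name vm_events is illustrative: with one argument it lists the events of a VM, and with a request ID as the second argument it shows the details of that event.

```shell
# Sketch: list a VM's events, or show one event by request ID.
# Usage: vm_events <VM_UUID> [REQ_ID]
vm_events() {
  local vm_uuid="$1" req_id="${2:-}"
  if [ -z "$req_id" ]; then
    # The Request ID column holds the REQ_ID (req-...) for each action.
    openstack server event list "$vm_uuid"
  else
    # Shows the per-step results recorded for that request.
    openstack server event show "$vm_uuid" "$req_id"
  fi
}
```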

4. Review the Pods and Their Logs on the Management Plane

This step is applicable only for the Self-Hosted Private Cloud Director.

Check the management plane pods to see if the Scheduler failed to find a host or if the Conductor hit an error. Review pod logs using the REQ_ID.
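A rough sketch of the log search, assuming kubectl access to the management cluster. The namespace is deployment-specific and must be supplied by the caller, and the pod name patterns (nova-api-osapi, nova-scheduler, nova-conductor) are assumptions based on the component names used in this guide.

```shell
# Sketch: search the Nova management plane pod logs for a request ID.
# Usage: grep_nova_pod_logs <NAMESPACE> <REQ_ID>
grep_nova_pod_logs() {
  local namespace="$1" req_id="$2" pod
  # Pod name patterns are assumed; adjust them to match your deployment.
  for pod in $(kubectl get pods -n "$namespace" -o name |
               grep -E 'nova-(api-osapi|scheduler|conductor)'); do
    echo "== $pod =="
    kubectl logs -n "$namespace" "$pod" --all-containers | grep -- "$req_id"
  done
}
```

Running `kubectl get pods -n <NAMESPACE>` first is a quick way to confirm that these pods are in the Running state before reading their logs.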

5. Validate Target Host Capacity & Requirements

Check that the destination hypervisor has enough free resources (vCPU, RAM, and disk) for the VM and that its CPU is compatible with that of the source host.
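The comparison can be sketched as below, assuming the OpenStack CLI with admin credentials. Both function names are hypothetical. The flag comparison is only a rough indicator, since the flags exposed to a guest depend on the configured CPU mode and may be a subset of the host flags.

```shell
# Sketch: print hypervisor details for both hosts for side-by-side review.
# Usage: show_hypervisors <SOURCE_HOST> <TARGET_HOST>
show_hypervisors() {
  local host
  for host in "$@"; do
    openstack hypervisor show "$host" -f json
  done
}

# Sketch: print CPU flags present in the first list but absent from the second.
# Usage: missing_cpu_flags "<source flags>" "<target flags>"
missing_cpu_flags() {
  local src_flags="$1" dst_flags="$2" flag
  for flag in $src_flags; do
    case " $dst_flags " in
      *" $flag "*) ;;          # flag exists on the target
      *) echo "$flag" ;;       # flag is missing on the target
    esac
  done
}
```

The flag lists can be collected on each node with `grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2` and passed in as quoted strings. Any flag printed by missing_cpu_flags is a candidate cause for a CPU compatibility failure.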

6. Validate Availability Zones & Host Aggregates

Ensure the destination host belongs to the correct Availability Zone and possesses the required aggregate metadata matching the VM's flavor.
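A minimal sketch for gathering the data needed for this comparison, assuming the OpenStack CLI with admin credentials. The function name and the flavor argument are placeholders.

```shell
# Sketch: show AZ membership, aggregate metadata, and flavor properties.
# Usage: show_az_and_aggregates <FLAVOR>
show_az_and_aggregates() {
  local flavor="$1"
  # Which hosts belong to which Availability Zone.
  openstack availability zone list --compute --long
  # Aggregate names with their metadata (Properties column).
  openstack aggregate list --long
  # Flavor extra specs that the aggregate metadata must satisfy.
  openstack flavor show "$flavor" -c properties
}
```

For the member hosts of a single aggregate, `openstack aggregate show <AGGREGATE>` lists them in the hosts field.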

7. Validate the service status on the Source and Target Hypervisors

Validate that the compute service and the virtualization daemon are running on both compute nodes involved in the migration. If the VM has attached volumes, also verify the Cinder volume service on the node holding the persistent storage role.
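The check below is a sketch to run on each node, assuming systemd-managed services. The unit name pf9-ostackhost comes from this guide. The names libvirtd and pf9-cindervolume-base in the usage example are assumptions and may differ in your environment.

```shell
# Sketch: report whether each given systemd unit is active.
# Usage: check_hypervisor_services <UNIT> [UNIT...]
# Returns non-zero if any unit is not active.
check_hypervisor_services() {
  local unit rc=0
  for unit in "$@"; do
    printf '%s: ' "$unit"
    systemctl is-active "$unit" || rc=1
  done
  return "$rc"
}
```

For example, `check_hypervisor_services pf9-ostackhost libvirtd` on the compute nodes, and `check_hypervisor_services pf9-cindervolume-base` on the storage node.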

8. Check the logs on the Hypervisors

Review the compute logs on the Source and Target compute nodes. Search for the REQ_ID or VM_UUID.

If the VM is volume-backed, locate the storage host to check the Cinder logs.

Find the Storage Node (If VM has volumes):
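A sketch of one way to locate the storage node, assuming the OpenStack CLI with admin credentials (the os-vol-host-attr:host field is visible to administrators only). The function names are illustrative.

```shell
# Sketch: list the hosts running the cinder-volume service and their state.
list_cinder_volume_hosts() {
  openstack volume service list --service cinder-volume -f value -c Host -c State
}

# Sketch: show which backend host owns a specific volume.
# Usage: volume_backend_host <VOLUME_ID>
volume_backend_host() {
  openstack volume show "$1" -c os-vol-host-attr:host -f value
}
```

The host value is usually reported as host@backend (sometimes with a #pool suffix); the part before the @ is the node to inspect.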

Check Logs on respective nodes:
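The search can be sketched as a small helper that takes the pattern and the log files as arguments. The log paths in the usage note are assumptions about a typical installation and are not confirmed by this guide; substitute the paths used on your nodes.

```shell
# Sketch: search one or more log files for a REQ_ID and/or VM_UUID.
# Usage: grep_node_logs '<REQ_ID>|<VM_UUID>' <LOG_FILE> [LOG_FILE...]
grep_node_logs() {
  local pattern="$1" log
  shift
  for log in "$@"; do
    if [ ! -r "$log" ]; then
      echo "skip (not readable): $log" >&2
      continue
    fi
    # -H prints the file name, -n the line number, -E enables the | alternation.
    grep -H -n -E -- "$pattern" "$log"
  done
}
```

For example, on the compute nodes: `grep_node_logs "$REQ_ID|$VM_UUID" /var/log/pf9/ostackhost.log`, and on the storage node: `grep_node_logs "$REQ_ID" /var/log/pf9/cindervolume-base.log` (both paths assumed).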

Most Common Causes

  • Libvirt TLS Communication Failure: Firewalls blocking the secure migration port (TCP 16514) or issues with the TLS certificates (e.g., expired certs, mismatched hostnames, or untrusted CA) preventing the nodes from authenticating each other.

  • Availability Zone / Host Aggregate Mismatch: The destination compute node belongs to a different Availability Zone than the source, or it lacks the specific Host Aggregate metadata required by the VM's flavor or image properties, causing the Nova scheduler to reject the placement.

  • CPU Flag Incompatibility: The destination host has a different or older CPU model than the source host and lacks CPU flags that the running VM depends on, preventing live migration.

  • Insufficient Resources: The target host lacks the required RAM or disk space to accept the incoming VM.

  • Storage/Volume Issues: The target host cannot access the shared storage backend, or cinder-volume (running on the storage node) cannot successfully map the volume via iSCSI/multipath to the destination host.

  • Network Port Binding Failures: Neutron fails to bind the VM's network interface to the OVS/OVN bridge on the new host.
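For the first cause in the list above, the following sketch checks TCP reachability of the Libvirt TLS port and prints the validity dates of the local server certificate. It assumes nc and openssl are installed. The default certificate path is the upstream Libvirt default and is an assumption here; pass the actual path as the second argument if it differs.

```shell
# Sketch: check the Libvirt TLS port on a peer and inspect the local certificate.
# Usage: check_libvirt_tls <PEER_HOST> [CERT_PATH]
check_libvirt_tls() {
  local peer="$1" cert="${2:-/etc/pki/libvirt/servercert.pem}"
  # TCP 16514 is the secure migration port named in this guide.
  if nc -z -w 5 "$peer" 16514; then
    echo "port 16514 on $peer: reachable"
  else
    echo "port 16514 on $peer: NOT reachable"
  fi
  # Subject should match the host name; notAfter shows the expiry date.
  openssl x509 -in "$cert" -noout -subject -issuer -enddate
}
```

Run it from the source node against the target and again in the opposite direction, since both nodes must be able to authenticate each other.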
