VM Fails to Start After Live Migration Failure Due to iSCSI BDM Corruption

Problem

After a live migration fails on a host using iSCSI storage (NetApp ONTAP), the affected VM remains in an error state and cannot be started or hard-rebooted. The VM was running normally before the migration was initiated. Symptoms after the failed migration include:

  • The VM shows status: ERROR or vm_state: error in Private Cloud Director (PCD)

  • Hard reboot attempts fail and the VM does not come back online

  • The VM's volume remains attached in the platform but the instance cannot mount it on start

  • virsh list --all on the source compute host shows the VM as absent or in a failed domain state

This article covers the recovery process for the Block Device Mapping (BDM) corruption that occurs as a consequence of the failed migration. If libvirtd on the compute host is also in a failed state and virsh commands are unresponsive, resolve that condition first by following Virsh Commands Run Infinitely Without Output / VM Console Inaccessible before proceeding with the steps below.

Environment

  • Private Cloud Director Virtualization - All versions

  • Self-Hosted Private Cloud Director Virtualization - All versions

  • Component: Compute Service, iSCSI Storage (NetApp ONTAP)

Cause

When a live migration is initiated, the Compute Service runs pre_live_migration on the destination host. This step maps the volume to the destination host's igroup on the NetApp array and updates the Block Device Mapping (BDM) record in the Nova database with the destination LUN ID.

If the migration fails after pre_live_migration runs but before the rollback completes, two problems persist:

  1. BDM has the wrong LUN ID — The BDM connection_info field was updated with the destination LUN ID. The source host's igroup has a different LUN ID assigned for the same volume. The Compute Service uses the BDM to attach the disk when starting the VM, so the mismatch prevents the VM from mounting its boot volume.

  2. Volume is dual-mapped on the NetApp array — The volume remains mapped to the destination host's igroup even though the migration never completed.

Diagnostics

Step 1 — Confirm the VM state and identify the attached volume

Retrieve the VM state and the UUID of the affected volume.

Note the [VOLUME_UUID] for use in the following steps.
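
As a minimal sketch of this step, assume the server details were saved as JSON, for example from `openstack server show [VM_UUID] -f json`. The `[VM_UUID]` placeholder, the helper name `parse_server_show`, and the assumption that the output exposes `status` and a `volumes_attached` list of objects with an `id` key are illustrative and may differ by client version.

```python
import json


def parse_server_show(raw_json):
    """Return (status, [volume UUIDs]) from JSON server details.

    Assumes a top-level "status" field and a "volumes_attached" list of
    objects carrying an "id" key; adjust if your client formats it differently.
    """
    server = json.loads(raw_json)
    status = server.get("status")
    attached = server.get("volumes_attached") or []
    volume_ids = [v["id"] for v in attached if isinstance(v, dict) and "id" in v]
    return status, volume_ids


# Illustrative input only; the UUID below is made up.
sample = json.dumps({
    "status": "ERROR",
    "volumes_attached": [{"id": "11111111-2222-3333-4444-555555555555"}],
})
status, volume_ids = parse_server_show(sample)
print(status, volume_ids)
```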

Step 2 — Check the BDM for the wrong LUN ID

Connect to the Nova database and retrieve the connection_info for the volume. The target_lun field is the LUN ID the Compute Service will use when attaching the volume to the VM.

Self-Hosted only: Database access requires access to the MySQL pods running in the control plane. SaaS customers cannot perform this step — contact Platform9 Support to check the BDM.

Open a MySQL session on the Nova database using the following commands on the control plane node. Replace <REGION_NS>, <CUSTOMER_ID>, and <REGION_UUID> with the values for the affected region:

At the MySQL [nova]> prompt, run the following query:

Note the current_lun value. This will be compared against the source host's igroup LUN ID in Step 3 to confirm whether the BDM is corrupted.
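
To read the LUN values out of a `connection_info` string copied from the database, a small helper like the following may be useful. It is a sketch that assumes the usual Nova iSCSI layout, with `target_lun` and `target_luns` nested under a `data` object; the function name `bdm_luns` and the sample values are hypothetical.

```python
import json


def bdm_luns(connection_info):
    """Return (target_lun, target_luns) from a BDM connection_info JSON string."""
    # Assumption: the LUN fields live under the "data" key.
    data = json.loads(connection_info).get("data", {})
    return data.get("target_lun"), list(data.get("target_luns") or [])


# Illustrative record only; real records carry more fields (IQNs, portals, ...).
sample_connection_info = json.dumps({
    "driver_volume_type": "iscsi",
    "data": {"target_lun": 3, "target_luns": [3, 3, 3, 3]},
})
current_lun, per_path_luns = bdm_luns(sample_connection_info)
print(current_lun, per_path_luns)
```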

Step 3 — Find the correct LUN ID from the NetApp array

Navigate to the NetApp ONTAP System Manager UI and locate the LUN for the affected volume.

  1. Go to Storage → LUNs and find the LUN by name or path (the volume UUID is typically part of the LUN path)

  2. Click into the LUN and open the LUN Mappings tab

  3. Review the list of igroups the LUN is mapped to

The expected state after a failed migration is that the LUN appears in two igroup entries — one for the source host and one for the destination host. Each igroup will have a different LUN ID assigned.

Note:

  • The source host's igroup LUN ID: this is the correct value and must match the BDM

  • The destination host's igroup LUN ID: this is the stale entry to be removed in Resolution Step 2

If the source host's igroup is not listed at all, the mapping was removed by a failed hard reboot attempt. Proceed to Step 4 before fixing the BDM.
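
The decision logic of this step can be summarised in a short sketch. It takes the igroup-to-LUN-ID pairs read from the LUN Mappings tab and the LUN ID found in the BDM, and reports which follow-up steps apply. The function name, the finding labels, and the igroup names are hypothetical and used only for illustration.

```python
def classify_lun_state(mappings, source_igroup, dest_igroup, bdm_lun):
    """Return a list of findings for a LUN after a failed live migration.

    mappings: dict of igroup name -> LUN ID, as shown in the LUN Mappings tab.
    bdm_lun:  the LUN ID currently stored in the BDM connection_info.
    """
    findings = []
    if source_igroup not in mappings:
        # Source mapping is gone: re-add it first (Diagnostics Step 4).
        findings.append("source_mapping_missing")
    elif mappings[source_igroup] != bdm_lun:
        # BDM points at a LUN ID the source host does not have (Resolution Step 1).
        findings.append("bdm_mismatch")
    if dest_igroup in mappings:
        # Mapping left behind by the failed migration (Resolution Step 2).
        findings.append("stale_destination_mapping")
    return findings


# Made-up example: the BDM holds the destination LUN ID (3) instead of the source one (1).
example = classify_lun_state({"igroup-src": 1, "igroup-dst": 3}, "igroup-src", "igroup-dst", 3)
print(example)
```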

Step 4 — Re-add the source igroup mapping if missing

Skip this step if the source host's igroup is still listed on the LUN.

If the source igroup mapping is absent:

  1. In the NetApp ONTAP System Manager UI, go to Storage → LUNs, open the LUN, and click the LUN Mappings tab

  2. Click Map and select the source host's igroup

  3. After saving, note the LUN ID assigned to the source host's igroup — NetApp assigns this automatically

  4. Return to the LUN Mappings tab and confirm the source igroup now appears with its assigned LUN ID

After re-mapping, rescan iSCSI on the source host to pick up the new path:

Confirm the LUN appears under the correct dm device and that paths show active ready rather than failed faulty.
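
If the path check is done from saved `multipath -ll` output, a rough parser such as the one below can flag paths that are not `active ready`. This is a sketch under the assumption that path lines follow the common `H:C:T:L sdX major:minor dm_state checker_state online_state` layout; the sample output, WWID, and device names are made up.

```python
import re

# Matches path lines such as "| |- 3:0:0:1 sdb 8:16 active ready running".
PATH_LINE = re.compile(
    r"\b(\d+:\d+:\d+:\d+)\s+(sd\w+)\s+\d+:\d+\s+(\w+)\s+(\w+)\s+(\w+)"
)


def path_states(multipath_output):
    """Return [(device, dm_state, checker_state)] for every path line found."""
    paths = []
    for line in multipath_output.splitlines():
        match = PATH_LINE.search(line)
        if match:
            _hctl, device, dm_state, checker_state, _online = match.groups()
            paths.append((device, dm_state, checker_state))
    return paths


def faulty_paths(multipath_output):
    """Return devices whose state is anything other than 'active ready'."""
    return [dev for dev, dm, chk in path_states(multipath_output)
            if (dm, chk) != ("active", "ready")]


# Made-up sample in the usual multipath -ll layout.
sample_output = """\
3600a098038300000000000000000abcd dm-3 NETAPP,LUN C-Mode
size=50G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 3:0:0:1 sdb 8:16 active ready running
| `- 4:0:0:1 sdc 8:32 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 5:0:0:1 sdd 8:48 failed faulty running
  `- 6:0:0:1 sde 8:64 active ready running
"""
print(faulty_paths(sample_output))
```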

Resolution

Fix the BDM and Restore VM Operation

Step 1 — Update the BDM with the correct LUN ID

Using the source host's igroup LUN ID from Diagnostics Step 3, update the connection_info in the Nova database. The target_luns array has one entry per iSCSI path — set all entries to the same LUN ID (NetApp typically presents 4 paths per volume).

Self-Hosted only: Database write access requires access to the MySQL pods running in the control plane. SaaS customers cannot perform this step — contact Platform9 Support to apply the BDM correction.

Open a MySQL session using the same commands from Diagnostics Step 2 before running the UPDATE below.

Verify both fields updated correctly:

Both bdm_lun and all values in bdm_luns must match the source host's LUN ID.
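
The intended change to `connection_info` can be illustrated offline before touching the database. The sketch below assumes the LUN fields are nested under a `data` object; it sets `target_lun` and every entry of `target_luns` to the source host's LUN ID, then applies the same check described above. The helper names and sample values are hypothetical, and the code does not write to the database.

```python
import json


def with_corrected_lun(connection_info, source_lun):
    """Return connection_info JSON with all LUN fields set to source_lun."""
    info = json.loads(connection_info)
    data = info["data"]  # assumption: LUN fields live under "data"
    data["target_lun"] = source_lun
    path_count = len(data.get("target_luns") or [])
    if path_count:
        # One entry per iSCSI path; all must carry the same LUN ID.
        data["target_luns"] = [source_lun] * path_count
    return json.dumps(info)


def lun_fields_match(connection_info, source_lun):
    """True if target_lun and every target_luns entry equal source_lun."""
    data = json.loads(connection_info)["data"]
    per_path = data.get("target_luns") or []
    return data.get("target_lun") == source_lun and all(
        lun == source_lun for lun in per_path
    )


# Made-up record holding the destination LUN ID (3); the source LUN ID is 1.
corrupted = json.dumps({
    "driver_volume_type": "iscsi",
    "data": {"target_lun": 3, "target_luns": [3, 3, 3, 3]},
})
fixed = with_corrected_lun(corrupted, 1)
print(lun_fields_match(corrupted, 1), lun_fields_match(fixed, 1))
```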

Step 2 — Remove the stale destination igroup mapping

Remove the volume from the destination host's igroup on the NetApp array. This is the stale mapping left behind by the failed migration.

In the NetApp ONTAP System Manager UI:

  1. Go to Storage → LUNs, open the LUN, and click the LUN Mappings tab

  2. Find the destination host's igroup entry

  3. Click Unmap to remove it

After unmapping, confirm only the source host's igroup remains in the LUN Mappings tab.

Step 3 — Reset the VM state and hard reboot

Reset the VM from the error state to active, then perform a hard reboot to trigger volume attachment with the corrected BDM, as described in How To Reset Volume States.

Step 4 — Confirm the VM is running

Monitor the VM state until it reaches ACTIVE.
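
Monitoring can be scripted as a simple polling loop. The sketch below is deliberately generic: `get_status` is a hypothetical callable that you would supply (for example, one that queries the Compute Service for the VM status), and the timeout and interval values are arbitrary defaults rather than recommendations.

```python
import time


def wait_for_active(get_status, timeout_s=300, interval_s=5, sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE (True) or ERROR (False).

    Raises TimeoutError if neither state is reached within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "ACTIVE":
            return True
        if status == "ERROR":
            return False
        if time.monotonic() >= deadline:
            raise TimeoutError(f"VM still in state {status!r} after {timeout_s}s")
        sleep(interval_s)


# Demonstration with canned statuses and a no-op sleep, so nothing is queried.
canned = iter(["HARD_REBOOT", "HARD_REBOOT", "ACTIVE"])
print(wait_for_active(lambda: next(canned), sleep=lambda _s: None))
```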

Confirm the volume is attached and the domain is visible on the source compute host:
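
For the domain check, one option is to parse the text printed by `virsh list --all` on the source compute host. The sketch below assumes the standard three-column layout (Id, Name, State); the domain names in the sample are made up, and the libvirt domain name of a VM is assumed to be available from the server's `OS-EXT-SRV-ATTR:instance_name` attribute.

```python
def domain_state(virsh_list_output, domain_name):
    """Return the State column for domain_name, or None if it is not listed."""
    for line in virsh_list_output.splitlines():
        # Split into at most three fields so states like "shut off" stay intact.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] == domain_name:
            return parts[2].strip()
    return None


# Made-up sample in the usual `virsh list --all` layout.
sample_list = """\
 Id   Name                State
-----------------------------------
 1    instance-0000002a   running
 -    instance-0000002b   shut off
"""
print(domain_state(sample_list, "instance-0000002a"))
```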
