VM Fails to Start After Live Migration Failure Due to iSCSI BDM Corruption
Problem
After a live migration fails on a host using iSCSI storage (NetApp ONTAP), the affected VM remains in an error state and cannot be started or hard-rebooted. The VM was running normally before the migration was initiated. Symptoms after the failed migration include:
The VM shows status: ERROR or vm_state: error in PCD
Hard reboot attempts fail without the VM coming back online
The VM's volume remains attached in the platform but the instance cannot mount it on start
virsh list --all on the source compute host shows the VM as absent or in a failed domain state
This article covers the recovery process for the Block Device Mapping (BDM) corruption that occurs as a consequence of the failed migration. If libvirtd on the compute host is also in a failed state and virsh commands are unresponsive, resolve that condition first by following Virsh Commands Run Infinitely Without Output / VM Console Inaccessible before proceeding with the steps below.
Environment
Private Cloud Director Virtualization - All versions
Self-Hosted Private Cloud Director Virtualization - All versions
Component: Compute Service, iSCSI Storage (NetApp ONTAP)
Cause
When a live migration is initiated, the Compute Service runs pre_live_migration on the destination host. This step maps the volume to the destination host's igroup on the NetApp array and updates the Block Device Mapping (BDM) record in the Nova database with the destination LUN ID.
If the migration fails after pre_live_migration runs but before the rollback completes, two problems persist:
BDM has the wrong LUN ID: The BDM connection_info field was updated with the destination LUN ID. The source host's igroup has a different LUN ID assigned for the same volume. The Compute Service uses the BDM to attach the disk when starting the VM, so the mismatch prevents the VM from mounting its boot volume.
Volume is dual-mapped on the NetApp array: The volume remains mapped to the destination host's igroup even though the migration never completed.
Diagnostics
Step 2 — Check the BDM for the wrong LUN ID
Connect to the Nova database and retrieve the connection_info for the volume. The target_lun field is the LUN ID the Compute Service will use when attaching the volume to the VM.
Self-Hosted only: Database access requires access to the MySQL pods running in the control plane. SaaS customers cannot perform this step — contact Platform9 Support to check the BDM.
Open a MySQL session on the Nova database using the following commands on the control plane node. Replace <REGION_NS>, <CUSTOMER_ID>, and <REGION_UUID> with the values for the affected region:
At the MySQL [nova]> prompt, run the following query:
Note the current_lun value. This will be compared against the source host's igroup LUN ID in Step 3 to confirm whether the BDM is corrupted.
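As an illustration of what is being read in this step, the following minimal Python sketch pulls the LUN fields out of a connection_info value. It assumes that connection_info is stored as JSON text with target_lun and target_luns under a top-level data key. The function name read_bdm_luns and the sample value are hypothetical and are not taken from a real deployment.

```python
import json
from typing import List, Optional, Tuple


def read_bdm_luns(connection_info: str) -> Tuple[Optional[int], List[int]]:
    """Return (target_lun, target_luns) from a connection_info JSON string.

    Assumption: both fields live under the top-level "data" key.
    """
    data = json.loads(connection_info).get("data", {})
    return data.get("target_lun"), list(data.get("target_luns", []))


# Hypothetical sample value, for illustration only.
sample = json.dumps({
    "driver_volume_type": "iscsi",
    "data": {"target_lun": 3, "target_luns": [3, 3, 3, 3]},
})

lun, luns = read_bdm_luns(sample)
print("target_lun:", lun)
print("target_luns:", luns)
```

The value returned for target_lun is the one to compare against the source host's igroup LUN ID in Step 3.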
Step 3 — Find the correct LUN ID from the NetApp array
Navigate to the NetApp ONTAP System Manager UI and locate the LUN for the affected volume.
Go to Storage → LUNs and find the LUN by name or path (the volume UUID is typically part of the LUN path)
Click into the LUN and open the LUN Mappings tab
Review the list of igroups the LUN is mapped to
The expected state after a failed migration is that the LUN appears in two igroup entries — one for the source host and one for the destination host. Each igroup will have a different LUN ID assigned.
Note:
The source host's igroup LUN ID: this is the correct value and must match the BDM
The destination host's igroup LUN ID: this is the stale entry to be removed in Resolution Step 2
If the source host's igroup is not listed at all, the mapping was removed by a failed hard reboot attempt. Proceed to Step 4 before fixing the BDM.
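The decision logic in this step can be sketched as a small Python function. This is only an illustration of the checks described above, not a Platform9 or NetApp tool. The function name classify_lun_state and the igroup names in the usage example are hypothetical, and the mappings are assumed to have been copied by hand from the LUN Mappings tab.

```python
from typing import Dict, List


def classify_lun_state(
    mappings: Dict[str, int],
    source_igroup: str,
    dest_igroup: str,
    bdm_lun: int,
) -> List[str]:
    """Compare igroup mappings (igroup name -> LUN ID) with the BDM LUN ID."""
    findings = []
    if source_igroup not in mappings:
        # Matches the case where a failed hard reboot removed the mapping.
        findings.append("source igroup mapping missing: go to Step 4 first")
    elif mappings[source_igroup] != bdm_lun:
        findings.append(
            f"BDM LUN ID {bdm_lun} does not match "
            f"source LUN ID {mappings[source_igroup]}"
        )
    if dest_igroup in mappings:
        findings.append("stale destination igroup mapping present")
    return findings


# Hypothetical igroup names and LUN IDs, for illustration only.
for finding in classify_lun_state(
    {"igroup-src": 1, "igroup-dst": 4}, "igroup-src", "igroup-dst", bdm_lun=4
):
    print(finding)
```

An empty result means the BDM already matches the source igroup and no destination mapping is left behind.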
Step 4 — Re-add the source igroup mapping if missing
Skip this step if the source host's igroup is still listed on the LUN.
If the source igroup mapping is absent:
In the NetApp ONTAP System Manager UI, go to Storage → LUNs, open the LUN, and click the LUN Mappings tab
Click Map and select the source host's igroup
After saving, note the LUN ID assigned to the source host's igroup — NetApp assigns this automatically
Return to the LUN Mappings tab and confirm the source igroup now appears with its assigned LUN ID
After re-mapping, rescan iSCSI on the source host to pick up the new path:
Confirm the LUN appears under the correct dm device and that paths show active ready rather than failed faulty.
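The path-state check can be sketched in Python as follows. The sketch assumes the path listing was captured as text from a command such as multipath -ll, and it only counts lines containing the two states named above. The sample text is illustrative, with a placeholder in place of the WWID, and the exact layout varies between multipath-tools versions. The name count_path_states is hypothetical.

```python
from typing import Dict


def count_path_states(multipath_output: str) -> Dict[str, int]:
    """Count path lines reporting 'active ready' and 'failed faulty'."""
    counts = {"active ready": 0, "failed faulty": 0}
    for line in multipath_output.splitlines():
        for state in counts:
            if state in line:
                counts[state] += 1
    return counts


# Illustrative sample only; <wwid> is a placeholder, not a real identifier.
SAMPLE = """\
<wwid> dm-3 NETAPP,LUN C-Mode
|-+- policy='service-time 0' prio=50 status=active
| |- 3:0:0:1 sdb 8:16 active ready running
| |- 4:0:0:1 sdc 8:32 active ready running
|-+- policy='service-time 0' prio=10 status=enabled
  |- 5:0:0:1 sdd 8:48 active ready running
  |- 6:0:0:1 sde 8:64 failed faulty running
"""

states = count_path_states(SAMPLE)
print(states)
if states["failed faulty"] > 0:
    print("At least one path is still failed; rescan or investigate.")
```

Any non-zero count for failed faulty means the paths have not fully recovered after the rescan.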
Resolution
Fix the BDM and Restore VM Operation
Step 1 — Update the BDM with the correct LUN ID
Using the source host's igroup LUN ID from Diagnostics Step 3, update the connection_info in the Nova database. The target_luns array has one entry per iSCSI path — set all entries to the same LUN ID (NetApp typically presents 4 paths per volume).
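To make the intended end state concrete, here is a minimal Python sketch of the change applied to a connection_info value. It is not the database UPDATE itself. It assumes connection_info is JSON text with target_lun and target_luns under a top-level data key. The function name set_bdm_lun and the sample values are hypothetical.

```python
import json


def set_bdm_lun(connection_info: str, source_lun: int) -> str:
    """Return connection_info with every LUN field set to source_lun."""
    info = json.loads(connection_info)
    data = info["data"]
    data["target_lun"] = source_lun
    if "target_luns" in data:
        # One entry per iSCSI path: keep the path count, change only the ID.
        data["target_luns"] = [source_lun] * len(data["target_luns"])
    return json.dumps(info)


# Hypothetical corrupted value: the BDM holds the destination LUN ID 4.
corrupted = json.dumps({
    "driver_volume_type": "iscsi",
    "data": {"target_lun": 4, "target_luns": [4, 4, 4, 4]},
})

fixed = json.loads(set_bdm_lun(corrupted, source_lun=1))
print(fixed["data"]["target_lun"], fixed["data"]["target_luns"])
```

The sketch leaves every other field untouched, which mirrors the goal of the correction: only the LUN ID changes, and it changes identically for every path.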
Self-Hosted only: Database write access requires access to the MySQL pods running in the control plane. SaaS customers cannot perform this step — contact Platform9 Support to apply the BDM correction.
Open a MySQL session using the same commands from Diagnostics Step 2 before running the UPDATE below.
Always verify the id, instance_uuid, and volume_id values from the SELECT query in Diagnostics Step 2 before running the UPDATE. An incorrect UPDATE to the BDM can prevent the VM from attaching any volume.
Verify both fields updated correctly:
Both bdm_lun and all values in bdm_luns must match the source host's LUN ID.
Step 2 — Remove the stale destination igroup mapping
Remove the volume from the destination host's igroup on the NetApp array. This is the stale mapping left behind by the failed migration.
In the NetApp ONTAP System Manager UI:
Go to Storage → LUNs, open the LUN, and click the LUN Mappings tab
Find the destination host's igroup entry
Click Unmap to remove it
Until this mapping is removed, the volume is accessible from two compute hosts simultaneously. This is unsafe for non-cluster-aware filesystems and must be cleaned up before the VM is restarted.
After unmapping, confirm only the source host's igroup remains in the LUN Mappings tab.
Step 3 — Reset the VM state and hard reboot
Reset the VM from the error state to active, then perform a hard reboot to trigger volume attachment with the corrected BDM. The reset procedure is described in How To Reset Volume States.
Additional Information
Related articles:
Virsh Commands Run Infinitely Without Output / VM Console Inaccessible: covers the libvirtd failure and stale multipath recovery that commonly precedes the BDM corruption scenario described in this article. Resolve that issue first if virsh commands are unresponsive on the affected compute host.
How To Reset Volume States: covers resetting volumes stuck in intermediate states if the volume shows an error state in addition to the VM.