# Unable to Start VM due to Multipath issue

## Problem

* VM fails to reboot with error:

```text
multipath -f <DEVICE_ID>: map in use
```

* VM stuck in "spawning" or "rebooting" task state in Nova database
* Error in `/var/log/pf9/ostackhost.log`:

```text
Cannot reboot instance: Command: multipath -f <DEVICE_ID> Exit code: 1 Stderr: '<DEVICE_ID>: map in use'
```

## Environment

* Private Cloud Director Virtualization - v2025.4 and higher
* Private Cloud Director Kubernetes - v2025.4 and higher
* Self-Hosted Private Cloud Director Virtualization - v2025.4 and higher
* Self-Hosted Private Cloud Director Kubernetes - v2025.4 and higher
* Component - iSCSI storage backend

## Cause

The multipath device lock occurs when device-mapper still holds references to a storage volume and cannot release them during VM reboot or shutdown operations. This is typically caused by:

**Accumulated state corruption:** Weeks of failed operations leave behind:

* Stale device-mapper references
* Unclosed iSCSI sessions
* Incomplete volume detach operations
* Orphaned multipath devices

When a reboot is initiated, Nova attempts to disconnect volumes by flushing the multipath device. However, accumulated stale references from previous failed operations prevent the flush, causing the `map in use` error even though no active processes are using the device.

## Diagnostics

#### Step 1: Verify the Multipath Device Lock

```bash
# On the compute node, check multipath status
$ sudo multipath -ll

# Attempt to flush the specific device (will fail with "map in use")
$ sudo multipath -f <DEVICE_ID>

# Check device-mapper status
$ sudo dmsetup ls
$ sudo dmsetup info <DEVICE_ID>

# Check for processes using the device (may show none)
$ sudo lsof /dev/mapper/<DEVICE_ID>
$ sudo fuser -vm /dev/mapper/<DEVICE_ID>
```

**Expected output:**

* `multipath -f` returns exit code 1 with error: `<DEVICE_ID>: map in use`
* `lsof` and `fuser` may show no processes, indicating stale kernel references
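
One way to confirm the stale-reference case is to read the open count that `dmsetup info` reports: a non-zero count while `lsof` and `fuser` list no processes suggests the kernel still holds the map. The sketch below parses that field from standard input. The function names `dm_open_count` and `dm_map_is_held` are hypothetical helpers, not part of any Platform9 tooling, and the sketch assumes the usual `Open count:` line in the output.

```bash
# Hypothetical helpers (not shipped with PCD): parse `dmsetup info` output from stdin.

# Print the value of the "Open count" field.
dm_open_count() {
  awk -F: '/^Open count/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Succeed (exit 0) when the open count is greater than zero.
dm_map_is_held() {
  local count
  count="$(dm_open_count)"
  [ -n "$count" ] && [ "$count" -gt 0 ]
}
```

A possible invocation on the compute node is `sudo dmsetup info <DEVICE_ID> | dm_map_is_held && echo "map is still held open"`.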

#### Step 2: Check VM State in Nova

```bash
# From control plane or with OpenStack credentials
$ openstack server show <VM_ID>

# Check task state (look for stuck states)
$ openstack server show <VM_ID> -f value -c OS-EXT-STS:task_state

# Check VM state
$ openstack server show <VM_ID> -f value -c OS-EXT-STS:vm_state
```

**Expected output:**

* `task_state` may show: `rebooting`, `powering-on`, or similar
* `vm_state` may show: `error`, `stopped`, or mismatch with actual hypervisor state
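
When checking several VMs, it can help to flag task states that indicate an operation still in flight. The following is a minimal sketch: `is_transitional_task_state` is a hypothetical helper, and its list only covers the states named in this article, so extend it for your environment.

```bash
# Hypothetical helper: succeed when a Nova task_state value indicates an
# operation that is still in flight. Only the states named in this article
# are listed; a healthy, idle VM reports "None".
is_transitional_task_state() {
  case "$1" in
    spawning|rebooting|powering-on) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, pass it the output of `openstack server show <VM_ID> -f value -c OS-EXT-STS:task_state`. A state that stays transitional across repeated checks is a candidate for the recovery steps in this article.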

#### Step 3: Check iSCSI Sessions

```bash
# On compute node
$ sudo iscsiadm -m session

# Check for stale sessions related to the volume
$ sudo iscsiadm -m session -P 3 | grep -A 20 <DEVICE_ID>
```

#### Step 4: Review Compute Logs

```bash
# Check for the specific error
$ sudo grep -A 5 "Cannot reboot instance" /var/log/pf9/ostackhost.log

# Check for device cleanup warnings
$ sudo grep "leftovers may remain" /var/log/pf9/ostackhost.log
```
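
If the log contains many of these errors, listing the affected device IDs shows how widespread the problem is. The sketch below assumes the single-line message format quoted in the Problem section; real log entries may be wrapped differently, and `locked_devices_from_log` is a hypothetical helper name.

```bash
# Hypothetical helper: read ostackhost.log lines on stdin and print each
# device ID that appears in a "map in use" error, one per line, de-duplicated.
# Assumes the message format: ... Stderr: '<DEVICE_ID>: map in use'
locked_devices_from_log() {
  sed -n "s/.*Stderr: '\([^':]*\): map in use'.*/\1/p" | sort -u
}
```

A possible invocation is `sudo cat /var/log/pf9/ostackhost.log | locked_devices_from_log`.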

## Resolution

#### Method 1: Quick Recovery - VM Rebuild

**Use when:** Service restoration is the priority, time is limited, or the environment is production

**Time:** 15-20 minutes

**Steps:**

1. **Document current VM configuration:**

   ```bash
   $ openstack server show <VM_ID> -f json > vm-config-backup.json
   $ openstack server volume list <VM_ID> > vm-volumes-backup.txt
   $ openstack port list --server <VM_ID> > vm-ports-backup.txt
   ```
2. **Verify volume preservation flag:**

   ```bash
   # Check that delete_on_termination is false for each attached volume
   $ openstack server show <VM_ID>

   # If needed, update the flag
   $ openstack server volume set --preserve-on-termination <VM_ID> <VOLUME_ID>
   ```
3. **Create a volume snapshot through the UI or with the following commands:**

   ```bash
   # For each attached volume (add --force if the snapshot is rejected because the volume is in use)
   $ openstack volume snapshot create --volume <VOLUME_ID> \
     "snapshot-<vm-name>-$(date +%Y%m%d-%H%M%S)"

   # Wait for snapshot to complete
   $ openstack volume snapshot list --volume <VOLUME_ID>
   ```
4. **Delete the VM (preserving volumes):**

   ```bash
   $ openstack server delete <VM_ID>

   # Wait for deletion to complete (30-60 seconds)
   $ openstack server list | grep <VM_ID>

   # Verify volumes still exist
   $ openstack volume list | grep <VOLUME_ID>
   ```
5. **Clean up multipath device on compute node:**

   ```bash
   # SSH to the compute node, reload the multipath maps, then flush the stale device
   $ sudo multipath -r <DEVICE_ID>
   $ sudo multipath -f <DEVICE_ID>

   # Verify device is removed
   $ sudo multipath -ll | grep <DEVICE_ID>
   ```
6. **Recreate VM with same configuration:**

   ```bash
   # Use saved configuration from step 1
   $ openstack server create \
     --flavor <FLAVOR_ID> \
     --volume <VOLUME_ID> \
     --nic port-id=<PORT_ID> \
     --availability-zone <AZ> \
     <VM_NAME>

   # Monitor creation
   $ openstack server show <NEW_VM_ID>
   ```
7. **Verify VM is running:**

   ```bash
   $ openstack server show <NEW_VM_ID>

   # Test connectivity
   $ ping <vm-ip>
   $ ssh <vm-ip>
   ```
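
Several steps above involve waiting, for example for the VM deletion to finish. Instead of re-running the check by hand, a small polling loop can do the waiting. This is a sketch under assumptions: `wait_until` and `server_is_gone` are hypothetical helper names, and the timeout and interval values are illustrative.

```bash
# Hypothetical helper: run a command repeatedly until it succeeds.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Returns 0 once the command succeeds, 1 if the timeout expires first.
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Example condition: succeeds once `openstack server show` fails for the ID.
# Caution: it also succeeds if the CLI fails for another reason, such as
# expired credentials, so confirm with `openstack server list` afterwards.
server_is_gone() {
  ! openstack server show "$1" > /dev/null 2>&1
}
```

With these definitions, `wait_until 60 5 server_is_gone <VM_ID>` polls every 5 seconds for up to 60 seconds, matching the deletion window mentioned in step 4.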

#### Method 2: Manual Recovery - Multipath Cleanup (For Investigation/Non-Production)

**Use when:** Time is available for investigation, or the VM ID must be preserved

**Time:** 45-60 minutes

**Steps:**

1. **Stop the VM:**

   ```bash
   $ openstack server stop <VM_ID>
   ```
2. **On compute node, identify the stuck device:**

   ```bash
   $ sudo multipath -ll
   $ sudo ls -la /dev/disk/by-id/ | grep <VOLUME_WWN>
   $ sudo dmsetup ls
   ```
3. **Check what's using the device:**

   ```bash
   $ sudo lsof /dev/mapper/<DEVICE_ID>
   $ sudo fuser -vm /dev/mapper/<DEVICE_ID>
   $ mount | grep <DEVICE_ID>
   ```
4. **Force stop VM in libvirt (if running):**

   ```bash
   # libvirt identifies the domain by the Nova VM ID (its UUID), not the display name
   $ sudo virsh list --all --uuid | grep <VM_ID>
   $ sudo virsh destroy <VM_ID>
   ```
5. **Remove device-mapper device:**

   ```bash
   # Try normal remove first
   $ sudo dmsetup remove <DEVICE_NAME>

   # If fails, force remove
   $ sudo dmsetup remove --force <DEVICE_NAME>

   # If --force fails with "Device or resource busy" and the device has no consumers or backing paths, restart multipathd
   $ sudo systemctl restart multipathd
   ```
6. **Flush multipath device:**

   ```bash
   # Try reload first
   $ sudo multipath -r <DEVICE_ID>

   # If reload doesn't work, try flush
   $ sudo multipath -f <DEVICE_ID>

   # Verify removal
   $ sudo multipath -ll | grep <DEVICE_ID>
   ```
7. **Rescan SCSI/iSCSI:**

   ```bash
   # Rescan SCSI hosts (writing to sysfs requires root)
   $ for host in /sys/class/scsi_host/host*; do
       echo "- - -" | sudo tee "$host/scan" > /dev/null
     done

   # Rescan iSCSI sessions
   $ sudo iscsiadm -m session --rescan
   ```
8. **Restart compute service:**

   ```bash
   $ sudo systemctl restart pf9-ostackhost

   # Verify service is running
   $ sudo systemctl status pf9-ostackhost
   ```
9. **Start the VM:**

   ```bash
   $ openstack server start <VM_ID>

   # Monitor startup
   $ openstack server show <VM_ID>
   ```

## Validation

* Before starting the VM, verify multipath cleanup:

```bash
# On compute node
$ sudo multipath -ll | grep <DEVICE_ID>
# Should return no results

$ sudo dmsetup ls | grep <DEVICE_ID>
# Should return no results

$ sudo lsof | grep <DEVICE_ID>
# Should return no results
```

* Once the VM is started, verify VM status:

```bash
$ openstack server show <VM_ID>
# Should show: status=ACTIVE, power_state=Running, task_state=None

$ openstack server list --all | grep <VM_NAME>
# Should show: Status=ACTIVE

# Test connectivity
$ ping <VM_IP>
$ ssh <VM_IP>
```

* Verify the compute service:

```bash
$ openstack compute service list
# Should show: State=up, Status=enabled
```

* Check for errors:

```bash
# On compute node
$ sudo tail -100 /var/log/pf9/ostackhost.log | grep -i error
# Should show no recent multipath or device-mapper errors
```
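
The "should return no results" checks for multipath cleanup can be turned into a single pass/fail test. The sketch below is one way to do that: `device_absent` is a hypothetical helper that reads any listing on standard input and succeeds only when the device ID does not appear in it.

```bash
# Hypothetical helper: read a listing on stdin and succeed only when no line
# mentions the given device ID (fixed-string match, so the ID is not treated
# as a regular expression). Empty input also counts as "absent".
device_absent() {
  ! grep -qF -- "$1"
}
```

For example, `sudo multipath -ll | device_absent <DEVICE_ID> && sudo dmsetup ls | device_absent <DEVICE_ID> && echo "cleanup verified"` prints the message only when both listings are clean.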

## Additional Information

**Regular Health Checks:**

```bash
# Check for VMs in ERROR state
$ openstack server list --status ERROR

# Check multipath health on compute nodes
$ sudo multipath -ll | grep -i fail
```

