# Virtual Machine Deletion Issues

## Problem

This guide provides step-by-step instructions for troubleshooting and resolving issues when deleting a virtual machine (VM) fails or hangs in Private Cloud Director.

## Environment

* Private Cloud Director Virtualization
* Self-Hosted Private Cloud Director Virtualization

## Key Concept: Ephemeral vs. Volume-Booted Instances

Before troubleshooting a deletion failure, it is critical to identify how the VM's storage was deployed, as this dictates the deletion workflow:

* **Ephemeral Instances**: Booted from an image directly onto the compute node's local disk. The OS disk and local data are automatically destroyed by the hypervisor during a standard deletion.
* **Volume-Booted Instances**: Booted from a persistent Cinder volume. The volume's fate depends on the "Delete on Terminate" flag set during creation. If checked, PCD destroys the volume; if unchecked, the volume is simply detached and marked as `available`.
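
One way to tell the two apart from the CLI is to inspect the `image` field of `openstack server show`. The following is a minimal sketch, assuming the client reports an empty value or a "booted from volume" note for volume-booted instances (this varies between client versions, so verify it in your environment). `boot_source` is a hypothetical helper name, not a PCD command.

```bash
# Hypothetical helper: guess how a VM was booted from its 'image' field.
# Assumption: volume-booted instances report an empty image or a
# "booted from volume" note; confirm this against your client version.
boot_source() {
  local vm_uuid="$1" image
  image=$(openstack server show "$vm_uuid" -f value -c image) || return 1
  case "$image" in
    ""|*"booted from volume"*) echo "volume-booted" ;;
    *) echo "ephemeral (image: $image)" ;;
  esac
}
```

Call it as `boot_source <VM_UUID>` before choosing a deletion workflow.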

## VM Deletion Methods & Examples

Understanding how a VM was requested to be deleted is crucial for tracing where the failure occurred.

### 1. Standard Delete

Gracefully shuts down and removes the instance, releasing its ephemeral storage and network ports.

```bash
$ openstack server delete <VM_UUID>
# To block the CLI prompt until the deletion is fully complete:
$ openstack server delete --wait <VM_UUID>
```

### 2. Force Delete (For Stuck Instances)

Bypasses standard graceful shutdown procedures to forcefully terminate an instance that is stuck in a locked task state (e.g., `deleting` or `migrating`).

```bash
# Step 1: Break the task lock by forcing the instance into an 'error' state
$ openstack server set --state error <VM_UUID>

# Step 2: Issue the standard delete now that the lock is cleared
$ openstack server delete <VM_UUID>
```

### 3. Handling Persistent Volumes During Deletion

If the VM has attached volumes, you must verify the requested outcome for the data (Retain vs. Destroy).

Execution (If data MUST be destroyed):

```bash
# Step 1: Delete the VM
$ openstack server delete <VM_UUID>

# Step 2: Verify the volume is detached and in an 'available' state
$ openstack volume list | grep <VOLUME_UUID>

# Step 3: Delete the persistent volume (ONLY if required)
$ openstack volume delete <VOLUME_UUID>
```
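
Steps 2 and 3 above can be combined so that the volume is deleted only after it has actually reached `available`. This is a sketch under that assumption; `delete_volume_if_available` is a hypothetical helper name.

```bash
# Hypothetical helper: delete a volume only once it is detached ('available').
delete_volume_if_available() {
  local volume_uuid="$1" status
  status=$(openstack volume show "$volume_uuid" -f value -c status) || return 1
  if [ "$status" = "available" ]; then
    openstack volume delete "$volume_uuid"
  else
    # Refuse to delete while the volume is still attached or in transition.
    echo "Volume $volume_uuid is '$status'; not deleting." >&2
    return 1
  fi
}
```

Run `delete_volume_if_available <VOLUME_UUID>` after the VM deletion has completed. A non-zero exit code means the volume was left untouched.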

## Deep Dive

The Private Cloud Director VM deletion process requires tight coordination between Compute, Networking and Storage services. A failure in any of these handoffs can result in a VM getting stuck in a `deleting` task state or leaving behind "orphaned" resources.

{% hint style="warning" %}
The logs can only be reviewed in Self-Hosted Private Cloud Director.
{% endhint %}

### Step 1: User Request & API Validation

The deletion request is authenticated by Keystone and received by the Nova API.

* State Check: Nova API verifies the VM exists and updates the database to mark the `task_state` as `deleting`.

  ```bash
  $ kubectl logs deployment/nova-api-osapi -n <WORKLOAD_REGION> | grep "DELETE /v2.1"
  INFO nova.osapi_compute.wsgi.server [None [REQ_ID] [USER_ID] [TENANT_ID] - - default default] [IP] "DELETE /v2.1/[tenant_id]/servers/[VM_UUID] HTTP/1.1" status: 204 len: [.] time: [.]
  ```

{% hint style="info" %}
A unique `REQ_ID` is generated at this step. Use it to track the request in the logs of the other components.
{% endhint %}
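
To pull the `REQ_ID` out of the API log lines, a small filter can help. This is a minimal sketch, assuming the request IDs follow the usual OpenStack `req-<UUID>` form; `extract_req_ids` is a hypothetical helper name.

```bash
# Hypothetical helper: print the unique request IDs found in log text on stdin.
# Assumption: request IDs follow the usual 'req-<UUID>' form.
extract_req_ids() {
  grep -oE 'req-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u
}
```

For example, pipe the `kubectl logs ... | grep "DELETE /v2.1"` output shown above through `grep <VM_UUID> | extract_req_ids`.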

### Step 2: Compute Node & Hypervisor Cleanup

The `pf9-ostackhost` service on the host running the VM receives the RPC message to destroy the instance.

* Libvirt Teardown: `nova-compute` instructs Libvirt to power off the VM (if running) and undefine the XML domain.
* Ephemeral Storage Purge: The instance directory (`/opt/pf9/data/instances/[VM_UUID]`) containing the disk is permanently deleted.

  ```bash
  $ grep <REQ_ID> /var/log/pf9/ostackhost.log
  INFO nova.compute.manager [[REQ_ID] [USERNAME] service] [instance: [VM_UUID]] Terminating instance
  INFO nova.virt.libvirt.driver [[REQ_ID] [USERNAME] service] [instance: [VM_UUID]] Instance destroyed successfully.
  ```
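
To confirm that the ephemeral storage purge actually happened, you can check on the compute node whether the instance directory is still present. This is a sketch that uses the path named above as its default; `check_instance_dir` is a hypothetical helper name.

```bash
# Hypothetical helper: report whether a VM's instance directory still exists.
# The base directory defaults to the path named above; run on the compute node.
check_instance_dir() {
  local vm_uuid="$1" base_dir="${2:-/opt/pf9/data/instances}"
  if [ -d "$base_dir/$vm_uuid" ]; then
    echo "leftover: $base_dir/$vm_uuid"
    return 1
  fi
  echo "clean: no instance directory for $vm_uuid"
}
```

A `leftover` result after a reported successful deletion suggests the hypervisor cleanup did not complete.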

### Step 3: Storage Detachment

If the VM has attached Volumes, Nova requests the Storage service (Cinder) to detach them.

* If the volume was configured to "Delete on Terminate", Cinder proceeds to delete the volume from the storage backend.
* If not (which is standard for data retention), the volume status simply changes back to `available` for future use.

{% hint style="warning" %}
If a volume is stuck in a `detaching` state (e.g., due to a lock or unresponsive storage backend), the entire VM deletion process will hang.
{% endhint %}

### Step 4: Network (Neutron) Cleanup

Nova signals Neutron to unbind and delete the virtual network interface (VIF) and associated ports. Neutron removes the OVS/OVN rules and releases the IP address back to the IPAM pool.

### Step 5: Final Database Purge

Once all resources report successful deletion, `nova-conductor` marks the instance as `deleted` in the database, effectively removing it from the user's view and freeing up project quota.

## Procedure

### 1. Get the VM Status

Check the current state of the VM. If it is stuck, it will typically show a status of `ERROR` or a task state of `deleting`. Look specifically at the `status`, `OS-EXT-STS:task_state`, and `fault` fields.

```bash
$ openstack server show <VM_UUID>
```
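
To cut the output down to the three fields named above, the `-c` column filter can be repeated. This is a minimal sketch; `vm_delete_state` is a hypothetical helper name, and the `fault` field may be absent when the VM has not recorded an error.

```bash
# Hypothetical helper: show only the fields that matter for a stuck deletion.
vm_delete_state() {
  openstack server show "$1" \
    -c status -c OS-EXT-STS:task_state -c fault
}
```

Call it as `vm_delete_state <VM_UUID>`.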

### 2. Attempt a Force Delete (If stuck)

If the standard delete has hung for more than 10-15 minutes, attempt to force the deletion via the API by resetting its state.

```bash
$ openstack server set --state error <VM_UUID>
$ openstack server delete <VM_UUID>
```

### 3. Trace the VM Events

Retrieve the Request ID (`REQ_ID`) from the server event list to find exactly which component (Compute, Network, or Storage) is holding up the deletion.

```bash
$ openstack server event list <VM_UUID>
$ openstack server event show <VM_UUID> <REQ_ID>
```
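
If the VM has a long event history, it can help to list only the request IDs that belong to delete actions. This is a sketch, assuming the event list exposes `Request ID` and `Action` columns and that request IDs start with `req-`; `delete_request_ids` is a hypothetical helper name.

```bash
# Hypothetical helper: print the request IDs of 'delete' actions for a VM.
# Assumption: the event list exposes "Request ID" and "Action" columns.
delete_request_ids() {
  openstack server event list "$1" -f value -c "Request ID" -c Action |
    awk '/delete/ { for (i = 1; i <= NF; i++) if ($i ~ /^req-/) print $i }'
}
```

Each ID it prints can be passed to `openstack server event show <VM_UUID> <REQ_ID>`.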

### 4. Verify Compute Service Health

A VM cannot be cleanly deleted if the compute node hosting it is offline or the `pf9-ostackhost` service is dead. Check if the host is `up`.

```bash
$ openstack compute service list
```
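
On a large deployment it may be easier to print only the hosts whose compute service is not `up`. This is a minimal sketch, assuming the service is registered under the `nova-compute` binary name; `down_compute_hosts` is a hypothetical helper name.

```bash
# Hypothetical helper: list compute hosts whose service state is not 'up'.
# Assumption: the service is registered under the 'nova-compute' binary name.
down_compute_hosts() {
  openstack compute service list --service nova-compute \
    -f value -c Host -c State |
    awk 'NF >= 2 && $2 != "up" { print $1 }'
}
```

If the host running the stuck VM appears in the output, restore the host or its `pf9-ostackhost` service before retrying the deletion.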

### 5. Check for Stuck Volumes & Verify Retention

Identify if the VM has attached volumes that are failing to detach, and confirm whether those volumes are meant to be kept or purged.

```bash
# 1. List attached volumes
$ openstack server show <VM_UUID> -c volumes_attached

# 2. Check the volume status (Look for 'detaching' or 'error_deleting')
$ openstack volume show <VOLUME_ID>

# 3. If a volume is genuinely stuck, you may need to reset its state manually (Admin only) to unblock the VM deletion
$ openstack volume set --state available <VOLUME_ID>
```
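
When a VM has several volumes, the status check can be looped so that only the problem volumes are reported. This is a sketch that flags the two states mentioned above; `stuck_volumes` is a hypothetical helper name.

```bash
# Hypothetical helper: flag volumes stuck in 'detaching' or 'error_deleting'.
# Usage: stuck_volumes <VOLUME_ID> [<VOLUME_ID> ...]
stuck_volumes() {
  local volume_id status
  for volume_id in "$@"; do
    # Skip volumes that cannot be queried (for example, already deleted).
    status=$(openstack volume show "$volume_id" -f value -c status) || continue
    case "$status" in
      detaching|error_deleting) echo "$volume_id $status" ;;
    esac
  done
}
```

The helper only reports. Resetting a volume's state remains a deliberate, manual admin step.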

### 6. Check for Orphaned Neutron Ports

Sometimes the network port gets locked by another service (like a stale floating IP or router interface), preventing Nova from deleting it.

```bash
# 1. List ports associated with the VM
$ openstack port list --device-id <VM_UUID>

# 2. If ports exist but the VM is gone/stuck, force delete the port manually
$ openstack port delete <PORT_ID>
```
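
Because deleting ports is destructive, a dry run that prints the delete commands without executing them can be a safer first pass. This is a minimal sketch; `leftover_port_commands` is a hypothetical helper name.

```bash
# Hypothetical helper: print (not run) the delete command for each leftover port.
leftover_port_commands() {
  openstack port list --device-id "$1" -f value -c ID |
    while read -r port_id; do
      if [ -n "$port_id" ]; then
        echo "openstack port delete $port_id"
      fi
    done
}
```

Review the printed commands and run them manually once you have confirmed the ports are no longer needed.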

### 7. Review the Logs on the Compute Node

If the VM is still stuck, review the compute node's logs to see if Libvirt is failing to destroy the domain (e.g., due to a hung QEMU process). Search for the `REQ_ID` or `VM_UUID`.

```bash
$ less /var/log/pf9/ostackhost.log
$ less /var/log/libvirt/libvirtd.log
```
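
Rather than paging through each file, both identifiers can be searched for in one pass. This is a sketch using fixed-string matching; `grep_delete_trace` is a hypothetical helper name.

```bash
# Hypothetical helper: search log files for a request ID or a VM UUID.
# Usage: grep_delete_trace <REQ_ID> <VM_UUID> <LOG_FILE> [<LOG_FILE> ...]
grep_delete_trace() {
  local req_id="$1" vm_uuid="$2"
  shift 2
  # -F treats both identifiers as literal strings, not regular expressions.
  grep -F -e "$req_id" -e "$vm_uuid" "$@"
}
```

For example: `grep_delete_trace <REQ_ID> <VM_UUID> /var/log/pf9/ostackhost.log /var/log/libvirt/libvirtd.log`.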

### 8. Last Resort: Manual Database Update (Hard DB Delete)

If all API methods fail and the underlying hypervisor, storage, and network resources have been *manually verified as destroyed or safely detached*, you can forcefully remove the VM from the database to clear it from the UI.

{% hint style="danger" %}
The steps below can only be performed in Self-Hosted Private Cloud Director. For the SaaS model, contact the Platform9 Support Team.

Modifying the database directly bypasses all safety checks. This will NOT free up physical compute resources, network IPs, or storage. Only perform this if you have already manually purged the VM's Libvirt XML on the compute node and deleted its Neutron ports. Failure to clean up physical resources first will result in orphaned "ghost" infrastructure and capacity leaks.
{% endhint %}

```bash
# 1. Export your target namespace
export NS=<WORKLOAD_NAMESPACE>

# 2. Extract the DB Server name from Consul
export DBSERVER=$(kubectl exec deploy/resmgr -c resmgr -n $NS -it -- bash -c "consul-dump-yaml --start-key customers/\$CUSTOMER_ID/regions/\$REGION_ID/db" | yq -r '.customers.[$CUSTOMER_ID].regions[$REGION_ID].dbserver')

# 3. Extract the DB Admin Password
export DBADMINPASS=$(kubectl exec deploy/resmgr -c resmgr -n $NS -it -- bash -c "consul-dump-yaml --start-key customers/\$CUSTOMER_ID/dbservers/$DBSERVER" | yq -r '.customers.[$CUSTOMER_ID].dbservers.[$DBSERVER].admin_pass')

# 4. Connect to the MySQL Database
kubectl exec -it deploy/mysqld-exporter -n $NS -c mysqld-exporter -- mysql resmgr -u root -p"$DBADMINPASS"

# 5. At the MySQL prompt, mark the instance as deleted
MySQL [resmgr]> use nova;
MySQL [nova]> UPDATE instances SET vm_state='deleted', task_state=NULL, deleted=id, deleted_at=NOW() WHERE uuid='<VM_UUID>';
MySQL [nova]> exit
```

## Most common causes

* **Dead Compute Node**: The host machine is offline, down, or isolated from the management plane, meaning it never receives the RPC message to delete the VM.
* **Stuck Volume Detachment**: Cinder cannot terminate the iSCSI session or Ceph RBD lock, causing the volume to freeze in a `detaching` state, which holds the Nova deletion task hostage.
* **Hung QEMU/KVM Process**: The hypervisor process for the VM has become unresponsive to standard ACPI shutdown signals or `libvirt` destroy commands.
* **Locked Network Ports**: A Neutron port fails to delete because it is still incorrectly bound to a floating IP or a stale security group rule.
* **Database Desynchronization**: A temporary network blip allowed `pf9-ostackhost` to delete the VM locally but prevented it from updating `nova-conductor`, leaving a "ghost" VM in the dashboard.

