# Troubleshooting Virtual Machine Boot Failures

## Problem

This guide provides step-by-step instructions for troubleshooting Virtual Machines that have successfully deployed (status is `ACTIVE` in PCD) but fail to boot at the Guest OS or Hypervisor level. This includes VMs stuck at the BIOS/UEFI screen, boot-looping, suffering Kernel Panics, Windows BSODs, or failing to configure themselves via metadata.

## Environment

* Private Cloud Director Virtualization - v2025.4 and Higher
* Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher

## Deep Dive: The Boot Flow (Post-Creation)

Once the control plane wires the virtual hardware and tells Libvirt to "Start", the management plane steps back. From this point on, the boot process is entirely dependent on the Guest OS and its interaction with the virtual hardware.

The Guest Boot Pipeline:

1. QEMU Initialization: The hypervisor process starts and allocates the virtual RAM/CPU.
2. BIOS / UEFI Handoff: The virtual motherboard searches for a bootable block device (Volume or Image).
3. Bootloader (GRUB / Windows Boot Manager): Loads the OS kernel into memory.
4. Kernel / OS Initialization: The Linux or Windows operating system boots, mounting the root filesystem.
5. Metadata (Post-Boot): The OS (`cloud-init` or `cloudbase-init`) reaches out to the Cloud Metadata service (`169.254.169.254`) to download the user's SSH keys, password, and hostname.

## Procedure

### 1. Verify the VM State&#x20;

Confirm the VM actually passed the creation phase and is attempting to run.

```shellscript
$ openstack server show <VM_UUID> --fit
```

* &#x20;If `status` is `ERROR` or `BUILD`, stop here. This is a Creation issue, not a Boot issue.
* If `status` is `ACTIVE` and `power_state` is `Running`, the provisioning phase succeeded. Proceed to Step 2.

### 2. Console Log Inspection&#x20;

&#x20;Dump the raw TTY output from the Guest OS to see exactly where the boot process stalled.

```shellscript
$ openstack console log show <VM_UUID>
```

* &#x20; `No bootable device` / `Boot failed: not a bootable disk`: The BIOS cannot find a bootloader. The image is blank, corrupted, or the volume is not flagged as bootable.
* `Kernel panic - not syncing` (Linux): The OS kernel crashed. Usually caused by a corrupted image, incompatible hypervisor CPU flags, or missing drivers.
* `dracut-initqueue timeout` / `dropping to emergency shell` (Linux): The kernel loaded, but it cannot find or mount the root filesystem partition (`/`).
* `cloud-init [...] giving up on network`: The OS booted perfectly, but the VM has no network connection and cannot fetch its configuration.

{% hint style="info" %}
Note: Windows VMs rarely output useful errors to the serial console. If troubleshooting Windows, proceed immediately to Step 3).
{% endhint %}

### 3. Live VNC Interrogation (Windows & Linux)

If the console log is unhelpful or you need to intervene manually (e.g., interacting with the GRUB menu or Windows Recovery), generate a VNC console link.

```shellscript
$ openstack console url show <VM_UUID>
```

* Open the provided URL in a web browser.
  * Windows BSOD (`INACCESSIBLE_BOOT_DEVICE`): The Windows image lacks KVM/VirtIO storage drivers. You must inject `viostor`/`vioscsi` drivers into the image before booting.
  * GRUB Menu (Linux): The bootloader is intact but the kernel is failing to load automatically.
  * Black Screen / Blinking Cursor: The display driver may have crashed, or the OS is hung during early initialization.

### 4. Verify Image / Volume Boot Flags

&#x20;If the VM threw a `No bootable device` error in Step 2, verify the source media is actually configured correctly in PCD.

```shellscript
# If booted from an Image:
$ openstack image show <IMAGE_ID> 

# If booted from a Volume:
$ openstack volume show <VOLUME_ID> 
```

* Volume Bootable Flag: If `bootable` is `false`, QEMU will refuse to treat the attached disk as a primary boot device. Fix this with: `$ openstack volume set --bootable <VOLUME_ID>`.
* Inherited Metadata (Volume) / Image Properties: Look at the `properties` (for images) or `volume_image_metadata` (for volumes). Ensure tags like `hw_machine_type` (e.g., `q35`), `hw_firmware_type` (e.g., `uefi` vs `bios`), and `os_type` match what the Guest OS expects. If a volume was created from a bad image, the volume itself is fundamentally flawed and usually needs to be rebuilt from a corrected image.

### 5. Investigate Metadata Failures (SSH/Passwords)

If the VM boots to a login prompt in VNC, but you cannot SSH/RDP into it because the key pair/password was not injected, the initialization agent failed to reach the Cloud Metadata API.

* Logs to Check (Inside the VM via VNC):
  * Linux: `$ sudo tail -n 100 /var/log/cloud-init.log`
  * Windows: Check `C:\Program Files\Cloudbase Solutions\Cloudbase-Init\log\cloudbase-init.log`
* Look for `DataSourceNotFound`. The VM tried to route to `169.254.169.254` but failed.
* This is almost always a Network issue. Refer the following [doc.](/kb/pcd/networking/vms-unable-to-retrieve-metadata-from-cloud-init.md)

### 6. Hypervisor-Level Boot Crashes (QEMU / Libvirt)

&#x20;If the VM is `ACTIVE` but instantly turns off, or if VNC refuses to connect entirely, the QEMU process itself may have crashed immediately after handoff. Run on the Compute Node.

{% code title="Compute Host" %}

```shellscript
$ less /var/log/pf9/ostackhost.log       ##Track using UUIDs
$ cat /var/log/libvirt/qemu/<VM_UUID>.log
```

{% endcode %}

* Analysis :
  * `qemu-system-x86_64: ... CPU feature XYZ not found`: The requested Flavor told QEMU to emulate a specific CPU feature that your physical server hardware does not actually support. The boot aborts immediately.
  * `KVM internal error. Suberror: 1`: A severe hardware virtualization fault. Check `dmesg -T` on the hypervisor for underlying kernel/hardware issues.

## Most Common Post-Provisioning Boot Causes

* **VirtIO Driver Missing (Windows)**: The most common cause of Windows boot failures (`INACCESSIBLE_BOOT_DEVICE`). The image was built for VMware/Hyper-V and lacks KVM/VirtIO storage drivers.
* **Disk Format Mismatch**: Uploading an image as `raw` when the file is actually `qcow2` (or vice versa). The hypervisor's storage driver misinterprets the disk headers, leading to immediate bootloader failures. Refer the following [doc](/kb/pcd/image-library/vm-migration-and-booting-issue-due-to-disk-format-mismatch.md) to check this.
* **Corrupted Image Upload**: A `.ova` or compressed `.tar.gz` file was uploaded directly to the Image service without being extracted to a raw or `.qcow2` disk file first. QEMU cannot boot an archive file.
* **Metadata Routing Failure**: The VM boots flawlessly, but a missing default gateway or blocked DHCP prevents the VM from downloading its SSH keys from the metadata API.
* **Bootable Flag Missing**: A Cinder volume containing a valid OS was attached, but the platform wasn't told it was a boot device, so the BIOS skips it.
* Disk format mismatch could also be another potential cause.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://platform9.com/kb/pcd-ts/vm-deployment/troubleshooting-virtual-machine-boot-failures.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
