Troubleshooting Paused and Hung Virtual Machines

Problem

This guide provides step-by-step instructions for troubleshooting and resolving issues when a virtual machine (VM) unexpectedly enters a PAUSED state or becomes completely unresponsive (hung) in Private Cloud Director.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher

Deep Dive: Architecture & State Management

The state of a Virtual Machine in Private Cloud Director is tracked in two places: the PCD database (managed by Nova) and the actual Hypervisor layer (managed by Libvirt/QEMU). When troubleshooting pause or hang issues, it is critical to understand that these two states can sometimes become desynchronized.

  • Hypervisor-Initiated Pause (I/O Error): QEMU pauses the VM automatically to prevent data corruption when the underlying storage backend disconnects or a disk fills up. The hypervisor then sends an asynchronous event to pf9-ostackhost, which updates the OpenStack DB to PAUSED.

  • Admin-Initiated Pause: The VM was intentionally paused via the OpenStack API or Dashboard by an administrator.

  • OOM (Out of Memory) Lockup: The VM shows as ACTIVE in OpenStack, but is completely hung because the hypervisor's OOM killer terminated the QEMU process, or the Guest OS exhausted its memory.

  • Guest OS Kernel Panic: The VM shows as ACTIVE, but is unresponsive to ping or SSH due to a Guest OS kernel crash or a bootloader failure.

Procedure

1. Get the VM status

Check the current state known to the control plane and identify the host.

  • In the server details, look for the status and OS-EXT-SRV-ATTR:host fields. Note the compute host where the VM resides. If OpenStack shows ACTIVE but the VM is unreachable, the issue lies either inside the Guest OS or at the Libvirt level.

  • Ephemeral Root: If the image field lists an image name/ID, the VM's OS lives on the compute node's local disk. (Any volumes listed in volumes_attached are secondary.)

  • Volume-Backed Root: If the image field is completely empty, the VM is Boot-from-Volume and its OS lives on an attached volume.
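
The check above can be sketched with the standard OpenStack CLI. The commands in the comments are real openstack client invocations, and <VM_UUID> is a placeholder. The helper classify_root_disk is hypothetical and exists only for illustration. It also treats the string "N/A (booted from volume)" as volume-backed, on the assumption that some client versions print that instead of an empty image field.

```shell
# Show only the fields this step relies on (run where the OpenStack CLI is configured):
#   openstack server show <VM_UUID> -c status -c OS-EXT-SRV-ATTR:host -c image -c volumes_attached

# classify_root_disk IMAGE_FIELD  (hypothetical helper, not part of any CLI)
# Prints "volume-backed" when the image field is empty, otherwise "ephemeral".
classify_root_disk() {
  local image="$1"
  # Trim leading and trailing whitespace.
  image="${image#"${image%%[![:space:]]*}"}"
  image="${image%"${image##*[![:space:]]}"}"
  case "$image" in
    ""|"N/A (booted from volume)") echo "volume-backed" ;;  # second form is an assumption
    *) echo "ephemeral" ;;
  esac
}
```

One possible usage: classify_root_disk "$(openstack server show <VM_UUID> -f value -c image)".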

2. Verify the Libvirt State on the Compute Node

Log into the underlying hypervisor (Compute Node) identified in Step 1 and verify the physical process state.

  • Compare the state reported by virsh to the OpenStack status. If virsh says paused but OpenStack says ACTIVE, the states are out of sync.
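
A minimal sketch of this comparison follows. The virsh commands in the comments are standard libvirt tooling. Addressing the domain by the instance UUID assumes the libvirt domain UUID matches the OpenStack instance UUID; if it does not in your deployment, use the domain name shown by virsh list --all. The two helper functions are hypothetical, and their status mapping covers only the three common states.

```shell
# On the compute node, as a user allowed to talk to libvirt:
#   virsh list --all
#   virsh domstate <VM_UUID> --reason     # e.g. "paused (I/O error)"

# expected_libvirt_state NOVA_STATUS  (hypothetical helper)
# Maps an OpenStack status to the libvirt state it should correspond to.
expected_libvirt_state() {
  case "$1" in
    ACTIVE)  echo "running" ;;
    PAUSED)  echo "paused" ;;
    SHUTOFF) echo "shut off" ;;
    *)       echo "unknown" ;;
  esac
}

# states_in_sync NOVA_STATUS LIBVIRT_STATE  (hypothetical helper)
# Returns 0 when the two layers agree, non-zero otherwise.
states_in_sync() {
  local expected actual
  expected="$(expected_libvirt_state "$1")"
  actual="${2%% (*}"   # drop a trailing " (reason)" if present
  [ "$expected" != "unknown" ] && [ "$expected" = "$actual" ]
}
```

For example, states_in_sync ACTIVE "paused (I/O error)" returns non-zero, which is the desynchronized case described above.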

3. Interrogate the QEMU Monitor for I/O Errors

If the VM is physically paused at the hypervisor level, ask the QEMU monitor for the exact reason.

  • If the output says VM status: paused (io-error) or (enospc), you have a storage disconnect or a full local disk on the hypervisor.
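
As a sketch, the QEMU monitor can be queried through virsh as shown in the comments below; both commands exist in standard libvirt, and <VM_UUID> is a placeholder. The helper pause_reason is hypothetical and simply extracts the text in parentheses from a status line such as the one quoted above.

```shell
# On the compute node:
#   virsh qemu-monitor-command <VM_UUID> --hmp 'info status'
#   virsh domblkerror <VM_UUID>           # lists block devices currently in error

# pause_reason STATUS_LINE  (hypothetical helper)
# Prints the parenthesised reason from a line like
# "VM status: paused (io-error)", or "none" when there is no reason.
pause_reason() {
  local line="$1"
  case "$line" in
    *"("*")"*)
      line="${line#*\(}"    # drop everything up to the first "("
      line="${line%%\)*}"   # drop everything from the first ")"
      echo "$line"
      ;;
    *) echo "none" ;;
  esac
}
```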

4. Attempt to Unpause / Resume the VM

If the underlying storage is healthy (or was just fixed), attempt to wake the VM back up.

  • If the VM pauses again immediately after it is resumed, the hypervisor is still detecting a critical storage failure.
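
The sketch below shows one way to choose a resume command. Both openstack server unpause and virsh resume are real commands. The rule used here is an assumption rather than documented product behavior: go through the OpenStack API when the control plane already shows PAUSED, and fall back to virsh only when the pause is visible at the hypervisor alone. The helper resume_command is hypothetical and only prints the suggested command.

```shell
# resume_command NOVA_STATUS LIBVIRT_STATE VM_UUID  (hypothetical helper)
# Prints the command to try; it does not execute anything.
resume_command() {
  local nova="$1" libvirt="${2%% (*}" uuid="$3"
  if [ "$libvirt" != "paused" ]; then
    echo "nothing to resume: libvirt state is '$libvirt'"
  elif [ "$nova" = "PAUSED" ]; then
    # Control plane knows about the pause: use the API.
    echo "openstack server unpause $uuid"
  else
    # Paused only at the hypervisor: resume directly on the compute node.
    echo "virsh resume $uuid"
  fi
}
```

For instance, resume_command ACTIVE "paused (I/O error)" <VM_UUID> prints the virsh resume form.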

5. Check Compute Node Logs

Review hypervisor-level logs to see if OpenStack caught the pause event, or to find low-level QEMU crashes.

  • In ostackhost.log, look for Unpausing instance or Resuming instance.

  • In the qemu log, look for block I/O error in device 'drive-virtio-disk0': No space left on device (28). This confirms a full disk or storage drop.
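
A small sketch for scanning both logs for the strings listed above. The log locations in the comments are assumptions based on common defaults and may differ in your deployment. The helper find_pause_evidence is hypothetical.

```shell
# Assumed default locations (verify on your compute node):
#   /var/log/pf9/ostackhost.log
#   /var/log/libvirt/qemu/<domain-name>.log

# find_pause_evidence LOG_FILE  (hypothetical helper)
# Prints matching lines with line numbers; returns non-zero when nothing matches.
find_pause_evidence() {
  grep -nE 'Unpausing instance|Resuming instance|block I/O error|No space left on device' "$1"
}
```

Possible usage: find_pause_evidence /var/log/pf9/ostackhost.log | tail -n 20.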

6. Dump the Console Log (For Silent Hangs)

If the VM is ACTIVE everywhere but completely unresponsive to ping/SSH, the Guest OS likely crashed. Dump the console to look for kernel panics or boot loops.

  • Analysis: Look for OS-level errors like Kernel panic - not syncing, Windows Blue Screen stop codes, or cloud-init failures.
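
The console can be dumped with the OpenStack CLI as shown in the comment below, which is a real command with <VM_UUID> as a placeholder. The helper console_verdict is a hypothetical illustration that checks the text for two of the signatures named above. Its cloud-init pattern is a rough guess, not an exhaustive match.

```shell
# Dump the last 200 lines of the serial console:
#   openstack console log show <VM_UUID> --lines 200

# console_verdict  (hypothetical helper, reads console text on stdin)
console_verdict() {
  local text
  text="$(cat)"
  if printf '%s\n' "$text" | grep -qi 'kernel panic - not syncing'; then
    echo "kernel-panic"
  elif printf '%s\n' "$text" | grep -qiE 'cloud-init.*(error|fail)'; then
    echo "cloud-init-failure"   # heuristic pattern, adjust as needed
  else
    echo "no-known-signature"
  fi
}
```

Possible usage: openstack console log show <VM_UUID> --lines 200 | console_verdict.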

Most Common Causes

  • Storage Disconnects: The Compute Node temporarily lost its connection to the Cinder volume backend, triggering an automated QEMU safety pause.

  • No Space Left on Device (enospc): The hypervisor's local disk (/opt/pf9/data/instances) is 100% full, preventing local ephemeral disks from writing data.

  • OOM Killer: The physical compute node ran out of RAM, and the Linux kernel terminated the QEMU process to save the host.

  • Guest OS Kernel Panic: Bad updates, corrupted filesystems, or misconfigured drivers inside the VM caused Windows/Linux to blue-screen or kernel panic.

  • Host CPU Lockup: A hardware CPU thread locked up on the physical server, hanging the vCPUs of the guest VM.
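
Several of these causes can be checked quickly on the compute node. The sketch below uses only standard tools (df, awk, dmesg). The two helpers are hypothetical, and the kernel message fragments they match are typical Linux wording that may vary between kernel versions.

```shell
# On the compute node:
#   df -P /opt/pf9/data/instances
#   dmesg -T | grep -iE 'out of memory|oom-killer|soft lockup'

# disk_use_percent PATH  (hypothetical helper)
# Prints the integer use percentage of the filesystem holding PATH.
disk_use_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# classify_kernel_line LINE  (hypothetical helper)
# Labels a kernel log line as "oom", "cpu-lockup", or "other".
classify_kernel_line() {
  case "$1" in
    *"Out of memory"*|*"invoked oom-killer"*) echo "oom" ;;
    *"soft lockup"*|*"hard LOCKUP"*)          echo "cpu-lockup" ;;
    *)                                        echo "other" ;;
  esac
}
```

A result of 100 from disk_use_percent /opt/pf9/data/instances would correspond to the full-disk cause listed above.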
