Troubleshooting Virtual Machine Pause State & Hung Issues
Problem
This guide provides step-by-step instructions for troubleshooting and resolving issues when a virtual machine (VM) unexpectedly enters a PAUSED state or becomes completely unresponsive (hung) in Private Cloud Director.
Environment
Private Cloud Director Virtualization - v2025.4 and Higher
Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
Deep Dive: Architecture & State Management
The state of a Virtual Machine in Private Cloud Director is tracked in two places: the PCD database (managed by Nova) and the actual Hypervisor layer (managed by Libvirt/QEMU). When troubleshooting pause or hang issues, it is critical to understand that these two states can sometimes become desynchronized.
Hypervisor-Initiated Pause (I/O Error): The VM is paused automatically by QEMU to prevent data corruption due to underlying storage backend disconnects or full disks. It sends an asynchronous event to pf9-ostackhost, which updates the OpenStack DB to PAUSED.
Admin-Initiated Pause: The VM was intentionally paused via the OpenStack API or Dashboard by an administrator.
OOM (Out of Memory) Lockup: The VM shows as ACTIVE in OpenStack, but is completely hung because the hypervisor's OOM killer terminated the QEMU process, or the Guest OS exhausted its memory.
Guest OS Kernel Panic: The VM shows as ACTIVE, but is unresponsive to ping or SSH due to a Guest OS kernel crash or a bootloader failure.
Procedure
1. Get the VM status
Check the current state known to the control plane and identify the host.
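A minimal sketch of this check, assuming the standard openstack CLI is installed and admin credentials are already loaded in the shell. The helper name vm_summary is hypothetical and not part of Private Cloud Director.

```shell
# Hypothetical helper: show only the fields this step relies on.
# Assumes the standard `openstack` CLI and sourced admin credentials.
vm_summary() {
    # $1 = VM UUID or name
    openstack server show "$1" \
        -c status \
        -c OS-EXT-SRV-ATTR:host \
        -c image \
        -c volumes_attached
}

# Example (placeholder UUID):
#   vm_summary <VM_UUID>
```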
Look for the status and OS-EXT-SRV-ATTR:host fields. Note the compute host where the VM resides. If OpenStack shows ACTIVE but the VM is unreachable, the issue lies either inside the Guest OS or at the Libvirt level.
Ephemeral Root: If image lists an image name/ID, the VM's OS lives on the compute node's local disk. (Any volumes listed in volumes_attached are secondary.)
Volume-Backed Root: If image is empty (the openstack CLI may display N/A (booted from volume) instead), the VM is Boot-from-Volume.
2. Verify the Libvirt State on the Compute Node
Log into the underlying hypervisor (Compute Node) identified in Step 1 and verify the physical process state.
Compare the state reported by virsh with the OpenStack status. If virsh says paused but OpenStack says ACTIVE, the states are out of sync.
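The check and comparison in this step can be sketched as follows. This assumes virsh is run on the compute node with root privileges, and that the libvirt domain UUID matches the OpenStack instance UUID, as is normally the case on Nova-managed hosts. Both helper names are hypothetical.

```shell
# Hypothetical helpers for comparing control-plane and hypervisor state.

# Run on the compute node (root or sudo is typically required).
# virsh accepts a domain name, ID, or UUID.
libvirt_state() {
    virsh domstate "$1"
}

# Pure comparison: $1 = OpenStack status, $2 = libvirt state.
# Covers only the two states discussed in this guide.
states_in_sync() {
    case "$1:$2" in
        ACTIVE:running|PAUSED:paused) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (placeholder UUID):
#   states_in_sync ACTIVE "$(libvirt_state <VM_UUID>)" || echo "out of sync"
```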
3. Interrogate the QEMU Monitor for I/O Errors
If the VM is physically paused at the hypervisor level, ask the QEMU monitor for the exact reason.
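One way to query the monitor is through virsh, sketched below. The helper names are hypothetical, and the matched reasons are the two named in this guide. The exact wording of the status line may differ between QEMU versions.

```shell
# Hypothetical helpers; run on the compute node as root or with sudo.

# Ask the QEMU human monitor (HMP) for the run state of a domain.
qemu_status() {
    virsh qemu-monitor-command "$1" --hmp "info status"
}

# Classify a status line read from stdin.
# Prints "storage" for the I/O-related reasons named in this guide.
pause_reason() {
    if grep -q -E 'paused \((io-error|enospc)\)'; then
        echo "storage"
    else
        echo "other"
    fi
}

# Example (placeholder domain):
#   qemu_status <DOMAIN> | pause_reason
```

Running virsh domblkerror against the same domain can also help, since it lists block devices that are currently in an error state.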
If the output says VM status: paused (io-error) or (enospc), you have a storage disconnect or a full local disk on the hypervisor.
Do not unpause the VM until the underlying storage issue is resolved, or you risk corrupting the Guest OS filesystem.
4. Attempt to Unpause / Resume the VM
If the underlying storage is healthy (or was just fixed), attempt to wake the VM back up.
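A hedged sketch of the resume step. It assumes that a pause recorded by OpenStack should be cleared through the API, and that a pause visible only in libvirt is resumed directly on the compute node. The helper name resume_vm is hypothetical. Resuming at the hypervisor level bypasses the control plane, so treat that branch as a fallback.

```shell
# Hypothetical helper: pick the resume path from the OpenStack status.
#   $1 = VM UUID, $2 = status reported by OpenStack
resume_vm() {
    case "$2" in
        # Control plane knows about the pause: go through the API.
        PAUSED) openstack server unpause "$1" ;;
        # OpenStack says ACTIVE while libvirt reports paused:
        # resume at the hypervisor level (run on the compute node).
        ACTIVE) virsh resume "$1" ;;
        *) echo "unexpected status: $2" >&2; return 1 ;;
    esac
}

# Example (placeholder UUID):
#   resume_vm <VM_UUID> PAUSED
```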
If the VM pauses again immediately after you resume it, the hypervisor is still detecting a critical storage failure.
5. Check Compute Node Logs
Review hypervisor-level logs to see if OpenStack caught the pause event, or to find low-level QEMU crashes.
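The log searches can be sketched as below. The paths /var/log/pf9/ostackhost.log and /var/log/libvirt/qemu are assumed defaults and may differ on your hosts, so both can be overridden through environment variables. The helper names are hypothetical.

```shell
# Assumed default log locations; adjust if your hosts differ.
OSTACKHOST_LOG="${OSTACKHOST_LOG:-/var/log/pf9/ostackhost.log}"
QEMU_LOG_DIR="${QEMU_LOG_DIR:-/var/log/libvirt/qemu}"

# Unpause/resume activity recorded by pf9-ostackhost for one VM UUID.
ostackhost_events() {
    grep -F "$1" "$OSTACKHOST_LOG" | grep -E 'Unpausing instance|Resuming instance'
}

# Block I/O errors in the per-domain QEMU log.
# $1 = libvirt domain name (the log file is named after it).
qemu_io_errors() {
    grep -i 'block I/O error' "$QEMU_LOG_DIR/$1.log"
}

# Example (placeholders):
#   ostackhost_events <VM_UUID>
#   qemu_io_errors <DOMAIN_NAME>
```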
In ostackhost.log, look for Unpausing instance or Resuming instance.
In the qemu log, look for block I/O error in device 'drive-virtio-disk0': No space left on device (28). This confirms a full disk or storage drop.
6. Dump the Console Log (For Silent Hangs)
If the VM is ACTIVE everywhere but completely unresponsive to ping/SSH, the Guest OS likely crashed. Dump the console to look for kernel panics or boot loops.
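A minimal sketch for pulling and scanning the console, assuming the standard openstack CLI. The helper names and the default of 200 lines are illustrative choices. The search patterns cover only text that would appear on a serial console, so graphical Windows stop screens are not matched.

```shell
# Fetch the tail of the console log (standard OpenStack CLI assumed).
#   $1 = VM UUID or name, $2 = number of lines (defaults to 200)
console_tail() {
    openstack console log show --lines "${2:-200}" "$1"
}

# Flag common crash signatures in console text read from stdin.
# Prints matching lines prefixed with their line number.
scan_console() {
    grep -n -i -E 'kernel panic - not syncing|cloud-init.*(fail|error)'
}

# Example (placeholder UUID):
#   console_tail <VM_UUID> | scan_console
```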
Analysis: Look for OS-level errors like Kernel panic - not syncing, Windows Blue Screen stop codes, or cloud-init failures.
Most Common Causes
Storage Disconnects: The Compute Node temporarily lost its connection to the Cinder volume backend, triggering an automated QEMU safety pause.
No Space Left on Device (enospc): The hypervisor's local disk (/opt/pf9/data/instances) is 100% full, preventing local ephemeral disks from writing data.
OOM Killer: The physical compute node ran out of RAM, and the Linux kernel terminated the QEMU process to save the host.
Guest OS Kernel Panic: Bad updates, corrupted filesystems, or misconfigured drivers inside the VM caused Windows/Linux to blue-screen or kernel panic.
Host CPU Lockup: A hardware CPU thread locked up on the physical server, hanging the vCPUs of the guest VM.
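Several of these causes can be checked directly on the compute node. The sketch below uses standard Linux tools. The helper names are hypothetical, and the kernel message patterns are common ones that may vary by kernel version.

```shell
# Percentage of space used on the filesystem holding a path.
disk_used_pct() {
    # -P forces the one-line-per-filesystem POSIX layout; column 5 is capacity.
    df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Filter kernel messages (stdin) for OOM-killer and CPU lockup events.
host_kernel_events() {
    grep -i -E 'out of memory|oom-killer|soft lockup|hard lockup'
}

# Example, on the compute node:
#   disk_used_pct /opt/pf9/data/instances
#   dmesg -T | host_kernel_events
```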
