High CPU Usage Due to Orphaned VM Processes on Hypervisor Nodes
Problem
In some environments, compute nodes may experience elevated CPU utilization caused by orphaned virtual machine (QEMU) processes. This occurs when certain instances are deleted or migrated at the control-plane level (Compute DB), but their corresponding QEMU processes continue to run on the hypervisor.
Symptoms include:
QEMU processes consuming CPU although the VM no longer exists in the database.
Mismatch between the number of instances reported in Nova DB versus running on the hypervisor.
Periodic warnings in nova.compute.manager logs during instance power-state synchronization.
Environment
Private Cloud Director Virtualization - v2025.6 and higher
Private Cloud Director Kubernetes - v2025.6 and higher
Self-Hosted Private Cloud Director Virtualization - v2025.6 and higher
Self-Hosted Private Cloud Director Kubernetes - v2025.6 and higher
Component: Compute Service
Cause
The Compute Service periodically runs a synchronization cycle in which it compares the instances running on the hypervisor against the entries in the Compute database.
In this case:
Several VMs were deleted or migrated, but their QEMU processes persisted on the source hypervisor.
These orphaned processes continued running because the default Nova behavior did not automatically clean them up.
As a result, CPU utilization on the affected hypervisor increased unnecessarily.
Additionally, frequent instance deletions/migrations caused temporary discrepancies in instance counts during sync cycles, contributing to repeated warnings in the logs.
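Conceptually, each sync cycle reduces to a set comparison between the instance UUIDs reported by the hypervisor and those tracked in the Compute database. The following is a minimal Python sketch of that idea only; the UUIDs are hypothetical placeholders, and this is not the Nova implementation:

```python
# Sketch of the orphan-detection idea behind the power-state sync:
# anything running on the hypervisor but absent from the Compute DB
# is an orphan candidate. UUIDs below are hypothetical placeholders.

def find_orphans(hypervisor_uuids: set[str], db_uuids: set[str]) -> set[str]:
    """Return instance UUIDs running on the host but missing from the DB."""
    return hypervisor_uuids - db_uuids

# In practice these sets would come from libvirt/QEMU process inspection
# and from the Compute database, respectively.
running = {"uuid-a", "uuid-b", "uuid-c"}
tracked = {"uuid-a", "uuid-b"}
print(find_orphans(running, tracked))  # {'uuid-c'}
```

A non-empty result over several consecutive cycles is exactly the persistent mismatch described above.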
Diagnostics
Power State Synchronization Warnings
Compute service logs show mismatches between the instance count reported by the database and the count reported by the hypervisor:
These warnings indicate:
The presence of QEMU processes on the hypervisor that are not tracked in the database.
A consistent mismatch across multiple sync cycles.
Resolution
The configuration below is included as part of the standard deployment from PCD v2025.10-180 and higher.
To ensure Compute Service automatically cleans up orphaned QEMU processes, the following configuration was added to /opt/pf9/etc/nova/conf.d/nova_override.conf on all affected compute hosts:
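The override itself is not reproduced in this copy of the article. The Nova options that typically enable this cleanup are shown below as an assumption; verify the exact option names and values against your PCD release documentation before applying them:

```ini
# /opt/pf9/etc/nova/conf.d/nova_override.conf (assumed content -- verify
# against your PCD release before applying)
[DEFAULT]
# Reap (power off and delete) instances found running on the hypervisor
# that are marked deleted in the Compute database.
running_deleted_instance_action = reap
# How often (in seconds) to check for such instances; 1800 s matches the
# standard 30-minute sync cycle.
running_deleted_instance_poll_interval = 1800
```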
After adding the above change, restart the pf9-ostackhost service on the affected hypervisors (e.g. `systemctl restart pf9-ostackhost`). Restarting this service also cleans up any VMs that are stuck in the deleting state according to the Compute service database.
Effect of this configuration: when Nova detects an instance running on the hypervisor that does not exist in the Compute Service database, it automatically reaps the orphaned QEMU process. This:
Frees compute node CPU/memory resources.
Ensures hypervisor state aligns with the Nova DB state.
Removes orphaned VMs automatically within the standard 30-minute sync cycle.