High CPU Usage Due to Orphaned VM Processes on Hypervisor Nodes
Problem
In some environments, compute nodes may experience elevated CPU utilization caused by orphaned virtual machine (QEMU) processes. This occurs when certain instances are deleted or migrated at the control-plane level (Compute DB), but their corresponding QEMU processes continue to run on the hypervisor.
Symptoms include:
QEMU processes consuming CPU although the VM no longer exists in the database.
Mismatch between the number of instances reported in Nova DB versus running on the hypervisor.
Periodic warnings in nova.compute.manager logs during instance power-state synchronization.
Environment
Private Cloud Director Virtualization - v2025.6 and higher
Private Cloud Director Kubernetes - v2025.6 and higher
Self-Hosted Private Cloud Director Virtualization - v2025.6 and higher
Self-Hosted Private Cloud Director Kubernetes - v2025.6 and higher
Component: Compute Service
Cause
The Compute Service periodically runs a synchronization cycle in which it compares the instances running on the hypervisor against the entries in the Compute database.
In this case:
Several VMs were deleted or migrated, but their QEMU processes persisted on the source hypervisor.
These orphaned processes continued running because the default Nova behavior did not automatically clean them up.
As a result, CPU utilization on the affected hypervisor increased unnecessarily.
Additionally, frequent instance deletions/migrations caused temporary discrepancies in instance counts during sync cycles, contributing to repeated warnings in the logs.
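Conceptually, each sync cycle reduces to a set comparison between the instance UUIDs reported by the hypervisor and those tracked in the Compute database. The following is a minimal Python sketch of that idea only; the UUIDs are hypothetical placeholders, and this is not the Nova implementation:

```python
# Sketch of the orphan-detection idea behind the power-state sync:
# anything running on the hypervisor but absent from the Compute DB
# is an orphan candidate. UUIDs below are hypothetical placeholders.

def find_orphans(hypervisor_uuids: set[str], db_uuids: set[str]) -> set[str]:
    """Return instance UUIDs running on the host but missing from the DB."""
    return hypervisor_uuids - db_uuids

# In practice these sets would come from libvirt/QEMU process inspection
# and from the Compute database, respectively.
running = {"uuid-a", "uuid-b", "uuid-c"}
tracked = {"uuid-a", "uuid-b"}
print(find_orphans(running, tracked))  # {'uuid-c'}
```

A non-empty result over several consecutive cycles is exactly the persistent mismatch described above.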
Diagnostics
Power State Synchronization Warnings
Compute service logs show mismatches between the instance count reported by the database and the count reported by the hypervisor:
These warnings indicate:
The presence of QEMU processes on the hypervisor that are not tracked in the database.
A consistent mismatch across multiple sync cycles.
Resolution
The configuration below is included as part of the standard deployment from PCD v2025.10-180 and higher.
To ensure Compute Service automatically cleans up orphaned QEMU processes, the following configuration was added to /opt/pf9/etc/nova/conf.d/nova_override.conf on all affected compute hosts:
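The override itself is not reproduced in this copy of the article. The Nova options that typically enable this cleanup are shown below as an assumption; verify the exact option names and values against your PCD release documentation before applying them:

```ini
# /opt/pf9/etc/nova/conf.d/nova_override.conf (assumed content -- verify
# against your PCD release before applying)
[DEFAULT]
# Reap (power off and delete) instances found running on the hypervisor
# that are marked deleted in the Compute database.
running_deleted_instance_action = reap
# How often (in seconds) to check for such instances; 1800 s matches the
# standard 30-minute sync cycle.
running_deleted_instance_poll_interval = 1800
```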
After adding the above change, restart the pf9-ostackhost service on the affected hypervisors (e.g. `systemctl restart pf9-ostackhost`). Restarting this service also cleans up any VMs that are stuck in the deleting state according to the Compute service database.
Effect of this configuration: when Nova detects an instance running on the hypervisor that does not exist in the Compute Service database, it automatically reaps the orphaned QEMU process. This:
Frees compute node CPU/memory resources.
Ensures hypervisor state aligns with the Nova DB state.
Removes orphaned VMs automatically within the standard 30-minute sync cycle.