VMHA is Stuck in ErrorRemoving state in the PCD GUI
Problem
- After adding a host to the PCD cluster, the VMHA status in the cluster section changed to Error Removing.
- VMHA becomes non-functional after adding a node to the respective cluster.
Environment
- Private Cloud Director Virtualization - v2025.4 and Higher
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
- VMHA
Cause
- A stale compute-service entry was still listed in Nova's service records, and the same stale host entry was being returned by the availability zone. Because of this, VMHA tried to use that host during setup, which caused an error and left VMHA stuck in the Error Removing state.
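As a quick way to see whether such a stale record exists, you can filter the Nova service list for nova-compute entries that are reported as down (a sketch using standard openstack CLI output formatting; the exact values depend on your deployment):

$ # List nova-compute services and keep only the ones reported as down
$ openstack compute service list --service nova-compute -f value -c ID -c Host -c State | awk '$3 == "down"'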
Diagnostics
For SaaS customers, contact the Platform9 Support Team to validate whether you are hitting the issue described in this article.
- Review the VMHA logs for any errors logged while performing the disable/enable action on VMHA from the PCD UI. Capture the VMHA server logs from the hamgr pod running inside the affected region. The logs are present inside /var/log/pf9/hamgr/.
$ kubectl exec -it deploy/hamgr -n <REGION_NAMESPACE> -- bash
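If you prefer to pull the relevant errors without opening an interactive shell, a grep run inside the pod works as well (a sketch; the exact log file name under /var/log/pf9/hamgr/ may differ in your deployment, so a wildcard is used):

$ # Search the hamgr logs for errors and tracebacks from outside the pod
$ kubectl exec deploy/hamgr -n <REGION_NAMESPACE> -- sh -c 'grep -iE "error|traceback" /var/log/pf9/hamgr/*.log | tail -n 100'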
- Capture the errors in the hamgr logs.

vmha.hamgr.providers.nova ERROR Disable HA request failed for cluster cluster-name
Traceback (most recent call last):
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://bbmaster.region-name.svc.cluster.local:8082/v1/hosts/[HOST_UUID]/apps
vmha.hamgr.db.api WARNING Task state being updated from removing to error-removing
vmha.hamgr.providers.nova INFO process enable request for Availability zone REGION_NAME
vmha.hamgr.providers.nova WARNING Cluster REGION_NAME is running task error-removing, cannot enable
vmha.__main__ INFO Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/eventlet/wsgi.py", line 614, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 129, in __call__

- List the compute services and availability zones, and validate whether there is any entry for the HOST_UUID captured in the hamgr error logs. Review its State and Service Status. In the sample output, HOST2.EXAMPLE.COM is the decommissioned node that was not properly removed.
$ openstack compute service list --service nova-compute
# sample output
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| ID                 | Binary       | Host                | Zone   | Status   | State | Updated At  |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| [HOST1_SERVICE_ID] | nova-compute | [HOST1.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
| [HOST2_SERVICE_ID] | nova-compute | [HOST2.EXAMPLE.COM] | [zone] | disabled | down  | [TIMESTAMP] |
| [HOST3_SERVICE_ID] | nova-compute | [HOST3.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+

$ openstack availability zone list --compute --long
# sample output
+-----------+-------------+---------------+---------------------+--------------+----------------+
| Zone Name | Zone Status | Zone Resource | Host Name           | Service Name | Service Status |
+-----------+-------------+---------------+---------------------+--------------+----------------+
| [zone]    | available   |               | [HOST1.EXAMPLE.COM] | nova-compute | enabled :-)    |
| [zone]    | available   |               | [HOST2.EXAMPLE.COM] | nova-compute | enabled XXX    |
| [zone]    | available   |               | [HOST3.EXAMPLE.COM] | nova-compute | enabled :-)    |
+-----------+-------------+---------------+---------------------+--------------+----------------+
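If you need to map the HOST_UUID from the hamgr error back to a hostname, the hypervisor list can help (a sketch; on newer Nova API microversions the hypervisor ID is a UUID, on older ones it is an integer, so the match may not be direct in every environment):

$ # Show hypervisor IDs alongside their hostnames
$ openstack hypervisor list -c ID -c "Hypervisor Hostname"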
Resolution
- Identify the stale compute service entry from the output of the command below. In the sample output, the service on HOST2.EXAMPLE.COM is down.
$ openstack compute service list
# sample output
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| ID                 | Binary       | Host                | Zone   | Status   | State | Updated At  |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| [HOST1_SERVICE_ID] | nova-compute | [HOST1.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
| [HOST2_SERVICE_ID] | nova-compute | [HOST2.EXAMPLE.COM] | [zone] | disabled | down  | [TIMESTAMP] |
| [HOST3_SERVICE_ID] | nova-compute | [HOST3.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+

- Delete the stale service using the below command:
$ openstack compute service delete <HOST2_SERVICE_ID>

- Wait for VMHA to retry the operation automatically, or disable and re-enable VMHA to trigger a fresh reconcile attempt.
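After deleting the stale entry, it can help to confirm the record is gone before waiting on the retry (a sketch; HOST2.EXAMPLE.COM stands in for the decommissioned node from the sample output):

$ # The decommissioned host should no longer appear in the service list
$ openstack compute service list --service nova-compute | grep HOST2.EXAMPLE.COM || echo "stale entry removed"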
Validation
- Ensure the VMHA state transitions from Error Removing to Enabled.
- Confirm no additional stale hosts remain.
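A quick way to double-check the second point from the CLI is to confirm that no nova-compute service is still reported as down (a sketch; empty output means no stale hosts remain):

$ # Expect no output if every remaining nova-compute service is up
$ openstack compute service list --service nova-compute -f value -c Host -c State | awk '$2 == "down"'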