VMHA is Stuck in ErrorRemoving state in the PCD GUI
Problem
- After adding a host to the PCD cluster, the VMHA status in the cluster section changed to Error Removing.
- VMHA becomes non-functional after adding a node to the respective cluster.
Environment
- Private Cloud Director Virtualization - v2025.4 and Higher
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
- VMHA
Cause
- A stale compute-service entry was still listed in Nova's service records, and the same stale host entry was being returned by the availability zone. Because of this, VMHA tried to use that host during setup, which caused an error and left VMHA stuck in the Error Removing state.
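As a quick way to see whether such a stale record exists, you can filter the Nova service list for nova-compute entries that are reported as down (a sketch using standard openstack CLI output formatting; the exact values depend on your deployment):

$ # List nova-compute services and keep only the ones reported as down
$ openstack compute service list --service nova-compute -f value -c ID -c Host -c State | awk '$3 == "down"'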
Diagnostics
For SaaS customers, contact the Platform9 Support Team to validate whether you are hitting the issue described in this article.
- Review the VMHA logs for any errors logged while performing the disable/enable action on VMHA from the PCD UI. Capture the VMHA server logs from the hamgr pod running inside the affected region. The logs are present inside /var/log/pf9/hamgr/.
$ kubectl exec -it deploy/hamgr -n <REGION_NAMESPACE> -- bash
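If you prefer to pull the relevant errors without opening an interactive shell, a grep run inside the pod works as well (a sketch; the exact log file name under /var/log/pf9/hamgr/ may differ in your deployment, so a wildcard is used):

$ # Search the hamgr logs for errors and tracebacks from outside the pod
$ kubectl exec deploy/hamgr -n <REGION_NAMESPACE> -- sh -c 'grep -iE "error|traceback" /var/log/pf9/hamgr/*.log | tail -n 100'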
- Capture the errors in the hamgr logs.

vmha.hamgr.providers.nova ERROR Disable HA request failed for cluster cluster-name
Traceback (most recent call last):
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://bbmaster.region-name.svc.cluster.local:8082/v1/hosts/[HOST_UUID]/apps
vmha.hamgr.db.api WARNING Task state being updated from removing to error-removing
vmha.hamgr.providers.nova INFO process enable request for Availability zone REGION_NAME
vmha.hamgr.providers.nova WARNING Cluster REGION_NAME is running task error-removing, cannot enable
vmha.__main__ INFO Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/eventlet/wsgi.py", line 614, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 129, in __call__

- List the compute services and availability zones, and validate whether there is any entry for the HOST_UUID captured in the hamgr error logs. Review its State and Service Status. In the sample output, HOST2.EXAMPLE.COM is the decommissioned node that was not properly removed.
$ openstack compute service list --service nova-compute
# sample output
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| ID                 | Binary       | Host                | Zone   | Status   | State | Updated At  |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| [HOST1_SERVICE_ID] | nova-compute | [HOST1.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
| [HOST2_SERVICE_ID] | nova-compute | [HOST2.EXAMPLE.COM] | [zone] | disabled | down  | [TIMESTAMP] |
| [HOST3_SERVICE_ID] | nova-compute | [HOST3.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+

$ openstack availability zone list --compute --long
# sample output
+-----------+-------------+---------------+---------------------+--------------+----------------+
| Zone Name | Zone Status | Zone Resource | Host Name           | Service Name | Service Status |
+-----------+-------------+---------------+---------------------+--------------+----------------+
| [zone]    | available   |               | [HOST1.EXAMPLE.COM] | nova-compute | enabled :-)    |
| [zone]    | available   |               | [HOST2.EXAMPLE.COM] | nova-compute | enabled XXX    |
| [zone]    | available   |               | [HOST3.EXAMPLE.COM] | nova-compute | enabled :-)    |
+-----------+-------------+---------------+---------------------+--------------+----------------+
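If you need to map the HOST_UUID from the hamgr error back to a hostname, the hypervisor list can help (a sketch; on newer Nova API microversions the hypervisor ID is a UUID, on older ones it is an integer, so the match may not be direct in every environment):

$ # Show hypervisor IDs alongside their hostnames
$ openstack hypervisor list -c ID -c "Hypervisor Hostname"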
Resolution
- Identify the stale compute service entry from the output of the command below. In the sample output, the service on HOST2.EXAMPLE.COM is down.
$ openstack compute service list
# sample output
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| ID                 | Binary       | Host                | Zone   | Status   | State | Updated At  |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| [HOST1_SERVICE_ID] | nova-compute | [HOST1.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
| [HOST2_SERVICE_ID] | nova-compute | [HOST2.EXAMPLE.COM] | [zone] | disabled | down  | [TIMESTAMP] |
| [HOST3_SERVICE_ID] | nova-compute | [HOST3.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+

- Delete the stale service using the below command:
$ openstack compute service delete <HOST2_SERVICE_ID>

- Wait for VMHA to retry the operation automatically, or disable and re-enable VMHA to trigger a fresh reconcile attempt.
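After deleting the stale entry, it can help to confirm the record is gone before waiting on the retry (a sketch; HOST2.EXAMPLE.COM stands in for the decommissioned node from the sample output):

$ # The decommissioned host should no longer appear in the service list
$ openstack compute service list --service nova-compute | grep HOST2.EXAMPLE.COM || echo "stale entry removed"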
Validation
- Ensure the VMHA state transitions from Error Removing to Enabled.
- Confirm no additional stale hosts remain.
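A quick way to double-check the second point from the CLI is to confirm that no nova-compute service is still reported as down (a sketch; empty output means no stale hosts remain):

$ # Expect no output if every remaining nova-compute service is up
$ openstack compute service list --service nova-compute -f value -c Host -c State | awk '$2 == "down"'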