VMHA is Stuck in ErrorRemoving state in the PCD GUI
Problem
After a host is added to the PCD cluster, the VMHA status in the cluster section changes to ErrorRemoving, and VMHA becomes non-functional for that cluster.
Environment
Private Cloud Director Virtualization - v2025.4 and Higher
Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
Component: VMHA
Cause
A stale compute service entry was still listed in Nova's service records, and the availability zone returned the same stale host entry. As a result, VMHA tried to use that host during setup, which caused an error and left VMHA stuck in the ErrorRemoving state.
Diagnostics
Info
SaaS customers: contact the Platform9 Support Team to confirm whether you are hitting the issue described in this article.
Review the VMHA logs for errors logged while disabling and re-enabling VMHA from the PCD UI. Capture the VMHA server logs from the hamgr pod running inside the affected region. The logs are located in /var/log/pf9/hamgr/.
Capture the errors reported in the hamgr logs.
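The log review above could be sketched as a small shell helper. It assumes shell access inside the hamgr pod and that errors are marked with ERROR or Traceback. Neither the log file names nor the exact message format are given in this article, so the function name scan_hamgr_errors and both patterns are illustrative assumptions.

```shell
# Hypothetical helper: print error-looking lines from every file under a log directory.
# Defaults to the hamgr log path named in this article.
scan_hamgr_errors() {
  log_dir="${1:-/var/log/pf9/hamgr}"
  # -R: recurse into the directory, -n: show line numbers, -E: extended regex.
  # The ERROR|Traceback pattern is an assumption about the log format.
  grep -R -n -E 'ERROR|Traceback' "$log_dir"
}
```

Calling scan_hamgr_errors with no argument scans the default path; pass a different directory to scan a copied log bundle.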
List the compute services and availability zones, and check whether either contains an entry for the HOST_UUID captured in the hamgr error logs. Review its State and Service Status. In this article's example, HOST2.EXAMPLE.COM is the decommissioned node that was not properly removed.
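A minimal sketch of the lookup described above, assuming the OpenStack CLI is available with admin credentials loaded. The helper name match_host is hypothetical; it filters the compute service list for the host identifier found in the hamgr errors. Whether Nova records the host by UUID or by hostname depends on the deployment, so pass whichever identifier appears in the Host column.

```shell
# Hypothetical helper: read "<ID> <Host> <Status> <State>" lines on stdin
# and print only the lines whose Host column equals the first argument.
match_host() {
  awk -v host="$1" '$2 == host'
}

# Intended usage (commented out because it needs a live cloud):
#   openstack compute service list --service nova-compute \
#     -f value -c ID -c Host -c Status -c State | match_host "<HOST_UUID>"
#
# Availability zones and the hosts they contain (shows Service Status):
#   openstack availability zone list --compute --long
```

An entry that is still listed for a decommissioned host, typically with State down, is the candidate stale record.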
Resolution
Identify the stale compute service entry in the compute service list. In this article's example, the service on HOST2.EXAMPLE.COM is down.
Delete the stale compute service entry.
Wait for VMHA to retry the operation automatically, or disable and re-enable VMHA to trigger a fresh reconcile attempt.
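The identify-and-delete steps above can be sketched as a dry run, assuming the same OpenStack CLI access. The helper print_delete_commands is hypothetical: it prints an openstack compute service delete command for each service reported as down instead of running it, because a host that is only temporarily down would also match and must not be deleted.

```shell
# Hypothetical helper: read "<ID> <Host> <State>" lines on stdin and print
# (not execute) a delete command for every service whose State is "down".
print_delete_commands() {
  awk '$3 == "down" { printf "openstack compute service delete %s  # host %s\n", $1, $2 }'
}

# Intended usage (commented out because it needs a live cloud):
#   openstack compute service list --service nova-compute \
#     -f value -c ID -c Host -c State | print_delete_commands
```

Review the printed commands and run only the one that targets the decommissioned host.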
Validation
Ensure the VMHA state transitions from ErrorRemoving to Enabled.
Confirm no additional stale hosts remain.
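To confirm that no stale hosts remain, a check along these lines could be run after the cleanup. It assumes that every compute service still listed is expected to be up; the helper name no_down_services is hypothetical. The VMHA state itself is checked in the PCD UI.

```shell
# Hypothetical helper: read "<ID> <Host> <State>" lines on stdin.
# Exits 0 when no service is "down"; otherwise names each offender and exits 1.
no_down_services() {
  awk '$3 == "down" { print "still down: " $2; bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Intended usage (commented out because it needs a live cloud):
#   openstack compute service list --service nova-compute \
#     -f value -c ID -c Host -c State | no_down_services
```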