VMHA Is Stuck in the ErrorRemoving State in the PCD GUI

Problem

  • After a host is added to the PCD cluster, the VMHA status in the cluster section changes to ErrorRemoving.
  • VMHA becomes non-functional after a node is added to the respective cluster.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher
  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
  • VMHA

Cause

  • A stale compute-service entry remained in Nova's service records, and the availability zone still returned the same stale host. VMHA tried to use that host during setup, which caused an error and left VMHA stuck in the ErrorRemoving state.

Diagnostics

SaaS customers: contact the Platform9 Support Team to validate whether you are hitting the issue described in this article.

  • Review the VMHA logs for errors logged while disabling and re-enabling VMHA from the PCD UI. Capture the VMHA server logs from the hamgr pod running in the affected region. The logs are located in /var/log/pf9/hamgr/.
Command
  • Capture the errors in the hamgr logs.
hamgr.log
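As a minimal sketch of the log review described above, the helper below prints the most recent error-like lines from the hamgr logs. It assumes a shell inside the hamgr pod and that the log files under /var/log/pf9/hamgr/ end in .log (hamgr.log is the only name given in this article). The function name scan_hamgr_errors and the match patterns are illustrative assumptions, not part of the product.

```shell
# Hypothetical helper: print the most recent error-like lines from the hamgr logs.
# Assumes it runs where the log directory is readable (inside the hamgr pod).
#   $1: log directory (defaults to the path given in this article)
#   $2: maximum number of matching lines to print (defaults to 50)
scan_hamgr_errors() {
    # -i: case-insensitive, -E: extended regex, -H: prefix each match with its file name
    grep -iEH 'error|traceback|exception' "${1:-/var/log/pf9/hamgr}"/*.log \
        | tail -n "${2:-50}"
}
```

Running scan_hamgr_errors with no arguments scans the default directory. Look for the host UUID or hostname that VMHA failed on; that value is what the next step searches for.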
  • List the compute services and availability zones, and check whether either contains an entry for the HOST_UUID captured in the hamgr error logs. Review its State and Service Status. In the sample output, HOST2.EXAMPLE.COM is the decommissioned node that was not properly removed.
Command
Command
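The lookup above can be sketched with the standard OpenStack CLI, assuming the client is installed and admin credentials for the affected region are loaded. The helper name find_host_entries is hypothetical, and these are not necessarily the exact commands Platform9 uses.

```shell
# Hypothetical helper: show compute-service and availability-zone rows for one host.
# Assumes the OpenStack CLI is installed and admin credentials are loaded.
#   $1: HOST_UUID or hostname taken from the hamgr error
find_host_entries() {
    echo "== compute services =="
    # Table columns: ID, Binary, Host, Zone, Status, State, Updated At
    openstack compute service list --service nova-compute | grep -iF -- "$1"
    echo "== availability zones =="
    # --long adds the Host Name, Service Name and Service Status columns
    openstack availability zone list --compute --long | grep -iF -- "$1"
}
```

For example, find_host_entries HOST2.EXAMPLE.COM prints only the rows that mention that host. A host that shows State down in the first list and still appears in the second is a candidate stale entry.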

Resolution

  • Identify the stale compute service entry from the output of the command below. In the sample output, the service on HOST2.EXAMPLE.COM is down.
command
  1. Delete the stale service using the command below:
command
  2. Wait for VMHA to retry the operation automatically, or disable and re-enable VMHA to trigger a fresh reconcile attempt.
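The identification and deletion steps above can be sketched as a single guarded helper, assuming the standard OpenStack CLI with admin credentials loaded. The function name delete_stale_compute_service and the refusal to delete a service that is not down are illustrative assumptions; the article itself only calls for deleting the stale service.

```shell
# Hypothetical helper: delete the nova-compute service record for a stale host.
# Refuses to act unless the service exists and its State is "down".
#   $1: hostname of the decommissioned node, e.g. HOST2.EXAMPLE.COM
delete_stale_compute_service() {
    host="$1"
    # Fetch "ID State" for this host's nova-compute service.
    row=$(openstack compute service list --service nova-compute --host "$host" \
            -f value -c ID -c State)
    if [ -z "$row" ]; then
        echo "no nova-compute service found for $host" >&2
        return 1
    fi
    service_id=${row%% *}   # first field: service ID
    state=${row##* }        # last field: up or down
    if [ "$state" != "down" ]; then
        echo "refusing to delete: service on $host is $state" >&2
        return 1
    fi
    openstack compute service delete "$service_id"
}
```

Calling delete_stale_compute_service HOST2.EXAMPLE.COM removes the record only when that host's service is reported down. Confirm that the node really is decommissioned first, because a host that is only temporarily offline is also reported as down.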

Validation

  • Ensure the VMHA state transitions from ErrorRemoving to Enabled.
  • Confirm that no additional stale hosts remain.
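One way to check for remaining stale hosts is to list every nova-compute service that is still reported down. This is a sketch that assumes the standard OpenStack CLI with admin credentials loaded; the helper name list_down_compute_services is hypothetical.

```shell
# Hypothetical helper: print the hostname of every nova-compute service
# whose State is "down". No output means no down services remain.
list_down_compute_services() {
    openstack compute service list --service nova-compute \
        -f value -c Host -c State | awk '$2 == "down" { print $1 }'
}
```

Any hostname printed here still needs review. A down service is stale only if the node has actually been decommissioned.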