VM HA Heartbeat Failures Triggering VM Evacuations

Problem

  • VM HA heartbeat failures cause the vmha-agent to treat the host as unavailable, which in turn triggers a VM evacuation.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher

  • Component - VM HA

Diagnostics

  1. Check the VM HA agent log on the host at /var/log/pf9/pf9-vmha-agent.log. Verify whether it repeatedly reports errors similar to the examples below for multiple different hosts.

level=ERROR msg="unable to get metrics from libvirt exporter" error="Get \"http://10.x.x.x:9177/metrics\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
level=ERROR msg="unable to get metrics from libvirt exporter" error="Get \"http://10.x.x.y:9177/metrics\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
level=ERROR msg="unable to get metrics from libvirt exporter" error="Get \"http://10.x.x.z:9177/metrics\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
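
To see at a glance whether the timeouts involve several different hosts, the errors can be tallied per target address. The following is a minimal sketch, assuming the log lines match the format shown above; the function name count_exporter_timeouts is hypothetical and not part of the product.

```shell
#!/bin/sh
# Hypothetical helper: count libvirt exporter timeout errors per target address.
# Assumes the log format shown in the example lines above.
count_exporter_timeouts() {
    log_file="$1"
    grep 'unable to get metrics from libvirt exporter' "$log_file" \
        | grep 'context deadline exceeded' \
        | sed -n 's|.*http://\([^:/]*\):9177/metrics.*|\1|p' \
        | sort | uniq -c | sort -rn
}
```

Calling count_exporter_timeouts /var/log/pf9/pf9-vmha-agent.log prints one line per target address, prefixed by the number of timeouts logged for it. Several addresses with non-trivial counts match the pattern described in this step.
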
  2. Check connectivity to the libvirt exporter, which serves as the check of inter-host connectivity.

# sudo curl http://localhost:9177/metrics

This curl request should return a response within 5 seconds. If it times out or takes longer, it indicates increased latency or a potential connectivity issue.
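
To measure the response time rather than judge it by eye, curl can report the total request time. This is a minimal sketch: the helper names and the 15-second upper wait limit are assumptions chosen for illustration, while the 5-second threshold comes from the guidance above.

```shell
#!/bin/sh
# Hypothetical helpers for timing the libvirt exporter endpoint.
THRESHOLD_SECONDS=5    # expected response time, per the guidance above
MAX_WAIT_SECONDS=15    # assumption: upper bound so slow responses can still be measured

# Print the total request time in seconds. The exit status is curl's own;
# 28 means the --max-time limit was reached.
time_exporter_request() {
    url="${1:-http://localhost:9177/metrics}"
    curl -s -o /dev/null --max-time "$MAX_WAIT_SECONDS" -w '%{time_total}\n' "$url"
}

# Classify an elapsed time (seconds, may be fractional) against the threshold.
classify_latency() {
    awk -v t="$1" -v limit="$THRESHOLD_SECONDS" \
        'BEGIN { if (t + 0 <= limit + 0) print "ok"; else print "slow" }'
}
```

For example, classify_latency "$(time_exporter_request)" prints ok when the exporter answered within 5 seconds and slow otherwise. Passing a different URL as the first argument to time_exporter_request times another endpoint instead of the local one.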

  3. Check connectivity with the HA Manager in the management plane from the host.

Resolution

  • If both conditions 1 and 2 are met for a host, the HA agent may interpret this as degraded or lost inter-host connectivity. Elevated latency or network disruptions can lead the agent to incorrectly classify the host as unavailable. Since the polling cycle must complete within 5 seconds, any delay beyond this threshold should prompt an investigation into network communication between hosts, including checks for latency, packet loss, or traffic being blocked.

  • If condition 3 is met, it suggests that connectivity between the host and the management plane (used by the HA Manager) is also failing. Resolving this connectivity failure is required to re-establish proper HA functionality and prevent further service disruption.
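
As a starting point for the packet-loss check between hosts mentioned above, the loss percentage can be read from the ping summary. This is a rough sketch that assumes the Linux iputils ping output format; the function names and the probe count and timeout values are illustrative assumptions.

```shell
#!/bin/sh
# Extract the packet-loss percentage from an iputils-style ping summary on stdin.
parse_packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical wrapper: send 5 probes to a peer host (2-second reply timeout each)
# and print the reported packet-loss percentage.
peer_packet_loss() {
    ping -c 5 -W 2 "$1" 2>/dev/null | parse_packet_loss
}
```

Running peer_packet_loss with the address of another host in the cluster prints a single number. Any sustained non-zero value between hosts is worth investigating alongside the exporter timeouts.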
