Virtual Machine High Availability (VM HA)
In this document, you will learn about Private Cloud Director Virtual Machine High Availability (VM HA), a feature that automatically detects physical host failures within a cluster and restarts the affected VMs on other healthy hosts in the same cluster.
Introduction
Hardware, firmware, or network issues can take a virtualization host offline with little warning. Without an automated recovery mechanism, every VM on that host remains down until an operator intervenes, an outage long enough to violate most service-level objectives.
Virtual Machine High Availability (VM HA) protects against this risk. The process is designed to be automatic and requires minimal manual intervention during a failure event:
- Continuous Host Monitoring: The VM HA service constantly monitors the health and responsiveness of all hypervisor hosts participating in an HA-enabled virtualized cluster.
- Failure Detection: If a host stops responding (due to hardware failure, operating system crash, or certain network isolation scenarios), the system detects the failure.
- Automatic VM Recovery: Upon confirmation of a host failure, which involves verifying the failure on both the management plane and cluster hosts, VM HA automatically initiates the process of restarting any VMs running on the failed host. These VMs are powered on using available resources on the remaining healthy hosts within the cluster.
After recovery, complementary features such as Dynamic Resource Rebalancing (DRR) can redistribute load to restore optimal balance across the cluster, ensuring sustained performance.
Benefits of VM HA
Enabling VM HA in your Private Cloud Director environment delivers the following benefits:
- Minimized Downtime: Automatically restarts or evacuates VMs when a host fails, reducing mean-time-to-recovery from hours to minutes.
- Service Continuity: Keeps business-critical applications online even during unexpected infrastructure outages.
- Operational Efficiency: Eliminates the need for round-the-clock manual monitoring and intervention, freeing administrators for higher-value tasks.
- Policy-Driven Control: Respects host aggregates, affinity/anti-affinity rules, and VM-specific settings, allowing you to decide how each workload is treated during failover.
- Seamless Interoperation: Works in concert with DRR and other Private Cloud Director services to maintain both availability and resource efficiency after a failure event.
VM HA Pre-Requisites
VM HA always operates in the context of a virtualized cluster. You need to turn it on at the cluster level. Once turned on, VM HA applies to all virtual machines within the cluster.
The cluster must contain a minimum of two hosts with the hypervisor role.
Shared storage is required for VM HA. For VM HA to work at the cluster level:
- All VMs should use a block storage volume as the root disk (non-ephemeral root disk), or
- If any VMs use ephemeral storage for the root disk, then Ephemeral Shared Storage should be used for all hosts in the virtualized cluster.
- When Ephemeral Shared Storage is not used, any VMs using ephemeral root disk will get rebuilt as part of recovery on another host.
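As a rough illustration of the storage rules above, the following sketch lists which VMs would be rebuilt during recovery. The `VM` class, its fields, and the function name are hypothetical and are not part of any Private Cloud Director API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VM:
    # Hypothetical model of a VM; not a Private Cloud Director object.
    name: str
    root_on_volume: bool  # True if the root disk is a block storage volume


def vms_rebuilt_on_recovery(vms: List[VM], ephemeral_shared_storage: bool) -> List[str]:
    """Return the names of VMs that would be rebuilt on another host."""
    if ephemeral_shared_storage:
        # Ephemeral root disks sit on shared storage, so no VM is rebuilt.
        return []
    # Without Ephemeral Shared Storage, only ephemeral-root VMs are rebuilt.
    return [vm.name for vm in vms if not vm.root_on_volume]


if __name__ == "__main__":
    fleet = [VM("db-1", root_on_volume=True), VM("cache-1", root_on_volume=False)]
    print(vms_rebuilt_on_recovery(fleet, ephemeral_shared_storage=False))
```

A check like this could be run before enabling VM HA to find workloads whose root disk contents would not survive a host failure.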
VM HA becomes active only when the cluster contains at least two healthy hosts.
VM HA uses the VM Evacuation operation behind the scenes. VM Evacuation Prerequisites must be met for the operation to succeed.
Configure VM HA for a Cluster
You can enable VM HA when you create a new cluster by navigating to Infrastructure > Clusters > Add Cluster and turning on the VM High Availability toggle.
Alternatively, you can enable VM HA on an existing cluster by editing it and turning on the VM High Availability toggle.
On the Clusters page, you will see one of the following VM HA statuses for each cluster:
- Disabled: VM HA is disabled for the cluster.
- Waiting: VM HA is enabled but not yet active for the cluster. This occurs when fewer than two hosts with the hypervisor role have been assigned to the cluster. Once two or more hypervisor hosts are added to the cluster, the status changes to Active.
- Active: VM HA is enabled and active for the cluster.
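The three statuses follow from two inputs: whether the toggle is on, and how many hypervisor hosts the cluster has. A minimal sketch of that rule, using hypothetical names (`HAStatus`, `vm_ha_status`) that do not come from the product:

```python
from enum import Enum


class HAStatus(Enum):
    DISABLED = "Disabled"
    WAITING = "Waiting"
    ACTIVE = "Active"


MIN_HYPERVISOR_HOSTS = 2  # minimum stated in the prerequisites


def vm_ha_status(ha_enabled: bool, hypervisor_host_count: int) -> HAStatus:
    """Return the status the Clusters page would be expected to show."""
    if not ha_enabled:
        return HAStatus.DISABLED
    if hypervisor_host_count < MIN_HYPERVISOR_HOSTS:
        # Enabled, but too few hypervisor hosts for VM HA to be active.
        return HAStatus.WAITING
    return HAStatus.ACTIVE


if __name__ == "__main__":
    for enabled, hosts in [(False, 3), (True, 1), (True, 2)]:
        print(enabled, hosts, vm_ha_status(enabled, hosts).value)
```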
How VM HA Works
Private Cloud Director Virtual Machine High Availability (VM HA) provides automated protection against host failures by continuously monitoring host health and, when needed, evacuating affected virtual machines (VMs) to healthy hosts with minimal disruption. Three cooperating services enable the automated workflow:
- High Availability Manager: Central coordinator that runs in the Control Plane. It watches for clusters with VM HA enabled, gathers health reports, confirms host failures, and issues evacuation requests.
- Host Agents: Daemons that run on every hypervisor host. Each agent probes the liveness of its peers and reports its findings to the High Availability Manager at regular intervals.
- VM Evacuation Service: Orchestrator that runs in the Control Plane. It receives confirmed host-down events from the High Availability Manager and evacuates each impacted VM to a suitable target host.
End-to-end Flow
Cluster discovery: The High Availability Manager polls the infrastructure to discover clusters where VM HA is enabled. For each VM HA-enabled cluster, it verifies the presence of the pf9-ha-slave role on every hypervisor host. Clusters with fewer than two hosts are ignored until additional hosts join.
Peer list distribution: Every 59 seconds, each Host Agent requests a peer list from the High Availability Manager. The list is a randomized subset of hosts in the same cluster, helping to spread traffic and avoid single‑point bias.
Distributed health probing: Using the peer list, the Host Agent performs a liveness check on its peer hosts via the libvirt exporter. This check is entirely agent-to-agent, and no central polling path is involved.
Status aggregation & reporting: Every 129 seconds, the agent posts an aggregated view of peer health to the High Availability Manager.
Failure detection and correlation: When High Availability Manager receives a report that a host appears down, it cross‑checks the same host’s status in reports from other agents to avoid false positives while ensuring fast reaction to true outages. Once a host is confirmed down via peer agent reports, a 150‑second cooldown timer starts:
- If the host recovers before the timer expires, the event is cleared, and no action is taken. This prevents routine reboots from triggering VM evacuations.
- If the host remains down when the timer ends, the High Availability Manager confirms the failure and emits a host-down notification to the VM Evacuation Service.
Automated VM evacuation: Upon receiving the host‑down notification, the VM Evacuation Service verifies the host’s state, collects all resident VMs, and evacuates them one VM at a time to a healthy host in the cluster. Target selection honors existing placement constraints (such as aggregates or affinity rules).
Continuous protection loop: After the evacuation completes, normal monitoring resumes. Because Host Agents keep probing all remaining hosts, as well as the failed host if it comes back online, VM HA maintains an always-on protection loop with no manual intervention. Note that when a failed host comes back online, VM HA does not automatically migrate back the VMs that originally ran on it. The Dynamic Resource Rebalancing (DRR) service makes VM balancing decisions independently if resource contention occurs on any host.
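The failure detection and cooldown steps in this flow can be sketched as a small state tracker. This is a simplified model under stated assumptions, not the product's implementation: the class name, the `min_reporters` quorum, and the rule that no reporting peer may see the host as up are all hypothetical. Only the 150-second cooldown comes from the description above.

```python
from typing import Dict

COOLDOWN_SECONDS = 150  # cooldown stated in the flow above


class FailureTracker:
    """Hypothetical model of failure correlation; not the product's code."""

    def __init__(self, min_reporters: int = 2, cooldown: float = COOLDOWN_SECONDS):
        self.min_reporters = min_reporters  # assumed quorum size
        self.cooldown = cooldown
        self._down_since: Dict[str, float] = {}  # host -> time the timer started

    def observe(self, host: str, peer_sees_up: Dict[str, bool], now: float) -> bool:
        """Process one round of peer reports about `host`.

        Returns True when a host-down notification should be emitted.
        """
        votes = list(peer_sees_up.values())
        # Assumed rule: enough peers reported, and none of them sees the host up.
        suspected = len(votes) >= self.min_reporters and not any(votes)
        if not suspected:
            # Host recovered (or reports disagree): clear any pending event.
            self._down_since.pop(host, None)
            return False
        started = self._down_since.setdefault(host, now)
        if now - started >= self.cooldown:
            # Still down when the timer ends: confirm and reset tracking.
            del self._down_since[host]
            return True
        return False


if __name__ == "__main__":
    tracker = FailureTracker()
    all_down = {"host-2": False, "host-3": False}
    for t in (0, 100, 150):
        print(t, tracker.observe("host-1", all_down, now=t))
```

In this model, a host that reboots and answers a peer probe before the cooldown ends clears the pending event, so a later failure starts a fresh timer.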
VM HA Interoperation with other services
This section describes how VM HA interoperates with other services configured for your cluster.
Host Aggregates
VM HA will honor Host Aggregates and migrate VMs to another host from the same host aggregate.
Limitations:
- All hosts from the Host Aggregate must belong to a single cluster and not span multiple clusters.
- If the host aggregate has a single host that has gone down, VM HA will not find a suitable target to migrate the VMs, and the VMs will enter an error state.
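The aggregate constraint above amounts to intersecting the aggregate's members with the healthy hosts. A minimal sketch, where the function name and the host naming are illustrative assumptions:

```python
from typing import Iterable, List, Set


def aggregate_targets(failed_host: str,
                      aggregate_hosts: Iterable[str],
                      healthy_hosts: Set[str]) -> List[str]:
    """Return healthy hosts in the same aggregate, excluding the failed host."""
    return sorted(
        host for host in aggregate_hosts
        if host != failed_host and host in healthy_hosts
    )


if __name__ == "__main__":
    # Single-host aggregate: no candidate exists, so the VMs would end up in error.
    print(aggregate_targets("host-1", ["host-1"], {"host-2", "host-3"}))
    # Multi-host aggregate: the remaining healthy members are candidates.
    print(aggregate_targets("host-1", ["host-1", "host-2"], {"host-2", "host-3"}))
```

An empty result corresponds to the second limitation: with no other host in the aggregate, there is no valid evacuation target.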
DRR
DRR and VM HA are designed to interoperate. A host failure event may occur while DRR is actively rebalancing VMs, either from the same host or from other hosts in the cluster. When this happens:
- VM HA will detect the host failure and initiate VM evacuations.
- The VM evacuations may result in cluster imbalance.
- DRR will then detect the imbalance during either the current run or the next run.
- DRR will redistribute the load across the cluster to address the imbalance.
VMs with Hard Affinity or Anti-Affinity Rules
- Hard affinity: VM HA will identify a host with sufficient capacity to host all VMs in the affinity group and migrate all VMs sequentially to the target host. If VM HA cannot find a suitable target host, the evacuation will fail, and the failure status will be reported on the UI.
- Soft affinity: VM HA will identify a host with sufficient capacity to host all VMs in the affinity group and migrate all VMs sequentially to the target host. If VM HA cannot find a suitable target host, VM HA will migrate VMs to available hosts, potentially violating the soft affinity policy. Any policy violations will be reported on the UI.
- Hard anti-affinity: VM HA will attempt to find target hosts that satisfy the anti-affinity policy for each VM to be migrated, such that all VMs that are part of the hard anti-affinity policy will be placed on separate hosts. If VM HA cannot find a suitable target host, the evacuation will fail, and the failure status will be reported on the UI.
- Soft anti-affinity: VM HA will attempt to find target hosts that satisfy the anti-affinity policy for each VM to be migrated, such that all VMs that are part of the soft anti-affinity policy will be placed on separate hosts. If the only available target host violates the soft anti-affinity policy, the VM will be evacuated to that host, and the policy violation will be displayed on the UI. If VM HA cannot find any target host, the evacuation will fail, and the failure status will be reported on the UI.
Note that if VM evacuation fails in any of the scenarios listed above, the VM might still be shown on the UI as active.
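The four cases above differ only in what happens when no policy-compliant target exists. The following sketch summarizes that decision. The enum, the function, and the outcome strings are hypothetical, and the two boolean inputs stand in for whatever placement checks the scheduler actually performs.

```python
from enum import Enum


class Policy(Enum):
    HARD_AFFINITY = "hard affinity"
    SOFT_AFFINITY = "soft affinity"
    HARD_ANTI_AFFINITY = "hard anti-affinity"
    SOFT_ANTI_AFFINITY = "soft anti-affinity"


HARD_POLICIES = {Policy.HARD_AFFINITY, Policy.HARD_ANTI_AFFINITY}


def evacuation_outcome(policy: Policy, compliant_target: bool, any_target: bool) -> str:
    """Summarize the expected result of evacuating a VM under a policy.

    compliant_target: a host exists that satisfies the policy.
    any_target: a host with capacity exists, ignoring the policy.
    """
    if compliant_target:
        return "evacuated"
    if policy in HARD_POLICIES or not any_target:
        # Hard policies are never violated; soft ones still need some host.
        return "failed"
    return "evacuated with policy violation"


if __name__ == "__main__":
    for policy in Policy:
        print(policy.value, "->", evacuation_outcome(policy, False, True))
```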
VM States
VM states will be maintained as VM HA brings up VMs on the new host (for example, a VM in shutdown status will be migrated to the new host in the same shutdown status), with the following exception:
- VMs in a suspended state will be brought up on the new host as active. This is a known issue and will be addressed in a future release.
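Expressed as a mapping, the state-preservation rule and its one exception look like this. The state strings mirror the terms used above, and the function itself is only an illustration:

```python
def state_after_recovery(state_before: str) -> str:
    """Return the VM state expected on the new host after VM HA recovery."""
    if state_before == "suspended":
        # Known issue described above: suspended VMs come up as active.
        return "active"
    # All other states are preserved.
    return state_before


if __name__ == "__main__":
    for state in ("active", "shutdown", "suspended"):
        print(state, "->", state_after_recovery(state))
```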
VMs with Special Properties
- Virtual TPM-enabled VMs: VM HA does not currently have the ability to live migrate VMs with Virtual TPM enabled. This ability is coming soon.
- Zero flavor VMs: VM HA will evacuate VMs created with zero flavor.
- VMs with hot-added CPU or memory: VM HA will evacuate VMs that have hot-added CPU or memory resources.
- Resized VMs: Evacuation will fail for a VM that has been resized but whose resize has not yet been confirmed.