Deep Dive: Virtual Machine High Availability
In any enterprise-grade cloud offering, Virtual Machine High Availability (HA) is a must-have feature. Until now, HA features were available only on virtualization platforms such as VMware and Hyper-V. With Platform9 Managed OpenStack, customers can now use the same capability in their KVM environments as well.
To try out this new capability, please contact our Support team.
Architecture
Admins can enable HA per availability zone (AZ). An AZ is a logical construct defined by OpenStack Nova: it is created by attaching specific metadata to a host aggregate. All hypervisors belong to the default ‘nova’ availability zone. Tagging a host aggregate with the ‘availability_zone’ metadata key creates an availability zone with that name, and all hosts in the aggregate are implicitly moved from the default ‘nova’ zone to the user-specified zone. A hypervisor can belong to multiple aggregates, but it can belong to only a single AZ at any given time. Platform9 currently supports HA at the AZ level, where a given AZ maps to a single host aggregate. Hosts that are added to or removed from such an aggregate are automatically protected against failures by reconfiguring the HA cluster as needed. Cloud admins can enable HA on an availability zone as shown below.
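For illustration, here is a minimal sketch of creating an AZ by tagging a host aggregate, using the openstacksdk Python library. The cloud name, aggregate name, zone name, and hostnames are hypothetical; exact calls may vary with your SDK version.

```python
# Sketch: create a host aggregate, tag it as an availability zone, add hosts.
# Assumes openstacksdk is installed and clouds.yaml defines a cloud 'mycloud'.
# Names ('ha-aggregate', 'zone-1', hypervisor-0X) are hypothetical examples.
import openstack

conn = openstack.connect(cloud='mycloud')

# Creating the aggregate with an availability_zone attaches the
# 'availability_zone' metadata key, which is what defines the AZ.
aggregate = conn.compute.create_aggregate(
    name='ha-aggregate',
    availability_zone='zone-1',
)

# Hosts added to the aggregate move from the default 'nova' zone to 'zone-1'.
for host in ('hypervisor-01', 'hypervisor-02', 'hypervisor-03'):
    conn.compute.add_host_to_aggregate(aggregate, host)
```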
Failure Recovery
Consul uses Serf, a scalable gossip protocol, for failure detection, and Raft, a distributed consensus protocol, to agree on cluster state. The Consul service runs on all nodes that are part of the HA cluster, each assuming either the client or the server role, and all nodes participate in failure detection. Once a host crashes or stops responding, the cluster state is updated via consensus, and the failure is reported to the OpenStack Masakari service running on the Platform9 controller, which orchestrates the recovery of the workloads that were running on that host.
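As a rough illustration, you can observe what the Consul cluster knows about node health through its standard HTTP API. This sketch assumes a Consul agent listening on localhost port 8500 and the Python requests library.

```python
# Sketch: inspect Consul cluster membership and failed nodes via the HTTP API.
# Assumes the 'requests' library and a local Consul agent on port 8500.
import requests

CONSUL = 'http://127.0.0.1:8500'

# List all known members; Status 1 means the node is alive.
members = requests.get(f'{CONSUL}/v1/agent/members').json()
for m in members:
    print(m['Name'], 'alive' if m['Status'] == 1 else f"status={m['Status']}")

# Health checks in the 'critical' state indicate nodes considered failed.
critical = requests.get(f'{CONSUL}/v1/health/state/critical').json()
for check in critical:
    print('critical:', check['Node'], check['CheckID'])
```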
It takes the Consul cluster 3 to 5 minutes to detect that a host has failed and needs evacuation. After the host failure is reported to the Masakari controller, the recovery time depends on factors such as network bandwidth and other pending compute tasks that must complete before OpenStack Nova can start evacuating instances from the failed host.
This wait period protects against undesired evacuations during temporary outages or host reboots.
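To make the idea of a wait period concrete, here is a purely illustrative polling loop that reports a host as failed only after it has stayed down for a full grace period. This is not Platform9's actual implementation; the 300-second threshold and the Consul endpoint are assumptions.

```python
# Illustrative sketch only: report a failure upstream only after a node has
# remained unhealthy for a full grace period (guards against reboots/blips).
# The 300-second threshold and local Consul endpoint are assumptions.
import time
import requests

CONSUL = 'http://127.0.0.1:8500'
GRACE_PERIOD = 300  # seconds; roughly matches the 3-5 minute window above
first_seen_down = {}

def down_nodes():
    """Return the set of node names whose health checks are 'critical'."""
    checks = requests.get(f'{CONSUL}/v1/health/state/critical').json()
    return {c['Node'] for c in checks}

while True:
    now = time.time()
    down = down_nodes()
    # Forget nodes that recovered before the grace period expired.
    for node in list(first_seen_down):
        if node not in down:
            del first_seen_down[node]
    for node in down:
        first_seen_down.setdefault(node, now)
        if now - first_seen_down[node] >= GRACE_PERIOD:
            # In the real system, Masakari would be notified here and would
            # orchestrate evacuation of the node's instances.
            print(f'{node} down for {GRACE_PERIOD}s: trigger recovery')
    time.sleep(10)
```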
Configuration
- Consul IP Address: By default, the IP address that Consul binds to on each hypervisor host in an HA cluster is the same one you configured for the VNC proxy when authorizing the Hypervisor role. Choose an IP address that is reachable by all other nodes in the cluster. From a security standpoint, make sure that this IP address is not accessible from the public network. If you wish to use a different IP address for Consul than the one configured for VNC, contact our Support team to help you configure it before HA is enabled on that host.
- Ports: Consul requires up to four different ports to work properly, using TCP, UDP, or both. If you have any firewalls, be sure to allow the appropriate protocols for the following ports (a connectivity-check sketch follows this list).
- Server RPC (Default 8300). This is used by servers to handle incoming requests from other agents. TCP only.
- Serf LAN (Default 8301). This is used to handle gossip in the LAN. Required by all agents. TCP and UDP.
- CLI RPC (Default 8400). This is used by all agents to handle RPC from the CLI. TCP only.
- HTTP API (Default 8500). This is used by clients to talk to the HTTP API. TCP only.
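As a quick way to verify firewall rules, the following sketch probes the default Consul ports on a peer node from another host. The hostname is an assumption; note that the UDP side of Serf LAN cannot be reliably probed this way because UDP is connectionless, so only TCP reachability is tested here.

```python
# Sketch: check TCP reachability of the default Consul ports on a peer node.
# The hostname is hypothetical; adjust ports if your deployment overrides them.
# UDP (also needed for Serf LAN, 8301) is not tested: it is connectionless.
import socket

HOST = 'hypervisor-01.example.com'
TCP_PORTS = {
    8300: 'Server RPC',
    8301: 'Serf LAN',
    8400: 'CLI RPC',
    8500: 'HTTP API',
}

for port, name in TCP_PORTS.items():
    try:
        with socket.create_connection((HOST, port), timeout=2):
            print(f'{name} ({port}/tcp): reachable')
    except OSError as exc:
        print(f'{name} ({port}/tcp): blocked or closed ({exc})')
```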
Other Considerations
- A minimum of 3 healthy nodes is needed for HA at all times. Make sure this is still the case after a recovery (see the sketch after this list).
- All hosts should be configured to use shared storage for instances, so that instances from a failed host can be restarted on the surviving hosts with their disks intact.
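To verify the first point after a recovery, you can count the live members that the local Consul agent sees. This sketch assumes a local agent on port 8500; in Consul's member listing, Status 1 means alive.

```python
# Sketch: verify that at least 3 cluster members are alive after a recovery.
# Assumes a local Consul agent on port 8500 (member Status 1 == alive).
import requests

members = requests.get('http://127.0.0.1:8500/v1/agent/members').json()
alive = [m['Name'] for m in members if m['Status'] == 1]
print(f'{len(alive)} healthy node(s): {alive}')
if len(alive) < 3:
    print('WARNING: fewer than 3 healthy nodes; HA protection is degraded')
```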
References
- Masakari Project: https://wiki.openstack.org/wiki/Masakari
- Consul: https://www.consul.io/