Considerations for High Availability and Infrastructure Uptime in distributed environments

High availability (HA) and uptime are related concepts, but they’re not the same. Uptime is about whether your server is reachable at all, while availability refers to how accessible it is, that is, how many of the users who try to connect actually can. This means that if even a single connection to your server is possible, the server is officially “up,” even though it might be unavailable to a number of users. Additionally, uptime is typically an infrastructure-level measure, while HA typically refers to the application level.

Recognize your requirements

Make no mistake – both application HA and infrastructure uptime are core capabilities that organizations require in order to remain competitive and relevant, irrespective of industry. The required levels of HA and uptime, however, differ across industries. While some applications only need to be up about 95% of the time, others require at least 99%, and still others 99.99%. Critical applications such as financial, medical, or air traffic systems often aim for as close to 100% availability as possible.
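To make these percentages concrete, it helps to translate each target into a yearly downtime budget. The following is a minimal sketch in Python; the function name is illustrative, and it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# The targets mentioned in the text above.
for target in (95.0, 99.0, 99.99, 100.0):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability -> {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 95% target leaves room for about 438 hours (roughly 18 days) of downtime a year, while 99.99% leaves less than one hour.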

Depending on your requirements, both the cost and the ideal approach are going to be drastically different, so identifying your requirements and measuring your current availability and uptime needs to be step one. This can include monitoring hardware and components, tracking complex business processes, calculating how much money is lost during downtime, and getting in touch with customers to understand how downtime affects them. Remember, any increase in availability and uptime needs to be justified by a cost-benefit analysis.
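As a rough illustration of that first step, the sketch below computes measured availability from a list of outage durations and puts a price on the downtime. The outage log, the 30-day period, and the cost per hour are all made-up figures for the example, not data from any real system.

```python
from datetime import timedelta


def measured_availability(period: timedelta, outages: list[timedelta]) -> float:
    """Return availability as a percentage of the observation period."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    downtime = sum(outages, timedelta())
    return 100 * (1 - downtime / period)


def downtime_cost(outages: list[timedelta], cost_per_hour: float) -> float:
    """Return the estimated cost of all outages at a flat hourly rate."""
    hours_down = sum(outages, timedelta()).total_seconds() / 3600
    return hours_down * cost_per_hour


# Hypothetical month: three outages totalling four hours.
period = timedelta(days=30)
outages = [timedelta(hours=2), timedelta(minutes=30), timedelta(minutes=90)]

print(f"Measured availability: {measured_availability(period, outages):.3f}%")
print(f"Estimated cost: {downtime_cost(outages, cost_per_hour=5000.0):,.0f}")
```

Comparing the cost figure with the price of the next availability tier is one simple way to frame the cost-benefit question.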

Distributed environments

In addition to the fact that containers, VMs, and bare-metal servers all approach the problem of availability differently, distributed environments with multiple data centers, including edge locations, further complicate an already complex situation. A good analogy here is the escalator vs. staircase. While a staircase is made of concrete, has no moving parts, and is up and available 100% of the time, an escalator requires maintenance and downtime on a regular basis.

Similarly, hybrid environments consisting of different technologies and countless moving parts need to account for infrastructure diversity, unpredictable demands, and failure recovery right from the get-go. While an easy way to do this is obviously to employ a hybrid platform like OpenStack or a managed solution like Platform9, SANless failover clustering solutions are increasing in popularity due to their comparatively lower cost. In addition to working across private, public, and hybrid clouds, SANless failover clustering provides carrier-class HA for both Windows and Linux servers.

Bare metal considerations

Contrary to popular belief, a lot of people still run applications directly on bare-metal servers, especially for use cases like big data with high computing or high privacy requirements. Being a sole tenant with root access is great, since you have no noisy neighbors and one less software layer to worry about, but there’s also a lot more responsibility to shoulder, unless of course you go with a vendor that provides bare metal as a service.

These responsibilities include maintaining HA/uptime by ensuring there is no single point of failure, configuring redundant and failover servers, and employing a load balancer to direct traffic between your servers. The balancing algorithm depends on whether the application is stateless or stateful. Stateless applications can use simple schemes like round-robin, server probe, or weighted distribution, since any server can handle any request. Stateful applications usually need hashing or hash buckets, keyed on something like the client IP or session ID, so that a given client keeps landing on the server that holds its state. The load balancer itself can be software, like HAProxy or Nginx, or hardware, like F5 or Barracuda.
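The selection strategies named above can be sketched in a few lines of Python. This is only an illustration of the ideas, not how HAProxy, Nginx, or any hardware load balancer implements them. The server names and the client key are hypothetical.

```python
import hashlib
import itertools


def round_robin(servers):
    """Return an iterator that hands out servers in a repeating cycle."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Return an iterator that hands out servers in proportion to integer weights."""
    expanded = [server for server, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def hash_pick(servers, client_key):
    """Map a client key (e.g. an IP or session ID) to the same server every time."""
    digest = hashlib.sha256(client_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["srv-a", "srv-b", "srv-c"]  # hypothetical backend pool

rr = round_robin(servers)
print([next(rr) for _ in range(4)])   # cycles back to the first server

wrr = weighted_round_robin({"srv-a": 2, "srv-b": 1})
print([next(wrr) for _ in range(4)])  # srv-a is chosen twice as often

print(hash_pick(servers, "203.0.113.7"))  # the same key always gets the same server
```

Hash-based selection gives a client a stable target, which is what keeps session state reachable. Note that this simple modulo scheme remaps most keys when the pool size changes, which is the problem consistent hashing is designed to reduce.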

For further redundancy, an additional load balancer can also be employed for stateless applications, together with a tool like Elastic IP that remaps the public IP address to the standby, since switching load balancers would otherwise involve a DNS change. Additionally, containers running directly on bare-metal servers generally avoid the hypervisor overhead that containers running inside VMs incur. So if you’re going the microservice route, bare metal with an HA Kubernetes cluster is not a bad idea.

Virtual Availability and Uptime

With virtual machines, it can get a bit confusing, since HA here refers to availability at the hypervisor level, which is sort of like infrastructure uptime. At the application level, it’s called App HA. This is why you have separate solutions: VMware HA, which is built on failover clustering and works by pooling resources and restarting failed VMs on alternative hosts, and vSphere App HA, which lets you define availability for the applications running in your VMs.

Key features to look for in a solution for uptime include VM monitoring, live migration, and checkpoint restore. In the case of host failures – like if an ESXi host fails and “heartbeats” are no longer sent – the infrastructure HA solution restarts the VM on an alternative host. If a VM fails, however, or a guest OS inside a VM fails, the VM is restarted on the same host. Similarly, at the application level, heartbeats between the app and server are monitored. In case of a failure, the app is restarted first and, if that does not resolve it, the VM is restarted on the same host.
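The decision logic described above can be sketched as a small Python function. This is a simplified model for illustration, not VMware's implementation. The class names, the 15-second timeout, and the use of plain timestamps are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no action"
    RESTART_ON_SAME_HOST = "restart VM on the same host"
    RESTART_ON_OTHER_HOST = "restart VM on an alternative host"


@dataclass
class HeartbeatStatus:
    host_last_heartbeat: float  # seconds since epoch, last heartbeat from the host
    vm_last_heartbeat: float    # seconds since epoch, last heartbeat from the VM


def decide(status: HeartbeatStatus, now: float, timeout: float = 15.0) -> Action:
    """Pick a recovery action from how stale each heartbeat is."""
    # A silent host means the VM cannot be recovered in place.
    if now - status.host_last_heartbeat > timeout:
        return Action.RESTART_ON_OTHER_HOST
    # The host is alive but the VM (or its guest OS) stopped responding.
    if now - status.vm_last_heartbeat > timeout:
        return Action.RESTART_ON_SAME_HOST
    return Action.NONE


# Hypothetical snapshot: the host is fine, the VM has been silent for 40 seconds.
print(decide(HeartbeatStatus(host_last_heartbeat=100.0, vm_last_heartbeat=65.0), now=105.0))
```

Checking the host first matters: if the host is down, the VM heartbeat will be stale too, and restarting on the same host would not help.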

Container environments

Containers in the enterprise generally run on Kubernetes, and HA on Kubernetes is all about setting things up so that there is no single point of failure. This “HA environment” is usually in the form of a multi-master cluster with at least three masters, so that all critical components are replicated in case of failure. Unlike VMs in hypervisor-based environments, which are restarted after a failure, containers are ephemeral: a failed container is not repaired but replaced by a new one started from an immutable image.
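One reason three is the usual minimum is majority quorum: the cluster datastore (etcd, which uses the Raft consensus algorithm) keeps working only while more than half of its members are reachable. The arithmetic can be sketched as follows; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Return the smallest majority for a cluster with this many members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Return how many members can fail while a majority remains."""
    return members - quorum(members)


for size in (1, 2, 3, 4, 5):
    print(f"{size} members: quorum {quorum(size)}, tolerates {failures_tolerated(size)} failure(s)")
```

A three-member cluster tolerates one failure, and adding a fourth member does not improve on that, which is why odd cluster sizes are the norm.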

Kubernetes applications are ideally stateless, with data and configuration stored independently in the form of ConfigMaps, Secrets, and PersistentVolumes, so a failed container can be replaced without losing state. Kubernetes also uses ReplicationControllers (and their successor, ReplicaSets) to ensure a set number of replicas of a pod are always running, so that if a pod or its host fails, a replacement is started on another host. Setting up at least a three-node etcd cluster with kubeadm is also recommended for HA with Kubernetes.
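The idea of keeping a set number of replicas running can be pictured as a reconciliation step that compares the desired count with what is actually healthy. The sketch below is a toy model in Python, not the Kubernetes controller or scheduler. The pod and node names, and the round-robin placement, are assumptions made for the example.

```python
import itertools


def reconcile(desired_replicas, pods, healthy_hosts):
    """Return a pod -> host mapping with exactly desired_replicas pods on healthy hosts."""
    if desired_replicas > 0 and not healthy_hosts:
        raise RuntimeError("no healthy hosts available")

    # Pods on failed hosts are considered lost; they are replaced, not repaired.
    surviving = {name: host for name, host in pods.items() if host in healthy_hosts}
    # Scale down if there are more pods than desired.
    surviving = dict(list(surviving.items())[:desired_replicas])

    # Start replacements until the desired count is reached.
    placement = itertools.cycle(healthy_hosts)
    for index in itertools.count():
        if len(surviving) >= desired_replicas:
            break
        name = f"pod-{index}"
        if name not in surviving:
            surviving[name] = next(placement)
    return surviving


# Hypothetical cluster: node-b has failed, taking pod-1 with it.
pods = {"pod-0": "node-a", "pod-1": "node-b", "pod-2": "node-c"}
print(reconcile(3, pods, healthy_hosts=["node-a", "node-c"]))
```

The point of the sketch is the loop shape: observe the current state, compare it with the declared state, and act only on the difference.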

In conclusion, while there are a number of ways to achieve different levels of infrastructure uptime and application availability, a lot depends on your requirements. For most applications without unique requirements, a cloud provider or managed service that oversees servers, geographic locations, load balancing, service discovery, and everything else that causes IT nightmares is usually the simplest route to strong HA and infrastructure uptime.

Platform9
