Announcing Elastic Machine Pool: The most cost-effective compute engine for EKS

A new compute engine
We are excited to announce the early-access availability of our new product, Elastic Machine Pool (EMP). EMP doubles EKS efficiency and saves 50% or more on AWS costs.

EMP is a new compute engine for EKS that runs on AWS bare-metal EC2 instances. Designed for DevOps/FinOps teams, EMP delivers automated EKS cluster efficiency and cost savings – powered by virtualization and live rebalancing of cloud compute resources – with no app disruptions and no negotiations needed with your development teams.

EKS utilization in practice is very low—typically less than 30%. This under-utilization is the primary source of waste and can account for 50% or more of AWS compute spend.

In our initial testing of EMP in a customer environment, utilization improved from 22% to 60%, and the corresponding AWS infrastructure costs dropped by 50% – supporting the hypothesis that under-utilization is the main driver of high compute costs.

The EKS utilization conundrum

As they develop and test their applications, developers identify the minimum and maximum resources needed by the application and then configure ‘requests’ and ‘limits’ on application pods. Kubernetes/EKS uses that information to schedule pods onto different nodes, leveraging auto-scaling technology to add new nodes, if needed, to accommodate these pods.
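As a sketch, a pod manifest with these settings might look like the following (names and values are illustrative, not taken from the article):

```yaml
# Illustrative pod spec: the scheduler places the pod based on `requests`,
# while `limits` caps what the container may consume at runtime.
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
      resources:
        requests:
          cpu: "500m"      # scheduler reserves 0.5 vCPU on the node
          memory: "512Mi"  # scheduler reserves 512 MiB on the node
        limits:
          cpu: "1"         # container is throttled above 1 vCPU
          memory: "1Gi"    # container is OOM-killed above 1 GiB
```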

However, real-world usage often differs greatly from those requests, with significant idle time. The result is inefficient resource utilization: most EKS users rarely achieve even 25% utilization.

EKS clusters – specifically with the use of auto-scaling technologies and fixed-size nodes – introduce additional challenges, further exacerbating the utilization problem.

The sizing problem

Pods within EKS clusters often misalign with available EC2 instance sizes. For example, a new pod that requests 2.5GB of RAM ends up on a 4GB instance because the closest available EC2 sizes are 2GB and 4GB, stranding 1.5GB of capacity.
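Using the numbers above, the stranded capacity is easy to quantify (a toy calculation, not EMP code):

```python
# Toy calculation of capacity stranded by fixed node sizes,
# using the 2.5GB-pod / 4GB-node example from the text.
pod_request_gb = 2.5
node_size_gb = 4.0  # smallest available EC2 size that fits the pod

stranded_gb = node_size_gb - pod_request_gb
utilization = pod_request_gb / node_size_gb

print(f"stranded: {stranded_gb:.1f} GB")   # 1.5 GB
print(f"utilization: {utilization:.1%}")   # 62.5%
```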

Fragmentation over time

This is sometimes referred to as the bin-packing problem. The dynamic nature of adding and removing pods causes fragmentation over time, resulting in suboptimal node allocation and resource waste within EKS.
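The effect is easy to reproduce with a toy first-fit simulation (illustrative only; real schedulers are far more sophisticated):

```python
# Toy first-fit bin packing: pods (GB of RAM requested) arrive in order
# and land on the first node with room; nodes are a fixed 8 GB each.
NODE_GB = 8

def first_fit(pods, node_gb=NODE_GB):
    nodes = []  # GB already committed on each node
    for pod in pods:
        for i, used in enumerate(nodes):
            if used + pod <= node_gb:
                nodes[i] = used + pod
                break
        else:
            nodes.append(pod)  # no node fits: the autoscaler adds one
    return nodes

# The same pods need 4 nodes in arrival order but only 3 when packed
# in size order, showing how arrival order alone strands capacity.
arrival_order = [3, 3, 3, 5, 5, 5]
print(len(first_fit(arrival_order)))                        # 4 nodes
print(len(first_fit(sorted(arrival_order, reverse=True))))  # 3 nodes
```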

Improper resource settings

To ensure Quality of Service (QoS), app developers set very high requests and limits; in some cases, developers set resource requests equal to the limits. Because average usage is typically far lower than peak usage, a large portion of the reserved resources goes unused.
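For instance, setting requests equal to limits gives the pod Kubernetes' `Guaranteed` QoS class, at the price of reserving peak capacity full-time (values are illustrative):

```yaml
# Requests == limits: the pod gets `Guaranteed` QoS, but the full
# peak allocation is reserved even when average usage is far lower.
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```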

Siloed teams

In an organization, the FinOps team owns costs, the DevOps team monitors utilization and owns resource optimization, and AppDev is responsible for resource configuration, i.e., setting requests and limits. This organizational structure hinders improvements in utilization and cost optimization.

Change resistance

There is a conflict between the need for AppDev teams to guarantee application SLA (controlled by app resource configuration) and the DevOps team’s desire to right-size the infrastructure, resulting in sub-optimal configuration.

App disruptions

Even when everyone agrees on the best configuration, applying changes may cause minor or major disruptions in the application. Various tools recommend right-sizing changes that require pod restarts, which in turn can cause disruptions, application downtime, and performance hits.

Dynamic environments

In a world of rapidly changing software, usage patterns shift frequently, rendering occasional optimizations quickly obsolete. There is no automated feedback loop to account for ever-changing conditions. As a result, the resource configuration is updated infrequently, leaving unrealized gains.

Current solutions and their shortcomings

All of the above pain points are well known, and various attempts have been made to address them using the approaches listed below:

Pod sizing

A typical solution involves observing a pod’s behavior and determining appropriate ‘requests’ and ‘limits’. Some open-source tools, such as the Vertical Pod Autoscaler (VPA), will either report or automatically apply the recommendations. Various commercial tools offer similar capabilities, with some being more accurate, more performant, or both.
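For illustration, a minimal VPA object in recommendation-only mode might look like this (the target Deployment name is hypothetical):

```yaml
# Vertical Pod Autoscaler in recommendation-only mode: it reports
# suggested requests without evicting pods to apply them.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app-vpa   # hypothetical
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app     # hypothetical
  updatePolicy:
    updateMode: "Off"     # report recommendations only
```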

Instance right-sizing and bin-packing

Instances hosting pods may have unused capacity due to pod resizing or the dynamic nature of workloads. Certain tools, such as Karpenter, can right-size the instances and shuffle pods around to find the best pod-to-VM placement.
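As a sketch, enabling Karpenter’s node consolidation on a NodePool looks roughly like this (field names follow Karpenter’s v1 API; verify against the version you run, and note a real NodePool also needs a `template` section):

```yaml
# Sketch of a Karpenter NodePool with consolidation enabled, so
# empty or underutilized nodes are removed or replaced over time.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m   # wait before consolidating a node
```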

Some tools provide one or the other, while others provide both. They are frequently combined with visibility into AWS cost savings. While the solutions mentioned above are good and solve a portion of the problem, they leave much to be desired. Here are a few issues we have encountered.

Pod disruptions can cause downtime

Pods may restart when they are moved between nodes during bin-packing optimization, and even dynamic resource adjustments can trigger restarts. Not all applications follow the 12-factor methodology or are completely stateless, and such applications cannot handle forced restarts and outages.

Many enterprise apps that cache large amounts of data in memory, as well as databases, experience downtime when pods are reconfigured with a different resource configuration. Downtime can range from several seconds for simple apps to 5+ minutes for large monolithic apps (some applications have long startup times). For mission-critical systems, any disruption is unacceptable and may require an explicit downtime window to execute bin-packing and/or resizing.

Pod usage is dynamic

Pods run processes that may alternate between periods of inactivity and periods of heavy usage, which makes right-sizing them with static request and limit values impractical. Existing solutions rely on horizontal scaling or make suboptimal pod-sizing decisions, which does not fully solve the utilization problem. So even if pods are perfectly packed, utilization at the AWS instance/hardware level remains poor, because apps seldom fully use their provisioned capacity.

Introducing Elastic Machine Pool

Platform9’s Elastic Machine Pool (EMP) is an innovative new product specifically designed to target the core inefficiency of EKS: low utilization. EMP tackles inefficient usage by leveraging hypervisor virtualization capabilities not otherwise available on AWS.

EMP architecture: A high-level overview

EMP employs a combination of virtualization technologies and other Kubernetes extensions to provide a seamless experience. The following diagram shows a high-level architecture for a given customer environment.

A diagram of the Elastic Machine Pool architecture

EKS currently provides two compute engines:

  1. EC2 instances: whether on-demand, reserved, or spot, these all provide the same type of compute capacity.
  2. Fargate: a serverless compute engine that hides the cluster’s nodes by managing them internally.

EMP is a new, third compute engine for EKS that is far more cost-effective than either EC2 or Fargate. How? By using proven virtualization technologies. Instead of creating EC2 VMs, EMP uses EC2 bare-metal instances to create Elastic VMs (EVMs) that join EKS as nodes. These EVMs run on the same AWS infrastructure as EC2 VMs and have access to AWS components such as VPC, EBS, EFS, and RDS. Pods run on EVMs just as they run on EC2 VMs.

EVMs have a few advantages:

  • EVMs can be overprovisioned on a small cluster of bare-metal servers. Overprovisioning allows allocating more virtual resources than are physically available, since not all VMs use their maximum resources concurrently.
  • EVMs are continuously bin-packed onto bare metal servers using EMP live migration and auto-scaling technologies. Live migration enables seamlessly moving VMs between physical hosts without disruption.
  • EMP is designed to be AWS-infrastructure compatible, meaning no changes to the way you use AWS services such as EBS, EFS, and VPC. EMP is also compatible with EKS, providing a seamless experience. Developers do not need to change their applications or learn any new tools or APIs – they do not even need to tune the requests and limits for their apps anymore. EMP automatically optimizes in real time to ensure app performance and SLAs.
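To see why overprovisioning pays off, here is a toy calculation (all numbers are hypothetical and chosen for illustration, not EMP internals):

```python
# Toy overcommit arithmetic: EVMs rarely use their full allocation at
# the same time, so more virtual RAM can be promised than the host has.
host_ram_gb = 768       # one bare-metal host (hypothetical size)
evm_ram_gb = 64         # virtual RAM promised to each EVM
avg_evm_usage = 0.30    # average fraction of its RAM an EVM actually uses

# Without overcommit, EVMs per host are limited by promised RAM.
evms_no_overcommit = host_ram_gb // evm_ram_gb                  # 12 EVMs

# With 2x overcommit, twice the physical RAM can be promised.
overcommit = 2.0
evms_overcommit = int(host_ram_gb * overcommit) // evm_ram_gb   # 24 EVMs

# At average load, expected physical usage still fits on the host.
expected_use_gb = evms_overcommit * evm_ram_gb * avg_evm_usage  # ~460.8 GB
print(evms_no_overcommit, evms_overcommit, expected_use_gb)
```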

This results in best-in-class resource utilization and cost benefits.

This approach is unique in the industry, solving the problem at the root cause layer (bare metal and virtualization) rather than trying to fix the symptoms. EMP obviates the need to right-size at the pod or VM layer. This enables a level of efficiency and cost optimization not possible with existing tools and approaches.

Why EKS clusters need EMP

The benefits of integrating EMP into EKS are manifold:

  • Unmatched Efficiency: EMP directly targets and rectifies inefficiencies in EKS VM instances at the hypervisor layer.
  • Seamless Integration: EMP is tailor-made for EKS, ensuring developers face no changes in their workflow.
  • Non-disruptive Operations: Thanks to virtualization, EKS applications remain stable during any infrastructural optimizations.
  • Broad Applicability: While EMP is optimized for EKS, its methodologies will be adapted for plain EC2 instances and will be extended to other public cloud platforms, amplifying its benefits.

In summary, as EKS continues to shape the future of container orchestration, maximizing its utilization is of paramount importance. Platform9’s EMP offers an avant-garde approach, ensuring businesses extract the utmost value from their EKS investments.

Dig Deeper

Are you ready to take action?

Let our team walk you through a live demo session, addressing your unique challenges and discussing how EMP can be tailored to your specific EKS needs. Don’t miss this chance to revolutionize your EKS cost optimization and achieve unmatched efficiency.

Book a personalized live demo of Elastic Machine Pool

Roopak Parikh

