Kubernetes FinOps: Comparing Platform9 Elastic Machine Pool to the Karpenter Autoscaler for EKS

A popular tool in today’s FinOps landscape for managing EKS clusters is the Karpenter autoscaler. Created at AWS, Karpenter does far more than simple “0 to N” autoscaling – in fact, at Platform9 we’ve sometimes heard its most enthusiastic users say it solves EKS efficiency and scaling problems outright. As we’ll see below, though, while it’s a valuable tool with clear advantages over the autoscalers that came before it, it isn’t the complete solution many take it to be.

Background: Kubernetes Cluster Autoscaler (CAS) and its shortcomings

Early in the development of Kubernetes, the project recognized the need to scale clusters dynamically to deal with increased or decreased deployment of workloads in a DevOps-friendly way and addressed it with the Cluster Autoscaler.  While the details of how CAS works today are complex and incorporate awareness of scheduling controls like priority and disruption budgets, at base the concept is very simple: if there are no nodes with enough resources to run workloads waiting to schedule, scale up one or more cloud-provider-managed node groups; if there is a large amount of wasted capacity, scale down node groups to suit.

If we use the metaphor of a talk at a conference with more attendees trying to see it than will fit in the room, an autoscaler is like venue staff who notice the folks standing in the hall craning their necks, and open a partition to provide more space for them to get in out of the hallway.  Likewise, if a room is only half-full, the venue staff might close a partition to allow the unused space to be used for another talk without interfering with the first one.

This model addresses the basic problem of having too little (or too much) cluster for the work at hand – but CAS only scales up with nodes similar to the ones its node groups already contain. If you have one small unschedulable pod in a cluster built on scaling groups of large nodes, CAS will create another large node to run that tiny workload if it can’t get it scheduled any other way. Likewise, if one large node is mostly empty but the others don’t have enough spare capacity between them to absorb its workloads, CAS will do nothing. Either way, you can still end up paying for resources you never use.

In addition to the above, CAS requires the scaling node groups to be created in the cloud provider, outside of CAS, and then set up in its configuration, increasing the administrative burden of handling diverse workloads (for example ones that require extra-large nodes, nodes with GPUs, or nodes with different CPU architectures) in the cluster.
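
To make that administrative burden concrete, here is a minimal sketch of what pointing CAS at its node groups can look like on AWS – a fragment of the cluster-autoscaler container spec, with the cluster name, ASG names, size bounds, and image tag all being illustrative assumptions:

    # Fragment of a cluster-autoscaler Deployment (AWS); the Auto Scaling groups named
    # here must already exist and be tagged, sized, and typed outside of CAS itself.
    containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2  # example tag
        command:
          - ./cluster-autoscaler
          - --cloud-provider=aws
          # Explicit node groups, as min:max:ASG-name (hypothetical names):
          - --nodes=1:10:eks-general-purpose-asg
          - --nodes=0:4:eks-gpu-asg
          # ...or discover ASGs by tag instead of listing each one:
          # - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster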

How Karpenter works and its advantages over CAS

Karpenter’s model is a simple but effective refinement of the one CAS uses: Karpenter provisions nodes of varying sizes as needed, without requiring node groups to be created in the provider first (though you can still run provider-managed node groups alongside it if you want to). If some pods can’t be scheduled because the nodes with available resources are too small, or are the wrong type for those pods, Karpenter creates new nodes of the necessary type that are just large enough – possibly larger than the existing nodes, if needed – to cover the shortfall and let those pods finally start.
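
By contrast with CAS, Karpenter is configured with constraints rather than pre-built groups. Below is a minimal illustrative NodePool using the v1beta1 API (field names shift a bit between Karpenter versions, so treat this as a sketch and check the docs for the release you run); the pool name and limits are assumptions:

    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: general-purpose            # hypothetical name
    spec:
      template:
        spec:
          requirements:
            # Bound what Karpenter may launch; it picks instance types and sizes within these
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["on-demand", "spot"]
          nodeClassRef:
            kind: EC2NodeClass         # AMI, subnet, and security-group details live here
            name: default
      limits:
        cpu: "200"                     # cap the total capacity this pool may provision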

This has a couple of noticeable side effects: as your cluster autoscales, you can end up with many more nodes, and they can be of different sizes. To compensate, and to handle general decreases in utilization, Karpenter also tries to remove, consolidate, and replace nodes so that the same workloads run at lower cost – either on fewer nodes of a given size or on smaller nodes. This is naturally liable to cause some pod disruption, so, like CAS, Karpenter tries to respect controls such as disruption budgets and avoids removing or replacing nodes if a budget would be violated. Karpenter also avoids adding nodes that wouldn’t actually let any unscheduled workloads schedule, by running a scheduling simulation before provisioning – for example, if affinity rules on the pending pods require them to run on specific existing nodes that lack the resources to host them, Karpenter won’t provision new nodes, because the pods still couldn’t run even if it did.
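
The disruption controls involved are the standard Kubernetes ones, declared on the workload side; a minimal example (with made-up names) that autoscalers honor when deciding whether a node can be drained:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: checkout-pdb               # hypothetical
    spec:
      minAvailable: 2                  # voluntary evictions (e.g. consolidation) must keep at least 2 pods running
      selector:
        matchLabels:
          app: checkout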

Karpenter also has a feature that reflects its birth in the early days of FinOps as a named practice: unlike CAS, it can directly monitor and react to Spot instance termination notices. When a Spot instance is about to be reclaimed, Karpenter adds replacement capacity in advance if needed so the instance’s pods can reschedule immediately, and begins draining the node right away to leave the largest possible margin for clean termination and fast rescheduling. That lets you capture Spot savings for workloads that can handle the inevitable disruption of Spot reclaims.
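
Workloads that can tolerate those reclaims can opt into Spot capacity explicitly by selecting on the karpenter.sh/capacity-type label Karpenter applies to the nodes it provisions. A hedged sketch, with the Deployment name, labels, and image chosen purely for illustration:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: batch-worker               # hypothetical
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: batch-worker
      template:
        metadata:
          labels:
            app: batch-worker
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: spot   # schedule only onto Spot capacity
          containers:
            - name: worker
              image: public.ecr.aws/docker/library/busybox:latest
              command: ["sh", "-c", "sleep infinity"]
              resources:
                requests:
                  cpu: 500m
                  memory: 256Mi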

Common Karpenter snags

Right at the outset of this section it’s worth noting that while Karpenter currently supports only AWS, Cluster Autoscaler supports dozens of cloud providers, so if you’re not using AWS you can’t really benefit from Karpenter yet – but Karpenter is a lot newer than CAS, so this could change in the future.

One issue that trips up new Karpenter users fairly regularly is that, like CAS, Karpenter looks for unschedulable workloads that could be scheduled, does the “scheduling math” to determine what to provision to make them schedulable, and then adds that capacity to the cluster – but it doesn’t do the actual workload scheduling; the Kubernetes scheduler still does.

In our earlier metaphor of the conference talk, Karpenter opens up additional space and tells the conference organizers that they can now use it, but it’s up to the organizers to actually seat the overflow attendees there.

This can lead to situations where more new capacity is allocated than pods actually end up landing on, so one or more new nodes are deprovisioned again without ever running a workload. The correction usually happens quickly, so the extra expense is small, but it can be disconcerting, since reducing exactly that kind of overprovisioning waste is the point of Karpenter.

Karpenter can also conflict with other autoscaling components, through no fault of its own. For example, the Karpenter docs warn that although Karpenter can coexist with the Cluster Autoscaler, the AWS Node Termination Handler often deployed alongside CAS to gracefully handle Spot instance terminations should be removed, “due to conflicts that could occur from the two components handling the same events”. If you replaced CAS with Karpenter, forgot (or didn’t know) to remove the Node Termination Handler, and then ran into issues, you might conclude Karpenter is misbehaving – but neither component is doing anything wrong; two things are simply reacting to the same events in ways that interfere with each other.
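
If the Node Termination Handler was installed with its Helm chart, removing it when you adopt Karpenter’s native interruption handling might look like the following – the release name and namespace shown are common choices, but verify what your cluster actually has:

    # Find the release, then remove it (release name and namespace are assumptions)
    helm list -A | grep aws-node-termination-handler
    helm uninstall aws-node-termination-handler --namespace kube-system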

Lastly, by design, Karpenter is a very capable cluster autoscaler – more so than the standard Kubernetes Cluster Autoscaler in several important respects – but it is not more than that, nor does its documentation claim otherwise. Most importantly, it is not a workload scaler: it won’t control the number of replicas of your workloads the way the Horizontal Pod Autoscaler does, and it won’t adjust their resource configuration the way the Vertical Pod Autoscaler does. If resource requests are poorly configured and some pods end up unschedulable even though the cluster is significantly under-utilized, Karpenter will still add nodes with enough available resources, sized to those requests, to let the pods schedule and run – but it won’t reconfigure the pods themselves so that you could shed the wasted allocation and run some or all of them within your existing footprint in the first place. The Kubernetes resource-management problem, in a nutshell, isn’t really about the size or number of nodes in the cluster: it’s about the workloads running on those nodes and whether the resources allocated to them are well-matched to their needs – which can change significantly over time.
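
For contrast with Karpenter’s node-level scope, this is roughly the shape of what a workload scaler does: a minimal VerticalPodAutoscaler object (targeting a hypothetical Deployment) that adjusts the pods’ own requests rather than the nodes underneath them:

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: checkout-vpa               # hypothetical
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: checkout
      updatePolicy:
        updateMode: "Auto"             # "Off" produces recommendations only, with no pod disruption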

How EMP bridges the workload efficiency gap

If Karpenter can’t make your workloads themselves more efficient, how do you reduce the biggest source of waste in your cluster – the unused resources allocated to those workloads? You don’t have to stop using Karpenter, but you do need something beyond that one very capable tool. Workload optimizers like the Vertical Pod Autoscaler exist, but as with Karpenter’s node consolidation, fully benefiting from them incurs pod disruption (and will still do so to some degree even if the recently added alpha feature for in-place pod resizing becomes generally available).

For a more comprehensive answer to workload resource inefficiency, Platform9’s Elastic Machine Pool takes a fundamentally different approach from single-purpose autoscalers to managing and optimizing total cluster utilization: EMP creates a virtualization layer on AWS Metal instances and uses it to create and manage Elastic VM (EVM) nodes that are added to your EKS clusters. EMP can then optimize seamlessly through two foundational virtualization technologies that address resource-management gaps in Kubernetes itself:

  • VM resource overcommitment lets EMP pack EVMs whose resources are allocated but largely unused onto the same AWS Metal nodes, so that unused resource requests don’t tie up physical capacity that workloads actually needing it could use.
  • VM live migration allows EMP to migrate pods between physical nodes as needed (by migrating the EVMs they run in), without restarting the pods as would normally be needed if this were handled within Kubernetes natively.  This means EVMs can be rebalanced to different AWS Metal instances as the overall amount of infrastructure is scaled up or down to suit the changing needs of the workloads in the cluster, without the pod disruption that would normally accompany that kind of rebalancing.

Together, these allow EMP to increase the overall utilization of the EKS cluster, to an extent that solutions based on managing Kubernetes workloads and nodes alone can’t.  EMP in effect becomes a cost-optimization layer below the EKS cluster, reducing EKS spend without EKS itself being involved in (or even aware of) EMP doing so.

If we stretch our “attendees at a conference” metaphor some, new space is opened up as the number of attendees gets too large for the existing room – but most of it is actually a hologram, which is replaced by more and more actual open space gradually, as needed.

Of course, if you’ve got some particular need to run specific workloads on EC2 instances, or just want to take your time migrating some or all of your workloads from using Karpenter to using EMP, you can: EMP will not interfere with Karpenter’s management of the EC2 nodes it controls, and Karpenter likewise won’t cause issues with EMP’s EVMs.  You can even easily enable EMP only for pods in chosen namespaces.

If you’re interested in getting started with EMP, let us know!  Signup is easy, and so is getting started with your first Elastic Machine Pool.

Additional reading and reference

  1. Elastic Machine Pool product page
  2. FinOps Landscape

Previously in this series:

Kubernetes FinOps: Right-sizing Kubernetes workloads

Documentation

  1. Elastic Machine Pool
  2. Cluster Autoscaler
  3. Karpenter

By Joe Thompson
