Kubernetes FinOps: Resource management challenges

In our previous blog post we covered the different ways to think of “utilization” in Kubernetes, and some of the basic metrics that are most commonly used to control application resource usage. You might have picked up that it’s a complex topic – and we really only scratched the surface there.

Background: How the Kubernetes scheduler thinks about Pods and their resources

Before the next post where we discuss optimizing cluster utilization as part of your FinOps activities, let’s talk a bit about how Kubernetes handles workload scheduling and resource management. These not only come into play when initially assigning work to nodes, but as we’ll discuss later, they affect which workloads are subject to termination earliest from nodes under pressure. The basic concepts are resource pressure, quality of service, priority, eviction and preemption.

Resource pressure

Whether the node is considered to be under pressure for a resource is determined by eviction thresholds. These are configured per resource type, and can be tuned by a cluster administrator.
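As a rough illustration of what such a threshold check amounts to, the sketch below compares an available-memory reading against a threshold. The function name and sample readings are made up for this post. The 100Mi default mirrors the kubelet's documented default hard eviction threshold for memory.available on Linux, and a real kubelet evaluates several signals, not just memory.

```python
MI = 1024 * 1024  # bytes in one mebibyte


def under_memory_pressure(available_bytes: int, threshold_bytes: int = 100 * MI) -> bool:
    """Return True when available memory has dropped below the eviction threshold."""
    return available_bytes < threshold_bytes


# Hypothetical readings from a node.
print(under_memory_pressure(512 * MI))  # False
print(under_memory_pressure(64 * MI))   # True
```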

Quality of service

Depending on the definition of and relationship between the various requests and limits of the containers in a pod, Kubernetes assigns one of three “quality of service” classifications to it:

BestEffort: If no containers have any CPU or memory requests or limits defined, the pod’s QoS is BestEffort. It may be scheduled on any node with any amount of resources free, and if the node then comes under resource pressure, BestEffort pods are the first to be evicted by the kubelet. In practical terms, you can think of them as having requests set to zero and limits set to the amount of the resource on the entire node.

Burstable: If any containers in the pod have a CPU or memory request or limit defined, but the requests and limits are not set (or not set to the same values) on all containers, that pod’s QoS class is Burstable. It will be scheduled only to nodes that meet the requests it has in place, and it will be subject to any limits it has defined. If the node comes under pressure, no Burstable pods are evicted until all BestEffort pods have been.

Guaranteed: If all containers have CPU and memory requests and limits set, and the requests are all set to the same values as the corresponding limits, the pod QoS class is Guaranteed – that pod will not be terminated under normal circumstances unless it exceeds a limit, and if a node comes under pressure, no Guaranteed-class pods are evicted by the kubelet until all lower-priority (see below) BestEffort and Burstable pods have been. (Guaranteed pods can never be evicted solely because of resource pressure because they never exceed their resource requests until they also exceed their limits.)
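The three rules above can be sketched as a small classifier. This is a simplified model written for this post, not the Kubernetes implementation: the qos_class function and its input shape are our own, and quantities are compared as plain strings, so "500m" and "0.5" would not be treated as equal here. It does reflect one documented detail: a container that sets a limit but no request gets a request equal to the limit.

```python
def qos_class(containers):
    """Classify a pod from a list of container resource dicts.

    Each container looks like {"requests": {...}, "limits": {...}} with
    optional "cpu" and "memory" keys; a missing dict means nothing was set.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            if resource in requests or resource in limits:
                any_set = True
            # An omitted request defaults to the limit when a limit is set.
            effective_request = requests.get(resource, limits.get(resource))
            if resource not in limits or effective_request != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


full = {"requests": {"cpu": "500m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "256Mi"}}
print(qos_class([{}]))                              # BestEffort
print(qos_class([{"requests": {"cpu": "100m"}}]))   # Burstable
print(qos_class([full]))                            # Guaranteed
print(qos_class([full, {}]))                        # Burstable
```

The last call shows why every container matters: one container without requests and limits is enough to drop an otherwise Guaranteed pod to Burstable.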

Priority

Priority isn’t directly a resource-management tool, but as we’ll see, it affects resource management nonetheless. Pods can have a priorityClassName that refers to a PriorityClass which maps to a priority number, or the pod can define an explicit priority directly. Either one places the pod in a priority stack: higher-priority pods are protected from preemption by the scheduler (see below for details) until all lower-priority pods are evicted. There are default PriorityClasses that can be applied to pods critical for cluster operations, and cluster administrators can define additional PriorityClasses to provide additional granularity for controlling pod eviction.

Eviction

Once a pod is scheduled, the kubelet on the node is then responsible for ensuring its resource requirements continue to be met, initially by reclaiming unused resources on the node if necessary as the pod’s consumption increases – but if there aren’t enough reclaimable resources available, it can also involve evicting (forcibly terminating) one or more other pods, which will then go back in the scheduler queue. Pods may be evicted from a node under a number of circumstances:

If any container in a pod exceeds a configured limit on an “incompressible” resource like memory.
If the node is under resource pressure, pods which exceed their requests for incompressible resources may be evicted by the kubelet even if they haven’t hit the limit on that resource.
Pods can also be evicted explicitly as a result of API requests that select them for eviction.
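To make the second circumstance more concrete, here is a minimal sketch of how pods might be ranked for eviction under memory pressure. It follows the ordering described in the Kubernetes node-pressure eviction documentation (usage above requests first, then pod priority, then how far usage exceeds requests), but the PodUsage type, the function and the sample numbers are hypothetical, and the real kubelet considers more than memory.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    memory_request: int  # MiB; 0 when no request is set
    memory_usage: int    # MiB currently in use


def eviction_order(pods):
    """Return pods sorted so the first entry is the likeliest eviction candidate."""
    def key(pod):
        exceeds = pod.memory_usage > pod.memory_request
        overage = pod.memory_usage - pod.memory_request
        # Pods over their request sort first, then lower priority, then larger overage.
        return (not exceeds, pod.priority, -overage)
    return sorted(pods, key=key)


pods = [
    PodUsage("guaranteed", priority=0, memory_request=300, memory_usage=250),
    PodUsage("highprio-over", priority=1000, memory_request=100, memory_usage=400),
    PodUsage("burstable-over", priority=0, memory_request=100, memory_usage=150),
    PodUsage("besteffort", priority=0, memory_request=0, memory_usage=200),
]
print([pod.name for pod in eviction_order(pods)])
# ['besteffort', 'burstable-over', 'highprio-over', 'guaranteed']
```

Note that a pod with no request counts as exceeding it as soon as it uses any memory at all, which is why BestEffort pods rank first.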

Preemption

While eviction under pressure is managed by the kubelet, preemption is managed by the scheduler. If normal scheduling of a schedulable pod can’t happen, the scheduler looks for a node where enough pods can be preempted to schedule the new pod, using two primary considerations:

At a minimum, candidate pods for preemption must be of lower priority than the pod to be scheduled; some other conditions like pod affinity and anti-affinity rules are also considered when determining which pods are preemptible and which nodes are eligible to have pods preempted (for example, if the new pod has an affinity rule to a pod scheduled on the node, that pod cannot be preempted since the new pod would still not be schedulable on the node).
If preempting all the lower-priority pods on a node would still not allow the new pod to be scheduled on it, no pods on that node will be preempted. This avoids a scenario where pods are preempted needlessly – but as we’ll see shortly, it can lead to some unexpected side effects when this rule interacts with other scheduling criteria.
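The two considerations above can be modeled in a few lines. This is a deliberately simplified sketch for a single node and CPU only: the Pod type, the function and the greedy victim selection are our own assumptions, not the scheduler's actual algorithm, and affinity is left out entirely. What it does capture is that only priority is compared, and QoS class does not appear anywhere.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int
    cpu_request: int  # millicores


def preemption_victims(node_cpu, running, incoming):
    """Return the pods to preempt so `incoming` fits, or [] if preemption cannot help."""
    free = node_cpu - sum(pod.cpu_request for pod in running)
    if incoming.cpu_request <= free:
        return []  # fits already, no preemption needed
    # First consideration: only strictly lower-priority pods are candidates.
    candidates = [pod for pod in running if pod.priority < incoming.priority]
    # Second consideration: if removing every candidate is still not enough, preempt nothing.
    if free + sum(pod.cpu_request for pod in candidates) < incoming.cpu_request:
        return []
    victims = []
    for pod in sorted(candidates, key=lambda pod: pod.priority):
        if incoming.cpu_request <= free:
            break
        victims.append(pod)
        free += pod.cpu_request
    return victims


running = [Pod("existing", priority=0, cpu_request=1500)]
same_priority = Pod("new", priority=0, cpu_request=1000)
higher_priority = Pod("new", priority=100, cpu_request=1000)
print([pod.name for pod in preemption_victims(2000, running, same_priority)])    # []
print([pod.name for pod in preemption_victims(2000, running, higher_priority)])  # ['existing']
```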

Resource management behaviors you might not expect

As straightforward as the above sounds, there are some important corner cases that can lead to seemingly odd behavior:

A pod will not be evicted because of resource pressure unless it exceeds one of its requests for an incompressible resource. This means that setting a low memory request for an important workload to make it easier to schedule, but leaving the limit unset so the pod doesn’t get evicted for exceeding it, can backfire by making it eligible for eviction earlier under resource pressure.
A lower-priority pod can actually block all preemption from happening on that node – if a pod is lower-priority than the pod to be scheduled, but cannot be preempted for reasons like affinity rules, then no pods will be preempted, even if enough other pods are theoretically preemptible to make the new pod schedulable.
Only Guaranteed pod containers can use static CPU allocation from a reserved CPU pool; this means you need to set a pod’s resource configuration to make it Guaranteed (as well as meeting a few other criteria) if you want that pod’s containers to have higher CPU affinity even if you wouldn’t otherwise do that.

Now, let’s look at some concrete examples.

The cluster used for the examples below has one worker node to make inducing scheduling issues easier, but you can see all of the following happen on multi-node clusters as well. If you want to follow along on a cluster of your own, we’ve included some config snippets and commands below (you’ll need kubectl installed, as well as jq for some commands if you want to filter the output).

Example: Guaranteed pod can’t be scheduled because of Burstable pods

To start, let’s look at a scenario where we need a Guaranteed pod to run. In this scenario our cluster is running a Burstable workload that takes up most of the available CPU shares, but with no priority set because we don’t mind if it gets preempted:

$ kubectl get pods -n accounting-apps
NAME                                   READY STATUS   RESTARTS   AGE
trivial-app-burstable-6cc5b5b59-gb98b  1/1   Running  0          3s

Our new workload is an important Guaranteed set of pods that can run if some of the Burstable pods are preempted, but when we submit it, it never gets scheduled:

$ kubectl apply -f important-app-guaranteed.yaml
deployment.apps/important-app-guaranteed created

$ kubectl get pods -n accounting-apps
NAME                                       READY   STATUS    RESTARTS   AGE
trivial-app-burstable-6cc5b5b59-gb98b      1/1     Running   0          17s
important-app-guaranteed-f75d9ff49-snrxq   0/1     Pending   0          2s
important-app-guaranteed-f75d9ff49-zqk47   0/1     Pending   0          2s
important-app-guaranteed-f75d9ff49-mkcxv   0/1     Pending   0          2s

$ kubectl get events -n accounting-apps
[...]
34s Warning FailedScheduling pod/important-app-guaranteed-f75d9ff49-snrxq 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
[...]
34s Warning FailedScheduling pod/important-app-guaranteed-f75d9ff49-zqk47 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
[...]
34s Warning FailedScheduling pod/important-app-guaranteed-f75d9ff49-mkcxv 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
[...]

What happened? Quality of Service isn’t part of scheduling preemption – it only comes into play after successful scheduling, to determine which pods are subject to eviction. Because the pod requests more resources than are available and has the same priority as the existing pods, it can’t be scheduled, regardless of the fact that it’s of a higher QoS class than existing workloads.

Actions to take: If your environment is resource-constrained, make sure that your workloads have the priority they need to get scheduled in the first place.

Example: Burstable pod preempting a Guaranteed pod

In this scenario, we have a Deployment with three replicas that’s doing some important work:

$ kubectl get pods -n accounting-apps
NAME                                       READY   STATUS    RESTARTS   AGE
important-app-guaranteed-f75d9ff49-8cr6x   1/1     Running   0          21s
important-app-guaranteed-f75d9ff49-ztttv   1/1     Running   0          21s
important-app-guaranteed-f75d9ff49-2c4zr   1/1     Running   0          21s

We configured the pods to be Guaranteed:

$ kubectl get po -n accounting-apps -l app=important-app -o json | \
jq '[.items[]|{podName:.metadata.name,qosClass:.status.qosClass}]'

[
  {
    "podName": "important-app-guaranteed-f75d9ff49-8cr6x",
    "qosClass": "Guaranteed"
  },
  {
    "podName": "important-app-guaranteed-f75d9ff49-ztttv",
    "qosClass": "Guaranteed"
  },
  {
    "podName": "important-app-guaranteed-f75d9ff49-2c4zr",
    "qosClass": "Guaranteed"
  }
]

The running pods are reserving almost all the CPU resources on the node:

$ kubectl describe nodes
[...]
  Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1700m (97%)   1500m (85%)
[...]

Now a new Deployment is submitted, with more CPU requested than is available on the node – but instead of being rejected, suddenly one of our Guaranteed pods is terminated:

$ kubectl create -f trivial-app-burstable.yaml
deployment.apps/trivial-app-burstable created

$ kubectl get events -n accounting-apps
LAST SEEN   TYPE      REASON              OBJECT                                          MESSAGE
[...]
116s        Normal    Killing             pod/important-app-guaranteed-f75d9ff49-2c4zr    Stopping container stress-runner
115s        Normal    Preempted           pod/important-app-guaranteed-f75d9ff49-2c4zr    Preempted by a pod on node k3s-cluster-usage-demo-3b43-0289a1-node-pool-8ad8-rgcbj

Worse, because the new pod is taking up the resources freed up by the terminated pod, our important pod can’t even be rescheduled:

$ kubectl get po -n accounting-apps
NAME                                       READY   STATUS    RESTARTS   AGE
important-app-guaranteed-f75d9ff49-8cr6x   1/1     Running   0          5m26s
important-app-guaranteed-f75d9ff49-ztttv   1/1     Running   0          5m26s
important-app-guaranteed-f75d9ff49-n64gg   0/1     Pending   0          5s
trivial-app-burstable-74d64d6487-lxvkf     1/1     Running   0          5s

$ kubectl get events -n accounting-apps
[...]
115s        Warning   FailedScheduling    pod/important-app-guaranteed-f75d9ff49-n64gg    0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

What happened? First, because we never set a priority or priorityClassName for the important-app-guaranteed pods, they defaulted to 0, the usual cluster-wide default priority:

$ kubectl get po -n accounting-apps -l app=important-app -o json | \
jq '[.items[]|{podName:.metadata.name,podPriority:.spec.priority}]'

[
  {
    "podName": "important-app-guaranteed-f75d9ff49-8cr6x",
    "podPriority": 0
  },
  {
    "podName": "important-app-guaranteed-f75d9ff49-ztttv",
    "podPriority": 0
  },
  {
    "podName": "important-app-guaranteed-f75d9ff49-n64gg",
    "podPriority": 0
  }
]

Next, the manifest spec for the new pod was configured with a priorityClassName of system-cluster-critical, setting it to a very high priority:

trivial-app-burstable-highpriority.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trivial-app-burstable
  namespace: accounting-apps
  labels:
    app: trivial-app
spec:
[...]
  template:
[...]
    spec:
      priorityClassName: system-cluster-critical
[...]

$ kubectl get po -n accounting-apps -l app=trivial-app -o json | \
jq '[.items[]|{podName:.metadata.name,podPriority:.spec.priority}]'

[
  {
    "podName": "trivial-app-burstable-74d64d6487-lxvkf",
    "podPriority": 2000000000
  }
]

Because preemption doesn’t consider QoS, the fact that the existing pod was Guaranteed didn’t stop it from being preempted.
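Since the two properties are easy to confuse, it can help to list them side by side. The sketch below does for both fields at once what the jq filters above do separately. The summarize function and the trimmed sample document are hypothetical; only the field paths (.metadata.name, .status.qosClass, .spec.priority) come from the commands used in this post.

```python
import json


def summarize(pod_list_json: str):
    """Return name, QoS class and priority for each pod in a `kubectl get po -o json` document."""
    items = json.loads(pod_list_json)["items"]
    return [
        {
            "podName": item["metadata"]["name"],
            "qosClass": item["status"].get("qosClass"),
            "podPriority": item["spec"].get("priority"),
        }
        for item in items
    ]


# Hypothetical, heavily trimmed input shaped like the command output.
sample = json.dumps({"items": [
    {"metadata": {"name": "important-app"},
     "spec": {"priority": 0},
     "status": {"qosClass": "Guaranteed"}},
    {"metadata": {"name": "trivial-app"},
     "spec": {"priority": 2000000000},
     "status": {"qosClass": "Burstable"}},
]})
for row in summarize(sample):
    print(row)
```

Reading the JSON from standard input instead of the sample string would be a small adaptation if you want to pipe real kubectl output into it.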

Actions to take: Control who can set priority or priorityClassName for pods, and manage those parameters carefully on authorized workloads, or you could encounter an accidental (or malicious!) disruption or denial of service of your applications because of an important pod being preempted by a less-important pod with a higher priority.

Example: Pod affinity to a lower-priority pod blocking preemption by a higher-priority pod

In this scenario our cluster is running the Burstable workload from the previous example, with a CPU request that takes up most of the available CPU, but with no priority set this time because we don’t mind if it gets preempted. It’s also running a smaller, low-priority nightly-accounting application with a QoS of BestEffort. Our new workload needs to be colocated with this application’s pod.

The spec for the nightly-accounting pod that we want our new workload to be colocated with has a priorityClassName referencing a PriorityClass we created to set the priority to 50:

priorityclass-low.yaml

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 50
globalDefault: false
description: "Use for very low-priority apps."

nightly-accounting-besteffort.yaml

[...]
spec:
  template:
    metadata:
      labels:
        app: nightly-accounting
    spec:
      priorityClassName: low-priority
[...]

Our important new workload uses the built-in system-cluster-critical PriorityClass (recall from the previous example that this sets the priority extremely high, so be careful about using it, but for things that are cluster-critical like security agents it may be justified), and has a pod affinity rule to enforce colocating this pod with a nightly-accounting pod:

important-app-guaranteed-highpriority.yaml

[...]
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - "nightly-accounting"
            topologyKey: kubernetes.io/hostname
[...]

Now we deploy our important new application:

$ kubectl create -f important-app-guaranteed-highpriority.yaml
deployment.apps/important-app-guaranteed created

But when we check on the progress, instead of pods getting preempted so the new one can be scheduled, no preemption happens and the new pod is stuck in the scheduling queue:

$ kubectl get po -n accounting-apps
NAME                                                    READY   STATUS    RESTARTS   AGE
nightly-accounting-besteffort-prio50-68759d6f5f-h69ct   1/1     Running   0          2m52s
trivial-app-burstable-6cc5b5b59-lsxc7                   1/1     Running   0          2m35s
important-app-guaranteed-5f77868c7c-vpg9j               0/1     Pending   0

$ kubectl get events -n accounting-apps
[...]
27s         Normal    SuccessfulCreate    replicaset/important-app-guaranteed-5f77868c7c               Created pod: important-app-guaranteed-5f77868c7c-vpg9j
27s         Warning   FailedScheduling    pod/important-app-guaranteed-5f77868c7c-vpg9j                0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 node(s) didn't match pod affinity rules..

What happened? For any pod on the node to be considered for preemption, the node would have to be able to accept the new workload once all pods on it with lower priority than the new pod were preempted. But because the new workload had an affinity rule requiring it to be scheduled alongside a nightly-accounting pod, preempting that pod would make the node ineligible for the new pod. Therefore, no preemption can happen on the node and our Deployment is stuck, even though preempting only the lowest-priority pod would free up plenty of CPU. (If you’re following along in your own environment, you can verify this by deleting the trivial-app-burstable deployment; the important-app-guaranteed pod will then successfully schedule and start.)
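The reasoning above can be expressed as a small feasibility check. This is a sketch of the rule as described in this post, not scheduler code: the Pod type, the function, the node size and the CPU figures are all hypothetical, and the affinity rule is reduced to "some surviving pod carries the required app label".

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int
    cpu_request: int  # millicores
    app: str = ""     # value of the pod's `app` label


def can_preempt_on_node(node_cpu, running, incoming, required_app):
    """Would removing every lower-priority pod make the node feasible for `incoming`?"""
    survivors = [pod for pod in running if pod.priority >= incoming.priority]
    cpu_fits = sum(pod.cpu_request for pod in survivors) + incoming.cpu_request <= node_cpu
    # Required pod affinity: a pod with the wanted label must still be on the node.
    affinity_ok = any(pod.app == required_app for pod in survivors)
    return cpu_fits and affinity_ok


trivial = Pod("trivial-app-burstable", priority=0, cpu_request=1500, app="trivial-app")
nightly = Pod("nightly-accounting", priority=50, cpu_request=0, app="nightly-accounting")
important = Pod("important-app", priority=2000000000, cpu_request=1000, app="important-app")

# The affinity target is itself lower priority, so it would be removed too.
print(can_preempt_on_node(2000, [trivial, nightly], important, "nightly-accounting"))  # False

# If the affinity target outranks the new pod, it survives and preemption can proceed.
nightly_high = Pod("nightly-accounting", priority=2000001000, cpu_request=0, app="nightly-accounting")
print(can_preempt_on_node(2000, [trivial, nightly_high], important, "nightly-accounting"))  # True
```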

Actions to take: Be cautious about how you use affinity so that you don’t inadvertently block scheduling of important workloads because an affinity rule prevents preempting an otherwise-eligible pod to free up enough resources for the new workload. In particular, pod-affinity rules should generally only select pods of higher priority to colocate with, to avoid this scenario.

Where do you go from here?

The key takeaway is that pod resource requests, resource limits and priorities affect not only initial scheduling but also how each pod is handled afterward, and they have side effects on the overall management of workloads during resource contention that can be complicated to reason about. Be aware that there are more corner cases and caveats than we covered here!

But if your cluster utilization isn’t where it should be, what can you do about it in your FinOps work? That’s the subject of our next post – stay tuned!

Additional reading

Previously in this series:

Basics of Cluster Utilization

Kubernetes documentation:

Pod Quality of Service Classes
How Pods with resource requests are scheduled
Pod Priority and Preemption
Node-pressure Eviction
CPU Management Policies

Author

Joe Thompson

Technical Product Marketing Manager at Platform9

Joe has almost 30 years of experience in IT operations and architecture, from helping out at small Internet Service Providers to building large-scale data-analysis facilities. Since 2013, he’s been focused on cloud-native technologies like Kubernetes and OpenStack -- both the advantages of using them, and the operational and administrative challenges they can present.
