Excessive Kubernetes Master Pod Restarts Due To ETCD Latency.

Problem

  • One or more "k8s-master" pods (one per master node) in the kube-system namespace of a Platform9 Managed Kubernetes cluster show an excessive number of restarts.
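As a rough way to confirm the symptom, the sketch below totals container restarts per pod from the JSON form of the pod list (the output of `kubectl get pods -n kube-system -o json`). The function names and the threshold of 10 restarts are hypothetical and chosen only for illustration; the "k8s-master" name prefix comes from the pod name used in this article.

```python
import json


def restart_counts(pod_list: dict) -> dict:
    """Map each pod name to the sum of restartCount over its containers."""
    counts = {}
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        counts[pod["metadata"]["name"]] = sum(
            status.get("restartCount", 0) for status in statuses
        )
    return counts


def noisy_pods(pod_list: dict, threshold: int = 10,
               name_prefix: str = "k8s-master") -> dict:
    """Return pods whose name starts with name_prefix and whose total
    restarts reach the threshold (threshold is an arbitrary example value)."""
    return {
        name: count
        for name, count in restart_counts(pod_list).items()
        if name.startswith(name_prefix) and count >= threshold
    }


def load_pod_list(text: str) -> dict:
    """Parse the JSON text produced by kubectl into a dictionary."""
    return json.loads(text)
```

One possible use is to save the kubectl output to a file, read it, and print `noisy_pods(load_pod_list(text))` to see which master pods are restarting most often.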

ETCD logs:

Etcd log

Kube-controller log:

kube-controller log

Kube-apiserver log:

kube-api log

Nodelet log:

Nodelet log

Environment

  • Platform9 Managed Kubernetes - v5.4 and above.
  • ETCD.

Cause

  • Etcd heartbeats are timing out, resulting in frequent leader elections.
  • The kube-controller-manager and kube-scheduler container logs show etcd read timeouts due to the leader elections, resulting in the restart of these containers.

Resolution

Identify the source of the ETCD latency, which is commonly caused by a slow or overloaded ETCD disk. There are two ways to test ETCD latency:

  1. Using the FIO tool: install fio and run the command below on the master node:
FIO
  2. Using ETCD Perf: run the commands below on the master node:
ETCD Perf
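If neither tool is available, a quick approximation of the same disk check can be sketched in Python. This is not the fio or ETCD Perf procedure referred to above and does not replace it; it only times small synchronous writes in a chosen directory. The function names, the block size, and the number of writes are illustrative assumptions. The upstream etcd documentation suggests that the 99th percentile of WAL fsync duration should stay below 10ms, which gives a rough figure to compare against.

```python
import math
import os
import tempfile
import time

# os.fdatasync is not available on every platform; fall back to os.fsync.
_sync = getattr(os, "fdatasync", os.fsync)


def fdatasync_latencies_ms(directory: str, writes: int = 200,
                           block_size: int = 2300) -> list:
    """Append small blocks to a temp file in `directory` and time the
    sync call after each write. Returns latencies in milliseconds.
    writes and block_size are arbitrary example values."""
    payload = b"\0" * block_size
    samples = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            _sync(fd)
            samples.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return samples


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Run it against a directory on the same disk that holds the ETCD data, for example `percentile(fdatasync_latencies_ms("/path/on/etcd/disk"), 99)`, where the path is a placeholder. A result well above 10ms points toward the disk as the source of the latency, although a proper fio or ETCD Perf run is the more reliable measurement.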

To avoid ETCD latency issues, make sure the hardware requirements in the official ETCD documentation are met, and make the necessary disk-level changes as recommended.

The default values of heartbeat-interval and election-timeout are 100ms and 1000ms, respectively.

For Azure, we've had to increase these values to 1000ms and 10000ms. These defaults are included in Platform9 Managed Kubernetes v4.1+.
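
The two pairs of values above can be written as etcd command-line flags. The sketch below builds the `--heartbeat-interval` and `--election-timeout` flags (both in milliseconds) and applies two sanity checks. The class name, the constants, and the check limits are assumptions for illustration: the minimum ratio of 5 and the 50000ms ceiling reflect limits etcd is understood to enforce, and should be verified against the etcd documentation for the version in use. How Platform9 Managed Kubernetes passes these values to etcd is not covered here.

```python
from dataclasses import dataclass

# Assumed limits; confirm against the etcd documentation for your version.
MIN_ELECTION_TO_HEARTBEAT_RATIO = 5
MAX_ELECTION_TIMEOUT_MS = 50_000


@dataclass(frozen=True)
class EtcdTiming:
    heartbeat_ms: int
    election_ms: int

    def problems(self) -> list:
        """Return human-readable issues; an empty list means none found."""
        issues = []
        if self.election_ms < MIN_ELECTION_TO_HEARTBEAT_RATIO * self.heartbeat_ms:
            issues.append("election timeout is too close to the heartbeat interval")
        if self.election_ms > MAX_ELECTION_TIMEOUT_MS:
            issues.append("election timeout exceeds the assumed upper limit")
        return issues

    def flags(self) -> list:
        """Render the pair as etcd command-line flags (values in ms)."""
        return [
            f"--heartbeat-interval={self.heartbeat_ms}",
            f"--election-timeout={self.election_ms}",
        ]


# Values taken from this article.
DEFAULT = EtcdTiming(heartbeat_ms=100, election_ms=1000)
AZURE = EtcdTiming(heartbeat_ms=1000, election_ms=10000)
```

Both pairs keep the election timeout at ten times the heartbeat interval, so raising one value without the other is worth double-checking with `problems()` before applying it.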

Additional Information
