# Excessive Kubernetes Master Pod Restarts Due To ETCD Latency

## Problem

* One or more `k8s-master` pods (the count depends on the number of master nodes) in the ***kube-system*** namespace of a Platform9 Managed Kubernetes cluster show an excessive number of restarts, for example:

{% tabs %}
{% tab title="None" %}

```none
❯ kubectl get po -n kube-system k8s-master-172.17.0.14
NAME                                                    READY   STATUS    RESTARTS   AGE
k8s-master-fe9d1e3a-4c43-417b-9720-c2a3d0732d9d000003   3/3     Running   119        27d
```

{% endtab %}
{% endtabs %}
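To spot affected pods quickly, the RESTARTS column can be filtered with a small helper. This is a minimal sketch: the function name `high_restarts` and the threshold of 50 are illustrative assumptions, not part of any Platform9 tooling.

```shell
# high_restarts LIMIT: read `kubectl get po` output on stdin and print the
# name and restart count of every pod whose RESTARTS value exceeds LIMIT.
# (Hypothetical helper; column positions assume the default output format.)
high_restarts() {
  awk -v limit="$1" 'NR > 1 && $4 + 0 > limit { print $1, $4 }'
}

# Example (the threshold of 50 is an arbitrary assumption):
# kubectl get po -n kube-system | high_restarts 50
```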

ETCD logs:

{% tabs %}
{% tab title="Etcd log" %}

```javascript
{"log":"{\"level\":\"warn\",\"ts\":\"2023-11-16T22:23:08.031Z\",\"caller\":\"etcdserver/util.go:163\",\"msg\":\"apply request took too long\",\"took\":\"9.069436957s\",\"expected-duration\":\"100ms\",\"prefix\":\"read-only range \",\"request\":\"key:\\\"/registry/horizontalpodautoscalers/\\\" range_end:\\\"/registry/horizontalpodautoscalers0\\\" limit:10000 \",\"response\":\"\",\"error\":\"etcdserver: request timed out\"}","stream":"stderr","time":"2023-11-16T22:23:08.031887945Z"}
```

{% endtab %}
{% endtabs %}
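When many such warnings are present, it can help to pull out just the `took` durations and compare them with the expected 100ms. The following is a minimal sketch that assumes the log file uses the same escaped JSON layout as the sample above; `slow_applies` is a hypothetical helper name, and the log file path is left to the caller.

```shell
# slow_applies FILE: print the "took" duration of every
# "apply request took too long" warning found in FILE.
# Assumes the escaped JSON layout of the sample log line.
slow_applies() {
  grep 'apply request took too long' "$1" |
    sed -nE 's/.*\\"took\\":\\"([^\\"]+)\\".*/\1/p'
}

# Example (path is a placeholder):
# slow_applies /path/to/etcd-container.log
```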

Kube-controller log:

{% tabs %}
{% tab title="kube-controller log" %}

```javascript
{"log":"E1116 23:58:01.449568       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get \"https://localhost:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)","stream":"stderr","time":"2023-11-16T23:58:01.450015023Z"}
```

{% endtab %}
{% endtabs %}

Kube-apiserver log:

{% tabs %}
{% tab title="kube-api log" %}

```javascript
{"log":"E1116 23:59:06.732861       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:\"etcdserver: request timed out\"}: etcdserver: request timed out","stream":"stderr","time":"2023-11-16T23:59:06.742609594Z"}
```

{% endtab %}
{% endtabs %}

Nodelet log:

{% tabs %}
{% tab title="Nodelet log" %}

```javascript
{"L":"INFO","T":"2023-11-16T17:21:19.642-0700","C":"command/command.go:120","M":"[2023-11-16 17:21:19] I1116 17:21:19.625769 3204532 request.go:1123] Response Body: {\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"etcdserver: request timed out\",\"code\":500}"}
```

{% endtab %}
{% endtabs %}

## Environment

* Platform9 Managed Kubernetes - v5.4 and above.
* ETCD.

## Cause

* Etcd heartbeats are timing out, resulting in frequent leader elections.
* The ***kube-controller-manager*** and ***kube-scheduler*** container logs show etcd read timeouts due to the leader elections, resulting in the restart of these containers.

## Resolution

First confirm the ETCD latency, which is typically caused by a slow or overloaded ETCD disk. There are two ways to test it:

1. **Using the fio tool:** Install fio and run the following commands on the master node:

{% tabs %}
{% tab title="FIO" %}

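```shell
sudo apt-get install -y fio   # Debian/Ubuntu; use your distribution's package manager otherwise
fio --version                 # should be 3.5 or later
# Create a scratch directory on the disk that holds the ETCD data
sudo mkdir -p /var/opt/pf9/kube/etcd/data/test-data
# Write 22 MiB in 2300-byte blocks, calling fdatasync after every write
sudo fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/opt/pf9/kube/etcd/data/test-data --size=22m --bs=2300 --name=hk-prod-test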
```

{% endtab %}
{% endtabs %}
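When reading the fio result, the figure of interest is the 99th percentile of the fdatasync latency. Upstream etcd guidance is that this should stay below 10ms; the sketch below encodes that limit. The helper name `fsync_verdict` is hypothetical, and the `jq` path in the comment is an assumption about fio's JSON output layout that should be checked against your fio version.

```shell
# fsync_verdict P99_NS: classify a 99th-percentile fdatasync latency given in
# nanoseconds. The 10 ms limit follows upstream etcd disk guidance and is an
# assumption here, not a Platform9-specific threshold.
fsync_verdict() {
  limit_ns=10000000   # 10 ms expressed in nanoseconds
  if [ "$1" -lt "$limit_ns" ]; then
    echo "ok"
  else
    echo "slow"
  fi
}

# Example, assuming jq is installed and fio is run with --output-format=json
# (the JSON path below is an assumption about fio's output layout):
# p99=$(sudo fio ... --output-format=json | jq '.jobs[0].sync.lat_ns.percentile."99.000000"')
# fsync_verdict "$p99"
```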

2. **Using ETCD Perf:** Run the following commands on the master node:

{% tabs %}
{% tab title="ETCD Perf" %}

```shell
cd /opt/pf9/pf9-kube/bin/
# Run the built-in performance check with the large ('l') workload
./etcdctl check perf --load='l'

# Same check with the client TLS certificate paths passed explicitly
/opt/pf9/pf9-kube/bin/etcdctl --cacert=/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert=/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key=/etc/pf9/kube.d/certs/etcdctl/etcd/request.key check perf --load='l'
```

{% endtab %}
{% endtabs %}
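The perf check reports each measurement as a pass or a failure. As a rough sketch, the failed checks can be filtered out of the output as follows; this assumes each result line begins with `PASS` or `FAIL`, and `perf_failures` is a hypothetical helper name.

```shell
# perf_failures: read `etcdctl check perf` output on stdin and print only the
# checks reported as failed. Assumes result lines start with PASS or FAIL.
perf_failures() {
  grep '^FAIL' || true   # no failures is not an error
}

# Example:
# ./etcdctl check perf --load='l' 2>&1 | perf_failures
```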

Make sure the hardware meets the requirements in the official ETCD [documentation](https://etcd.io/docs/v3.5/op-guide/hardware/) to avoid ETCD latency issues, and make any disk-level changes it recommends.

{% hint style="info" %}
**Info**

The default values of **heartbeat-interval** and **election-timeout** are 100ms and 1000ms, respectively.

For Azure, we've had to increase these values to 1000ms and 10000ms. These defaults are included in Platform9 Managed Kubernetes v4.1+.
{% endhint %}
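In upstream etcd these two settings correspond to the `--heartbeat-interval` and `--election-timeout` flags, which can also be supplied as environment variables. The fragment below shows the Azure values from the note above. How Platform9 Managed Kubernetes passes them to etcd is not covered in this article, so treat where they are set as an assumption.

```shell
# Upstream etcd environment variables, values in milliseconds.
# Where these are set in a PMK deployment is an assumption left to the reader.
ETCD_HEARTBEAT_INTERVAL=1000   # upstream default: 100
ETCD_ELECTION_TIMEOUT=10000    # upstream default: 1000
```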

## Additional Information

* [etcd docs | Tuning | Time Parameters](https://etcd.io/docs/v3.4.0/tuning/#time-parameters)
* [GitHub – kubernetes/kubernetes#74340 – kube-controller-manager restarts](https://github.com/kubernetes/kubernetes/issues/74340)

