# Node NotReady With Error "container runtime is down, PLEG is not healthy"

## Problem

* A Kubernetes node is in a "NotReady" state.
* The Kubelet process on the corresponding node is in a defunct state.

{% tabs %}
{% tab title="None" %}

```none
$ ps aux | grep kubelet
root    25606 20.0  0.0      0    0 ?        Zsl  XXXXX 16194:41 [kubelet] <defunct>
```

{% endtab %}
{% endtabs %}
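To list every defunct process rather than only the Kubelet, the `ps` output can be filtered on its state column. The following is a minimal sketch; the helper name `list_defunct` is hypothetical, and it assumes input lines in the form `PID PPID STAT COMMAND`.

```shell
# Print only the processes whose STAT field starts with "Z" (zombie/defunct).
# Expects "PID PPID STAT COMMAND" lines on stdin.
list_defunct() {
  awk '$3 ~ /^Z/ { print }'
}

# Usage on a live node:
# ps -eo pid=,ppid=,stat=,comm= | list_defunct
```

The PPID column identifies the parent that has not reaped the defunct process.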

* The node also shows a high load average relative to its number of CPUs, as observed via **`top`** or in the *pf9-muster* log.

{% tabs %}
{% tab title="/var/log/pf9/muster.log" %}

```none
[host] instances:0 loadavg:173.62 proc_active:16 proc_total:4218
```

{% endtab %}
{% endtabs %}
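One rough way to judge whether a load average is high for a given node is to divide it by the CPU count. The sketch below assumes a Linux node with `/proc/loadavg` and `nproc` available; the helper name `load_per_cpu` is hypothetical.

```shell
# Print the load average divided by the number of CPUs, to two decimals.
# $1 = load average (for example the 1-minute value), $2 = CPU count
load_per_cpu() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Usage on a live node:
# load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A result well above 1 suggests that more tasks are runnable or waiting than the CPUs can serve.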

* The Kubelet log reports that the container runtime is down and that PLEG is not healthy.

{% tabs %}
{% tab title="/var/log/pf9/kubelet/kubelet.INFO" %}

```none
581159  25606 kubelet_node_status.go:430] Recording NodeNotReady event message for node <IP>
I1110 XX:XX:XX.581176  25606 setters.go:518] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:YYYY-YY-YY XX:XX:XX.581137659 +0530 IST m=+4777925.197815516 LastTransitionTime:YYYY-YY-YY XX:XX:XX.581137659 +0530 IST m=+4777925.197815516 Reason:KubeletNotReady Message:container runtime is down,PLEG is not healthy: pleg was last seen active 21m34.729124335s ago; threshold is 3m0s}
```

{% endtab %}
{% endtabs %}

* In the Docker log, a "broken pipe" error is observed.

{% tabs %}
{% tab title="/var/log/pf9/kube/docker.log" %}

```none
<host> dockerd[20998]: time="YYYY-YY-YYTXX:XX:XX.510007829+05:30" level=error msg="Handler for GET /v1.31/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
```

{% endtab %}
{% endtabs %}
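To check whether the container runtime still answers at all, a listing command can be run under a time limit. This is a minimal sketch, assuming the coreutils `timeout` command is installed; the helper name `probe_runtime` and the 30-second limit in the usage line are illustrative choices, not values from Platform9.

```shell
# Run a command under a time limit and report whether it completed successfully.
# $1 = limit in seconds; remaining arguments = command to run
probe_runtime() {
  limit=$1
  shift
  if timeout "$limit" "$@" >/dev/null 2>&1; then
    echo "responsive"
  else
    echo "unresponsive or failed"
  fi
}

# Usage on a live node:
# probe_runtime 30 docker ps
```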

## Environment

* Platform9 Managed Kubernetes - v3.6.0 and higher
* Docker

## Cause

In every iteration, the PLEG health check calls `docker ps` to detect container state changes and `docker inspect` to get the details of those containers. After finishing each iteration, it updates a timestamp. If the timestamp has not been updated for 3 minutes, the health check fails.
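The threshold comparison described above can be illustrated against the message format that the Kubelet logs. This is only a sketch of the check, not the Kubelet's implementation; the function names `go_duration_to_seconds` and `pleg_exceeded` are hypothetical, and only the `h`, `m` and `s` duration units are handled.

```shell
# Convert a Go-style duration such as "21m34.729124335s" or "3m0s" to whole seconds.
go_duration_to_seconds() {
  echo "$1" | awk '{
    s = $0; total = 0
    # Consume <number><unit> pairs from the front of the string.
    while (match(s, /^[0-9.]+[hms]/)) {
      num  = substr(s, 1, RLENGTH - 1)
      unit = substr(s, RLENGTH, 1)
      if (unit == "h") total += num * 3600
      else if (unit == "m") total += num * 60
      else total += num
      s = substr(s, RLENGTH + 1)
    }
    printf "%.0f\n", total
  }'
}

# Succeed (exit 0) when the "last seen active" age exceeds the logged threshold.
# $1 = a Kubelet log line containing the PLEG message
pleg_exceeded() {
  seen=$(echo "$1" | sed -n 's/.*last seen active \([^ ]*\) ago.*/\1/p')
  limit=$(echo "$1" | sed -n 's/.*threshold is \([0-9a-z.]*\).*/\1/p')
  [ "$(go_duration_to_seconds "$seen")" -gt "$(go_duration_to_seconds "$limit")" ]
}
```

Passing a line from `kubelet.INFO` that contains `pleg was last seen active ... ago; threshold is ...` to `pleg_exceeded` returns success when the health check would have failed.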

In most occurrences of this issue, PLEG cannot finish all of these tasks within 3 minutes. This in turn causes Docker socket connection errors, which likely lead the *pf9-kubelet* service to eventually enter a defunct state.

{% hint style="warning" %}
**Warning**

When a cluster node flaps between the Ready and NotReady states because of PLEG issues caused by a high load average, the *pf9-nodelet* service continuously monitors the health of the *pf9-kubelet* service via the phase *Configure and start kubelet*. If there is an issue with the service, *pf9-nodelet* tries to restart that phase. In the long term, however, investigating the cause of the high load average helps prevent the node from flapping.
{% endhint %}

## Resolution

A defunct process cannot be cleared except by rebooting the node.

1. Reboot the node.

{% tabs %}
{% tab title="None" %}

```none
# shutdown -r now
```

{% endtab %}
{% endtabs %}
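After the reboot, the node state can be confirmed from any machine with `kubectl` access to the cluster. The sketch below filters the node listing for nodes that are not Ready; the helper name `not_ready_nodes` is hypothetical.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires kubectl access to the cluster):
# kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means that every node reports Ready.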

## Additional Information

Use the following scripts to measure the time taken to inspect containers:

{% tabs %}
{% tab title="Containerd" %}

```none
# TIMEFORMAT=%R; time (/opt/pf9/pf9-kube/bin/crictl -r unix:///run/containerd/containerd.sock ps | grep -v POD | awk '{print $1, $7}') | while read id name; do echo -e "\nChecking Container: $name : $id"; RESP=$(time /opt/pf9/pf9-kube/bin/crictl -r unix:///run/containerd/containerd.sock inspect "$id" 2>&1 > /dev/null); echo -e "Took$RESP above secs for $name ID: $id \n"; done; echo -e "Total Time"
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Docker" %}

```none
# TIMEFORMAT=%R; time docker ps --format "{{.ID}}\t{{.Names}}" | while read id name; do echo -e "\nChecking Container: $name : $id"; RESP=$(time docker inspect "$id" 2>&1 > /dev/null); echo -e "Took$RESP above secs for $name ID: $id \n"; done; echo -e "Total Time"
```

{% endtab %}
{% endtabs %}


