Node NotReady With Error "container runtime is down, PLEG is not healthy"

Problem

  • A Kubernetes node is in a "NotReady" state.

  • The Kubelet process on the corresponding node is in a defunct state.

$ ps aux | grep kubelet
root    25606 20.0  0.0      0    0 ?        Zsl  XXXXX 16194:41 [kubelet] <defunct>
  • The node also shows a high load average relative to its number of CPUs, as seen in top or in the pf9-muster log.

[host] instances:0 loadavg:173.62 proc_active:16 proc_total:4218
  • The Kubelet log reports "container runtime is down" and "PLEG is not healthy".

581159  25606 kubelet_node_status.go:430] Recording NodeNotReady event message for node <IP>
I1110 XX:XX:XX.581176  25606 setters.go:518] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:YYYY-YY-YY XX:XX:XX.581137659 +0530 IST m=+4777925.197815516 LastTransitionTime:YYYY-YY-YY XX:XX:XX.581137659 +0530 IST m=+4777925.197815516 Reason:KubeletNotReady Message:container runtime is down,PLEG is not healthy: pleg was last seen active 21m34.729124335s ago; threshold is 3m0s}
  • In the Docker log, a "broken pipe" error is observed.

<host> dockerd[20998]: time="YYYY-YY-YYTXX:XX:XX.510007829+05:30" level=error msg="Handler for GET /v1.31/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
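
The first three symptoms above can be checked directly on the node. The following is a minimal sketch, assuming a Linux host with procps ps, awk, and nproc available; the helper names find_defunct_kubelet and load_per_cpu are illustrative and not part of any Platform9 tooling.

```shell
#!/bin/sh
# Sketch: check for a defunct kubelet and compare load average to CPU count.
# Assumptions: Linux host (/proc/loadavg), procps ps, awk, nproc.

# Print PID, state, and command name of any zombie (state Z) kubelet process.
find_defunct_kubelet() {
  ps -eo pid=,stat=,comm= | awk '$2 ~ /^Z/ && $3 ~ /kubelet/ { print }'
}

# Divide a load average by a CPU count; values well above 1 suggest overload.
load_per_cpu() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load_1m=$(cut -d' ' -f1 /proc/loadavg)
cpus=$(nproc)
echo "1-minute load average: $load_1m on $cpus CPUs"
echo "Load per CPU: $(load_per_cpu "$load_1m" "$cpus")"
echo "Defunct kubelet processes (empty if none):"
find_defunct_kubelet
```

A load-per-CPU figure far above 1, together with a kubelet entry in state Z, matches the symptoms described in this article.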

Environment

  • Platform9 Managed Kubernetes - v3.6.0 and higher

  • Docker

Cause

In every iteration, PLEG (the Pod Lifecycle Event Generator) calls docker ps to detect container state changes and docker inspect to get the details of those containers. After finishing each iteration, it updates a timestamp. If the timestamp has not been updated for 3 minutes, the PLEG health check fails.
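
The timestamp check described above can be modelled in a few lines of shell. This is a simplified sketch, not kubelet source code: the function name pleg_healthy and its arguments are hypothetical, and the 180-second default mirrors the 3m0s threshold shown in the Kubelet log.

```shell
#!/bin/sh
# Simplified model of the PLEG health check (hypothetical helper, not kubelet code).
# Args: last_relist_epoch now_epoch [threshold_seconds]
pleg_healthy() {
  last_relist=$1
  now=$2
  threshold=${3:-180}   # 3m0s, as reported in the Kubelet log
  elapsed=$(( now - last_relist ))
  # Healthy only while the last completed iteration is within the threshold.
  [ "$elapsed" -le "$threshold" ]
}

# Example: the Kubelet log reports PLEG last active about 21m34s (1294 s) ago.
now=$(date +%s)
if pleg_healthy $(( now - 1294 )) "$now"; then
  echo "PLEG is healthy"
else
  echo "PLEG is not healthy"
fi
```

With the 1294-second gap taken from the log, the sketch prints "PLEG is not healthy", because the gap exceeds the 180-second threshold.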

In most occurrences of this issue, PLEG cannot finish all of these tasks within 3 minutes. This in turn causes Docker socket connection errors, which likely result in the pf9-kubelet service eventually entering a defunct state.

Resolution

The defunct Kubelet process can only be cleared by rebooting the node.

  1. Reboot the node.

Additional Information

Use the below script to identify the time taken to inspect containers:
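
One way to take this measurement is sketched here. It is an illustrative stand-in written under stated assumptions, not Platform9's own script: it assumes GNU date with nanosecond support (%N), the helper names elapsed_ms and inspect_timings are hypothetical, and the 1000 ms SLOW_MS cutoff is an arbitrary starting point.

```shell
#!/bin/sh
# Sketch: time `docker ps` and `docker inspect` per container to spot slow calls.
# Assumptions: GNU date (%N), docker CLI on PATH; SLOW_MS is an arbitrary cutoff.

SLOW_MS="${SLOW_MS:-1000}"

# Run a command, discard its output, and print the elapsed wall-clock time in ms.
elapsed_ms() {
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Time the container listing, then time an inspect call for each container.
inspect_timings() {
  echo "docker ps -a: $(elapsed_ms docker ps -a) ms"
  for id in $(docker ps -aq 2>/dev/null); do
    ms=$(elapsed_ms docker inspect "$id")
    if [ "$ms" -ge "$SLOW_MS" ]; then
      echo "SLOW $id ${ms} ms"
    else
      echo "ok   $id ${ms} ms"
    fi
  done
}

# Only query Docker when the CLI is installed.
if command -v docker >/dev/null 2>&1; then
  inspect_timings
fi
```

Containers flagged SLOW are the ones most likely to delay a PLEG iteration. If a call never returns, that is itself a sign that the Docker daemon is unresponsive.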
