Node NotReady With Error "container runtime is down, PLEG is not healthy"

Problem

  • A Kubernetes node is in a "NotReady" state.
  • The Kubelet process on the corresponding node is in a defunct state.
  • The node also shows a high load average relative to its number of CPUs, as observed via top or in the pf9-muster log.
/var/log/pf9/muster.log
  • The Kubelet log reports "container runtime is down, PLEG is not healthy".
/var/log/pf9/kubelet/kubelet.INFO
  • In the Docker log, a "broken pipe" error is observed.
/var/log/pf9/kube/docker.log
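The symptoms above can be checked from a shell on the affected node. The following is a minimal sketch rather than an official Platform9 tool: the function names are hypothetical, and it assumes a Linux node with procps ps, awk, grep, and nproc available.

```shell
# Ratio of load average to CPU count; values well above 1 suggest overload.
# $1: 1-minute load average, $2: number of CPUs
load_per_cpu() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Count the lines in a log file that report an unhealthy PLEG.
# $1: path to a Kubelet log file
count_pleg_errors() {
  grep -c "PLEG is not healthy" "$1"
}

# List defunct (zombie) processes: the STAT column starts with "Z".
list_defunct() {
  ps -eo pid,stat,comm | awk '$2 ~ /^Z/'
}

# Example usage on a node:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
#   count_pleg_errors /var/log/pf9/kubelet/kubelet.INFO
#   list_defunct
```

A defunct Kubelet process should appear in the output of list_defunct, and a non-zero count from count_pleg_errors confirms the PLEG message in the log.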

Environment

  • Platform9 Managed Kubernetes - v3.6.0 and higher
  • Docker

Cause

In every iteration, the PLEG health check calls docker ps to detect container state changes and docker inspect to get the details of those containers. After finishing each iteration, it updates a timestamp. If the timestamp hasn't been updated for 3 minutes, the health check fails.
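The timestamp rule described above can be illustrated with a small shell sketch. This is not the Kubelet's own code: the function name and its arguments are hypothetical, and only the 3-minute threshold comes from the description.

```shell
# Threshold from the description above: 3 minutes, in seconds.
PLEG_THRESHOLD_SECONDS=180

# pleg_is_healthy LAST_UPDATE NOW
# Both arguments are Unix timestamps in seconds (hypothetical inputs).
# Succeeds (exit 0) while the last update is newer than the threshold.
pleg_is_healthy() {
  [ $(( $2 - $1 )) -lt "$PLEG_THRESHOLD_SECONDS" ]
}

# Example: last update at t=1000, checked at t=1100 and again at t=1200.
pleg_is_healthy 1000 1100 && echo "healthy"
pleg_is_healthy 1000 1200 || echo "PLEG is not healthy"
```

The point of the sketch is that a single slow iteration is enough to fail the check: if docker ps or docker inspect stalls for longer than the threshold, the timestamp goes stale even though the health check itself is still running.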

In the most common occurrences of this issue, PLEG cannot finish all of these tasks within 3 minutes. This in turn causes Docker socket connection errors, which likely result in the pf9-kubelet service eventually entering a defunct state.

Where the cluster node is flapping between the Ready and NotReady states because of PLEG issues caused by a high load average, note that the pf9-nodelet service continuously monitors the health of the pf9-kubelet service via the phase "Configure and start kubelet". If there is an issue with the service, pf9-nodelet tries to restart that phase. In the long term, however, investigating the cause of the high load average is the way to prevent the node from flapping.

Resolution

The defunct process can only be cleared by rebooting the node.

  1. Reboot the node.

Additional Information

To identify slow containers, measure the time taken to inspect each container using the runtime in use on the node (Containerd or Docker).
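A minimal sketch of such a measurement follows. It is an illustration and not an official Platform9 script: the function name time_inspect is hypothetical, and it assumes the docker CLI (for Docker) or crictl (for Containerd) is on the PATH, along with GNU date for nanosecond timestamps.

```shell
# Print the time, in milliseconds, taken to inspect each running container.
# $1: runtime CLI to use, "docker" or "crictl" (assumed to be installed).
# The body runs in a subshell so its variables do not leak to the caller.
time_inspect() (
  cli="$1"
  for id in $("$cli" ps -q); do
    start=$(date +%s%N)                      # nanoseconds (GNU date)
    "$cli" inspect "$id" >/dev/null 2>&1     # output discarded, only timed
    end=$(date +%s%N)
    echo "$id $(( (end - start) / 1000000 )) ms"
  done
)

# Example usage on a node:
#   time_inspect docker    # Docker runtime
#   time_inspect crictl    # Containerd runtime
```

Containers whose inspect time is far higher than the rest are the likeliest contributors to a slow PLEG iteration.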