How to Troubleshoot Node NotReady Issues

Problem

  • How to troubleshoot node NotReady issues, and what information needs to be checked and collected when they occur.

Environment

  • Platform9 Edge Cloud - v5.3.0 and higher.
  • Platform9 Managed Kubernetes - v5.6 and higher.
  • Platform9 Self-Managed Cloud - v5.9 and higher.

Diagnostic Steps

  • Listed below are the most frequently observed issues/causes of node NotReady, with explanations and sample log traces.
  • Observe the nodelet phase statuses on the affected node using # /opt/pf9/nodelet/nodeletd status --verbose
  • Check /var/log/pf9/nodelet.log and try to identify errors and patterns.
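For example, recent errors in the nodelet log can be surfaced with a quick search (assuming the default log path referenced above):

    # Show the most recent error/failure entries from the nodelet log
    grep -iE "error|fail" /var/log/pf9/nodelet.log | tail -n 50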
  • Review the kubelet logs under /var/log/pf9/kubelet/ on the affected node, check whether any of the errors below are present, and take action accordingly.
  • Unhealthy PLEG issues. Refer: https://platform9.com/kb/kubernetes/node-not-ready-with-error-container-runtime-is-down-pleg-is-not
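A quick way to confirm PLEG health complaints (the kubelet log file name under the directory referenced above is an assumption, and the exact message wording varies by Kubernetes version):

    # Look for PLEG health messages in the kubelet log
    grep -i "PLEG is not healthy" /var/log/pf9/kubelet/kubelet.log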
  • TLS Handshake Timeout: Connectivity issues between the kubelet and the API server.
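As a sketch, the following can confirm the symptom and test connectivity from the node (the kubelet log file name and the API server endpoint are assumptions; replace them with the cluster's actual values):

    # Look for TLS handshake timeouts in the kubelet log
    grep -i "tls handshake timeout" /var/log/pf9/kubelet/kubelet.log

    # Test reachability of the API server from the affected node
    curl -kv https://<api-server>:<port>/healthz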
  • Eviction Manager Errors: The node is unable to retrieve resource usage stats, possibly due to connectivity issues or resource exhaustion.
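A minimal sketch for spotting these errors and checking for obvious resource exhaustion (kubelet log file name assumed as above):

    # Look for eviction manager / summary stats errors in the kubelet log
    grep -i "eviction manager" /var/log/pf9/kubelet/kubelet.log | tail -n 20

    # Check memory and disk headroom on the node
    free -m
    df -h /var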
  • Container Runtime Down: The container runtime (e.g., Docker or containerd) may not be running or responding.
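A hedged example of verifying the runtime, assuming containerd (substitute docker if that is the runtime in use; crictl may need its runtime endpoint configured):

    # Check whether the container runtime service is running
    systemctl status containerd --no-pager

    # Verify the runtime responds over its CRI socket
    crictl info
    crictl ps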
  • Networking Issues: Errors with the CNI (Container Network Interface) plugin, such as IP exhaustion or configuration issues.
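A sketch of basic CNI checks (the paths shown are common defaults and may differ per CNI plugin):

    # Look for CNI / pod sandbox network errors in the kubelet log
    grep -iE "cni|failed to set up sandbox" /var/log/pf9/kubelet/kubelet.log | tail -n 20

    # Confirm a CNI configuration is present on the node
    ls -l /etc/cni/net.d/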
  • Pod Sync Failures: Specific pods (e.g., CoreDNS) failing to start, often due to broader node health issues.
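For example, failing system pods such as CoreDNS can be inspected from any machine with cluster access (the node name placeholder and the CoreDNS label are assumptions; adjust for the cluster):

    # List pods scheduled on the affected node
    kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

    # Describe a failing pod, e.g. CoreDNS, to see events and failure reasons
    kubectl -n kube-system describe pod -l k8s-app=kube-dns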
  • Other factors can also mark a node NotReady; in those cases, the kubelet and nodelet logs remain the most useful starting point.

Additional Information

  • It is recommended to capture all necessary logs before rebooting the node (if a reboot is attempted at all) to resolve the issue, specifically a tarball of the /var/log/pf9 directory.
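A minimal example of capturing the logs on the affected node before any reboot (the output file name is illustrative):

    # Create a timestamped tarball of the Platform9 logs
    tar -czf /tmp/pf9-logs-$(hostname)-$(date +%Y%m%d-%H%M%S).tar.gz /var/log/pf9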
  • Collect a cluster dump (e.g. the /tmp/cluster-dump.tar.gz file) at the time of the issue to help identify the root cause if the issue is at the Kubernetes level.
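One way to generate such a dump with stock kubectl is sketched below (the output paths are illustrative):

    # Dump cluster state for all namespaces into a directory, then archive it
    kubectl cluster-info dump --all-namespaces --output-directory=/tmp/cluster-dump
    tar -czf /tmp/cluster-dump.tar.gz -C /tmp cluster-dump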
  • Collect information about the kube-system namespace and the impacted application namespaces at the time of the issue; this helps narrow down the problem at a high level.
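For example (replace <app-namespace> and <node-name> with the impacted namespace and node):

    # High-level state of system components, the impacted application, and the node
    kubectl get all -n kube-system -o wide
    kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp
    kubectl get all -n <app-namespace> -o wide
    kubectl describe node <node-name>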
  • In many cases the node may self-heal; even then, reviewing the same kubelet logs should help determine the root cause of the problem.
  • If the traces indicate that the node is NotReady due to an issue with the Platform9 managed components, please reach out to the support team at https://support.platform9.com/hc/en-us with the below details:
  1. Initial findings including the errors and logs.
  2. Timestamp of issue.
  3. Cluster dump during the time of issue.
  4. Output of # kubectl get all -n kube-system -o wide
  5. Support bundle from the affected nodes and the primary master node.