How to Troubleshoot Node NotReady Issues
Problem
How to troubleshoot node NotReady issues, and what information needs to be checked and collected when these issues occur.
Environment
Platform9 Edge Cloud - v5.3.0 and higher.
Platform9 Managed Kubernetes - v5.6 and higher.
Platform9 Self Managed Cloud - v5.9 and higher.
Diagnostic Steps
Listed below are the most frequently observed issues and causes of node NotReady, with explanations and sample log traces.
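Before digging into a specific node, it helps to confirm which nodes are currently NotReady. A minimal sketch (sample `kubectl get nodes` output is embedded so the snippet is self-contained; the node names and versions are illustrative, and on a live cluster you would pipe the real command output into the same filter):

```shell
# Filter NotReady nodes from `kubectl get nodes`-style output.
# On a live cluster, run instead:
#   kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}'
printf '%s\n' \
  'NAME      STATUS     ROLES    AGE   VERSION' \
  'node-01   NotReady   worker   12d   v1.26.6' \
  'node-02   Ready      worker   12d   v1.26.6' |
awk 'NR > 1 && $2 == "NotReady" {print $1}'
```

This prints only the names of NotReady nodes, which is a convenient starting list for the per-node checks below.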
Observe the traces from the nodeletd phases status output on the affected node using:
# /opt/pf9/nodelet/nodeletd status --verbose
Check /var/log/pf9/nodelet.log and try to identify errors and patterns.
Review the kubelet logs under /var/log/pf9/kubelet/ on the affected node; if any of the below errors are seen, take action accordingly.
Unhealthy PLEG issues. Refer: https://platform9.com/kb/kubernetes/node-not-ready-with-error-container-runtime-is-down-pleg-is-not
TLS Handshake Timeout: Connectivity issues between the kubelet and the API server.
Jan 17 10:12:36 node-01 kubelet[12345]: E0117 10:12:36.123456 12345 kubelet.go:1845] Failed to update node status, error: Get "https://10.0.0.1:6443/api/v1/nodes/node-01?timeout=10s": net/http: TLS handshake timeout
Eviction Manager Errors: Node is unable to retrieve resource usage stats, possibly due to connectivity issues or resource exhaustion.
Container Runtime Down: The container runtime (e.g., Docker or containerd) may not be running or responding.
Networking Issues: Errors with the CNI (Container Network Interface) plugin, such as IP exhaustion or configuration issues.
Pod Sync Failures: Specific pods (e.g., CoreDNS) failing to start, often due to broader node health issues.
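The error signatures above can be scanned for in one pass. A sketch using the sample TLS log line from this article as embedded input (on a real node, run the same grep against the files under /var/log/pf9/kubelet/ instead):

```shell
# Scan a kubelet log line for common NotReady error signatures.
# The sample line is embedded here so the snippet is self-contained;
# on a node, grep /var/log/pf9/kubelet/kubelet*.log with the same pattern.
line='E0117 10:12:36.123456 12345 kubelet.go:1845] Failed to update node status, error: net/http: TLS handshake timeout'
echo "$line" | grep -oE 'PLEG is not healthy|TLS handshake timeout|eviction manager|container runtime is down'
```

Each match points you at the corresponding bullet above (PLEG health, API server connectivity, eviction manager, or container runtime).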
Other factors can also mark a node as NotReady; in those cases too, the kubelet and nodelet logs prove useful.
Additional Information
It is recommended that, before rebooting the node (if done at all) to resolve the issue, all the necessary logs are captured, specifically a tarball of the /var/log/pf9 directory.
Collect the /tmp/cluster-dump.tar.gz file generated during the time of the issue to identify the root cause, if the issue is at the Kubernetes level.
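Capturing the log tarball can be sketched as follows. This is shown against a temporary directory so the snippet is self-contained; on a real node, point LOGDIR at /var/log/pf9 (the Platform9 tooling that produces /tmp/cluster-dump.tar.gz may differ from this manual approach):

```shell
# Create a tarball of the pf9 log directory before any reboot.
# LOGDIR is a temporary stand-in here; on a node use LOGDIR=/var/log/pf9
LOGDIR=$(mktemp -d)
echo 'sample log entry' > "$LOGDIR/nodelet.log"
tar -czf /tmp/pf9-logs.tar.gz -C "$LOGDIR" .
# Verify the archive contents before rebooting
tar -tzf /tmp/pf9-logs.tar.gz
```

Verifying the archive listing before a reboot ensures the evidence survives even if the node does not come back cleanly.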
Collect information about the kube-system and impacted application namespaces during the time of the issue; this helps narrow down the problem at a high level.
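For the kube-system check, a quick filter for pods that are not Running often surfaces the symptom (for example a CoreDNS pod failing to start, as noted above). Sample output is embedded so the snippet is self-contained; the pod names are illustrative, and on a live cluster you would pipe the real command output into the same filter:

```shell
# List kube-system pods that are not in the Running phase.
# On a live cluster, run instead:
#   kubectl get pods -n kube-system --no-headers | awk '$3 != "Running" {print $1, $3}'
printf '%s\n' \
  'coredns-5d78c9869d-abcde   0/1   CrashLoopBackOff   7   12m' \
  'calico-node-xyz12          1/1   Running            0   12d' |
awk '$3 != "Running" {print $1, $3}'
```

Unhealthy pods in kube-system on a NotReady node usually reflect the broader node issue rather than an application problem.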
In many cases the node may also self-heal; in those cases, reviewing the same kubelet logs should help determine the root cause of the problem.
If you find traces indicating that the node is NotReady due to an issue with the Platform9 managed components, please reach out to the support team at https://support.platform9.com/hc/en-us with the below details:
Initial findings including the errors and logs.
Timestamp of issue.
Cluster dump during the time of issue.
Output of:
# kubectl get all -n kube-system -owide
Support bundle from the affected nodes and the primary master node.
