How to Troubleshoot Node NotReady Issues
Problem
- How to troubleshoot node NotReady issues and what information needs to be checked and collected when these issues occur.
Environment
- Platform9 Edge Cloud - v5.3.0 and Higher.
- Platform9 Managed Kubernetes - 5.6 and Higher.
- Platform9 Self Managed Cloud - v5.9 and Higher.
Diagnostic Steps
- Listed below are the most frequently observed issues/causes of node NotReady, with explanations and sample log traces.
- Observe the traces from the nodelet phases status output on the affected node using:
# /opt/pf9/nodelet/nodeletd status --verbose
- Check /var/log/pf9/nodelet.log and try to identify errors and patterns.
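For example, a quick way to surface recent errors is a simple filter such as the following (a minimal sketch; adjust the pattern and line count as needed):
# grep -iE "error|fail|warn" /var/log/pf9/nodelet.log | tail -n 50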
- Check the kubelet logs under /var/log/pf9/kubelet/ on the affected node for any of the errors below, and take action accordingly.
- Unhealthy PLEG issues. Refer: https://platform9.com/kb/kubernetes/node-not-ready-with-error-container-runtime-is-down-pleg-is-not
- TLS Handshake Timeout: Connectivity issues between the kubelet and the API server.
Jan 17 10:12:36 node-01 kubelet[12345]: E0117 10:12:36.123456 12345 kubelet.go:1845] Failed to update node status, error: Get "https://10.0.0.1:6443/api/v1/nodes/node-01?timeout=10s": net/http: TLS handshake timeout
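To confirm basic reachability of the API server from the affected node, checks such as the following can help (10.0.0.1:6443 is the sample API server address from the log above; substitute your cluster's API endpoint, and nc requires netcat to be installed):
# curl -kv https://10.0.0.1:6443/healthz
# nc -zv 10.0.0.1 6443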
- Eviction Manager Errors: Node is unable to retrieve resource usage stats, possibly due to connectivity issues or resource exhaustion.
Jan 17 10:12:38 node-01 kubelet[12345]: W0117 10:12:38.654321 12345 eviction_manager.go:344] eviction manager: failed to get summary stats: failed to get node stats: Get "http://127.0.0.1:10255/stats/summary?only_cpu_and_memory=true": dial tcp 127.0.0.1:10255: connect: connection refused
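As a quick sanity check, verify that the kubelet's local health endpoint responds (10248 is the default kubelet healthz port; the read-only stats port 10255 seen in the log above may be disabled by configuration):
# curl http://127.0.0.1:10248/healthz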
- Container Runtime Down: The container runtime (e.g., Docker or containerd) may not be running or responding.
Jan 17 10:12:52 node-01 kubelet[12345]: W0117 10:12:52.345678 12345 node_status.go:107] Node "node-01" not ready: "KubeletNotReady" - container runtime is down
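Depending on the runtime in use (containerd on most recent clusters, docker on older ones), verify that the runtime service is running and responding, for example:
# systemctl status containerd
# crictl info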
- Networking Issues: Errors with the CNI (Container Network Interface) plugin, such as IP exhaustion or configuration issues.
Jan 17 10:12:50 node-01 kubelet[12345]: E0117 10:12:50.876543 12345 cni.go:333] Error adding default network: failed to allocate for range 0: no IP addresses available in range set: 10.244.1.0/24
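To gauge whether the node's pod CIDR is exhausted, compare the allocated range against the number of pods scheduled on the node (node-01 is the sample node name from the log above):
# kubectl get node node-01 -o jsonpath='{.spec.podCIDR}{"\n"}'
# kubectl get pods -A -owide --field-selector spec.nodeName=node-01 | wc -l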
- Pod Sync Failures: Specific pods (e.g., CoreDNS) failing to start, often due to broader node health issues.
Jan 17 10:12:45 node-01 kubelet[12345]: E0117 10:12:45.345678 12345 pod_workers.go:948] "Error syncing pod" pod="kube-system/coredns-5644d7b6d9-xyz" err="failed to \"StartContainer\" for \"coredns\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=coredns pod=coredns-5644d7b6d9-xyz_kube-system(e5a834ad-4d12-11ed-9e2f-0242ac120003)\""
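To dig into a specific failing pod, its events and previous container logs are usually the most telling (the pod name below is the sample from the log above; substitute the actual pod name):
# kubectl describe pod -n kube-system coredns-5644d7b6d9-xyz
# kubectl logs -n kube-system coredns-5644d7b6d9-xyz --previous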
- Other factors can also mark a node as NotReady; in those cases too, the kubelet and nodelet logs prove useful.
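- More generally, the node's reported conditions and recent events often point directly at the failing subsystem (node-01 below is a placeholder node name):
# kubectl describe node node-01 | grep -A10 Conditions
# kubectl get events -A --field-selector involvedObject.name=node-01 --sort-by=.lastTimestamp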
Additional Information
- It is recommended that, before rebooting the node (if done at all) to resolve the issue, all the necessary logs are captured, specifically a tarball of the /var/log/pf9 directory.
- Collect the /tmp/cluster-dump.tar.gz file generated using the below commands at the time of the issue to help identify the root cause if the issue is at the Kubernetes level.
# kubectl cluster-info dump -n kube-system -o yaml --output-directory=/tmp/cluster/cluster-dump
# tar -czvf /tmp/cluster-dump.tar.gz /tmp/cluster/cluster-dump
- Collect information related to the kube-system and impacted application namespaces at the time of the issue; this helps narrow down issues at a high level:
# kubectl get all -n kube-system -owide
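- Collecting recent cluster events alongside this output can also help correlate timelines (a suggested addition; adjust the output path as needed):
# kubectl get events -A --sort-by=.lastTimestamp > /tmp/events.txt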
- In many cases the node may also self-heal; in those cases, reviewing the same kubelet logs should help determine the root cause of the problem.
- If the traces indicate that the node is NotReady due to an issue with the Platform9 managed components, please reach out to the support team at https://support.platform9.com/hc/en-us with the below details:
- Initial findings including the errors and logs.
- Timestamp of issue.
- Cluster dump during the time of issue.
- Output of
# kubectl get all -n kube-system -owide
- Support bundle from the affected nodes and the primary master node.
Related knowledge base articles for reference:
- Node NotReady With Error "container runtime is down, PLEG is not healthy"
- Kubernetes Worker Node Reporting NotReady post Kubelet Service Restart.
- Worker Nodes in NotReady State as Master VIP is Unreachable
- Worker Nodes in NotReady due to System OOM Encountered
- Kubernetes Node in NotReady State After Reboot for Containerd Runtime Cluster
- Kubernetes Master Node in NotReady State With Message "cni plugin not initialized"
- Node went into NotReady state after updating CPU manager policy
- Node in NotReady state and nodelet phases stuck at "Configure and start kube-proxy" stage