How to Troubleshoot Node NotReady Issues
Problem
- How to troubleshoot node NotReady issues and what information needs to be checked and collected when these issues occur.
Environment
- Platform9 Edge Cloud - v5.3.0 and Higher.
- Platform9 Managed Kubernetes - 5.6 and Higher.
- Platform9 Self Managed Cloud - v5.9 and Higher.
Diagnostic Steps
- Listed below are the most frequently observed issues/causes of node NotReady, with explanations and sample log traces.
- Observe the traces from the nodelet phases status output on the affected node using:
# /opt/pf9/nodelet/nodeletd status --verbose
- Check /var/log/pf9/nodelet.log and try to identify errors and patterns.
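For example, a quick way to surface recent errors is a simple filter such as the following (a minimal sketch; adjust the pattern and line count as needed):
# grep -iE "error|fail|warn" /var/log/pf9/nodelet.log | tail -n 50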
- Check the kubelet logs under /var/log/pf9/kubelet/ on the affected node for any of the errors below, and take action accordingly.
- Unhealthy PLEG issues. Refer: https://platform9.com/kb/kubernetes/node-not-ready-with-error-container-runtime-is-down-pleg-is-not
- TLS Handshake Timeout: Connectivity issues between the kubelet and the API server.
Jan 17 10:12:36 node-01 kubelet[12345]: E0117 10:12:36.123456 12345 kubelet.go:1845] Failed to update node status, error: Get "https://10.0.0.1:6443/api/v1/nodes/node-01?timeout=10s": net/http: TLS handshake timeout
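To confirm basic reachability of the API server from the affected node, checks such as the following can help (10.0.0.1:6443 is the sample API server address from the log above; substitute your cluster's API endpoint, and nc requires netcat to be installed):
# curl -kv https://10.0.0.1:6443/healthz
# nc -zv 10.0.0.1 6443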
- Eviction Manager Errors: Node is unable to retrieve resource usage stats, possibly due to connectivity issues or resource exhaustion.
Jan 17 10:12:38 node-01 kubelet[12345]: W0117 10:12:38.654321 12345 eviction_manager.go:344] eviction manager: failed to get summary stats: failed to get node stats: Get "http://127.0.0.1:10255/stats/summary?only_cpu_and_memory=true": dial tcp 127.0.0.1:10255: connect: connection refused
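As a quick sanity check, verify that the kubelet's local health endpoint responds (10248 is the default kubelet healthz port; the read-only stats port 10255 seen in the log above may be disabled by configuration):
# curl http://127.0.0.1:10248/healthz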
- Container Runtime Down: The container runtime (e.g., Docker or containerd) may not be running or responding.
Jan 17 10:12:52 node-01 kubelet[12345]: W0117 10:12:52.345678 12345 node_status.go:107] Node "node-01" not ready: "KubeletNotReady" - container runtime is down
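Depending on the runtime in use (containerd on most recent clusters, docker on older ones), verify that the runtime service is running and responding, for example:
# systemctl status containerd
# crictl info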
- Networking Issues: Errors with the CNI (Container Network Interface) plugin, such as IP exhaustion or configuration issues.
Jan 17 10:12:50 node-01 kubelet[12345]: E0117 10:12:50.876543 12345 cni.go:333] Error adding default network: failed to allocate for range 0: no IP addresses available in range set: 10.244.1.0/24
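To gauge whether the node's pod CIDR is exhausted, compare the allocated range against the number of pods scheduled on the node (node-01 is the sample node name from the log above):
# kubectl get node node-01 -o jsonpath='{.spec.podCIDR}{"\n"}'
# kubectl get pods -A -owide --field-selector spec.nodeName=node-01 | wc -l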
- Pod Sync Failures: Specific pods (e.g., CoreDNS) failing to start, often due to broader node health issues.
Jan 17 10:12:45 node-01 kubelet[12345]: E0117 10:12:45.345678 12345 pod_workers.go:948] "Error syncing pod" pod="kube-system/coredns-5644d7b6d9-xyz" err="failed to \"StartContainer\" for \"coredns\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=coredns pod=coredns-5644d7b6d9-xyz_kube-system(e5a834ad-4d12-11ed-9e2f-0242ac120003)\""
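To dig into a specific failing pod, its events and previous container logs are usually the most telling (the pod name below is the sample from the log above; substitute the actual pod name):
# kubectl describe pod -n kube-system coredns-5644d7b6d9-xyz
# kubectl logs -n kube-system coredns-5644d7b6d9-xyz --previous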
- Other factors can also mark a node as NotReady; in those cases too, the kubelet and nodelet logs prove useful.
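- More generally, the node's reported conditions and recent events often point directly at the failing subsystem (node-01 below is a placeholder node name):
# kubectl describe node node-01 | grep -A10 Conditions
# kubectl get events -A --field-selector involvedObject.name=node-01 --sort-by=.lastTimestamp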
Additional Information
- It is recommended that, before rebooting the node (if done at all) to resolve the issue, all the necessary logs are captured, specifically a tarball of the /var/log/pf9 directory.
- Collect the /tmp/cluster-dump.tar.gz file generated using the below commands at the time of the issue to help identify the root cause if the issue is at the Kubernetes level.
# kubectl cluster-info dump -n kube-system -o yaml --output-directory=/tmp/cluster/cluster-dump
# tar -czvf /tmp/cluster-dump.tar.gz /tmp/cluster/cluster-dump
- Collect information related to the kube-system and impacted application namespaces at the time of the issue; this helps narrow down issues at a high level:
# kubectl get all -n kube-system -owide
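- Collecting recent cluster events alongside this output can also help correlate timelines (a suggested addition; adjust the output path as needed):
# kubectl get events -A --sort-by=.lastTimestamp > /tmp/events.txt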
- In many cases the node may also self-heal; in those cases, reviewing the same kubelet logs should help determine the root cause of the problem.
- If the traces indicate that the node is NotReady due to an issue with the Platform9 managed components, please reach out to the support team at https://support.platform9.com/hc/en-us with the below details:
- Initial findings including the errors and logs.
- Timestamp of issue.
- Cluster dump during the time of issue.
- Output of
# kubectl get all -n kube-system -owide
- Support bundle from the affected nodes and the primary master node.
Related knowledge base articles for reference:
- Node NotReady With Error "container runtime is down, PLEG is not healthy"
- Kubernetes Worker Node Reporting NotReady post Kubelet Service Restart.
- Worker Nodes in NotReady State as Master VIP is Unreachable
- Worker Nodes in NotReady due to System OOM Encountered
- Kubernetes Node in NotReady State After Reboot for Containerd Runtime Cluster
- Kubernetes Master Node in NotReady State With Message "cni plugin not initialized"
- Node went into NotReady state after updating CPU manager policy
- Node in NotReady state and nodelet phases stuck at "Configure and start kube-proxy" stage