How to Troubleshoot Node NotReady Issues
Problem
How to troubleshoot node NotReady issues, and what information needs to be checked and collected when these issues occur.
Environment
Platform9 Edge Cloud - v5.3.0 and higher.
Platform9 Managed Kubernetes - v5.6 and higher.
Platform9 Self Managed Cloud - v5.9 and higher.
Diagnostic Steps
Listed below are the most frequently observed issues and causes of node NotReady, with explanations and sample log traces.
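Before digging into a specific node, it helps to confirm which nodes are currently NotReady. A minimal sketch (sample `kubectl get nodes` output is embedded so the snippet is self-contained; the node names and versions are illustrative, and on a live cluster you would pipe the real command output into the same filter):

```shell
# Filter NotReady nodes from `kubectl get nodes`-style output.
# On a live cluster, run instead:
#   kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}'
printf '%s\n' \
  'NAME      STATUS     ROLES    AGE   VERSION' \
  'node-01   NotReady   worker   12d   v1.26.6' \
  'node-02   Ready      worker   12d   v1.26.6' |
awk 'NR > 1 && $2 == "NotReady" {print $1}'
```

This prints only the names of NotReady nodes, which is a convenient starting list for the per-node checks below.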
Observe the traces from the nodeletd phases status output on the affected node using:
# /opt/pf9/nodelet/nodeletd status --verbose
Check /var/log/pf9/nodelet.log and try to identify errors and patterns.
Review the kubelet logs under /var/log/pf9/kubelet/ on the affected node; if any of the below errors are seen, take action accordingly.
Unhealthy PLEG issues. Refer: https://platform9.com/kb/kubernetes/node-not-ready-with-error-container-runtime-is-down-pleg-is-not
TLS Handshake Timeout: Connectivity issues between the kubelet and the API server.
Jan 17 10:12:36 node-01 kubelet[12345]: E0117 10:12:36.123456 12345 kubelet.go:1845] Failed to update node status, error: Get "https://10.0.0.1:6443/api/v1/nodes/node-01?timeout=10s": net/http: TLS handshake timeout
Eviction Manager Errors: Node is unable to retrieve resource usage stats, possibly due to connectivity issues or resource exhaustion.
Container Runtime Down: The container runtime (e.g., Docker or containerd) may not be running or responding.
Networking Issues: Errors with the CNI (Container Network Interface) plugin, such as IP exhaustion or configuration issues.
Pod Sync Failures: Specific pods (e.g., CoreDNS) failing to start, often due to broader node health issues.
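The error signatures above can be scanned for in one pass. A sketch using the sample TLS log line from this article as embedded input (on a real node, run the same grep against the files under /var/log/pf9/kubelet/ instead):

```shell
# Scan a kubelet log line for common NotReady error signatures.
# The sample line is embedded here so the snippet is self-contained;
# on a node, grep /var/log/pf9/kubelet/kubelet*.log with the same pattern.
line='E0117 10:12:36.123456 12345 kubelet.go:1845] Failed to update node status, error: net/http: TLS handshake timeout'
echo "$line" | grep -oE 'PLEG is not healthy|TLS handshake timeout|eviction manager|container runtime is down'
```

Each match points you at the corresponding bullet above (PLEG health, API server connectivity, eviction manager, or container runtime).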
Other factors can also mark a node as NotReady; in those cases too, the kubelet and nodelet logs prove useful.
Additional Information
It is recommended that, before rebooting the node (if done at all) to resolve the issue, all the necessary logs are captured, specifically a tarball of the /var/log/pf9 directory.
Collect the /tmp/cluster-dump.tar.gz file generated during the time of the issue to identify the root cause, if the issue is at the Kubernetes level.
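Capturing the log tarball can be sketched as follows. This is shown against a temporary directory so the snippet is self-contained; on a real node, point LOGDIR at /var/log/pf9 (the Platform9 tooling that produces /tmp/cluster-dump.tar.gz may differ from this manual approach):

```shell
# Create a tarball of the pf9 log directory before any reboot.
# LOGDIR is a temporary stand-in here; on a node use LOGDIR=/var/log/pf9
LOGDIR=$(mktemp -d)
echo 'sample log entry' > "$LOGDIR/nodelet.log"
tar -czf /tmp/pf9-logs.tar.gz -C "$LOGDIR" .
# Verify the archive contents before rebooting
tar -tzf /tmp/pf9-logs.tar.gz
```

Verifying the archive listing before a reboot ensures the evidence survives even if the node does not come back cleanly.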
Collect information about the kube-system and impacted application namespaces during the time of the issue; this helps narrow down the problem at a high level.
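For the kube-system check, a quick filter for pods that are not Running often surfaces the symptom (for example a CoreDNS pod failing to start, as noted above). Sample output is embedded so the snippet is self-contained; the pod names are illustrative, and on a live cluster you would pipe the real command output into the same filter:

```shell
# List kube-system pods that are not in the Running phase.
# On a live cluster, run instead:
#   kubectl get pods -n kube-system --no-headers | awk '$3 != "Running" {print $1, $3}'
printf '%s\n' \
  'coredns-5d78c9869d-abcde   0/1   CrashLoopBackOff   7   12m' \
  'calico-node-xyz12          1/1   Running            0   12d' |
awk '$3 != "Running" {print $1, $3}'
```

Unhealthy pods in kube-system on a NotReady node usually reflect the broader node issue rather than an application problem.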
In many cases the node may also self-heal; in those cases, reviewing the same kubelet logs should help determine the root cause of the problem.
If you find traces indicating that the node is NotReady due to an issue with the Platform9 managed components, please reach out to the support team at https://support.platform9.com/hc/en-us with the below details:
Initial findings including the errors and logs.
Timestamp of issue.
Cluster dump during the time of issue.
Output of:
# kubectl get all -n kube-system -owide
Support bundle from the affected nodes and the primary master node.
