# Node Unable to Join the Cluster After Reboot While pf9-vault Pod Is Down in LTS2 Setup

## Problem

In an LTS2 workload cluster, worker or master nodes are unable to rejoin the cluster after a reboot. On the affected node, nodeletd is stuck in its first phase, certificate generation. At the same time, several pods in the management cluster are failing, including the Sunpike and pf9-vault pods.

The phase is stuck because the node cannot reach the pf9-vault service in the management plane, which it needs in order to generate certificates.

{% tabs %}
{% tab title="Nodeletd Phases" %}

```text
INDEX NUMBER  FILE                                         NAME    PHASE STATUS
  1             Generate certs / Send signing request to CA  failed
  2             Prepare configuration
  3             Configure Container Runtime
  4             Start Container Runtime
  5             Network configuration
  6             Configure CNI plugin
  7             Miscellaneous scripts and checks
  8             Configure and start kubelet
  9             Configure and start kube-proxy
  10            Wait for k8s services and network to be up
  11            Apply and validate node taints
  12            Apply dynamic kubelet configuration
  13            Uncordon node
  14            Drain all pods (stop only operation)
  15            Configure and start monitoring
Platform9 Kubernetes stack is not running
```

{% endtab %}
{% endtabs %}

## Environment

* Platform9 Edge Cloud - v5.6 and higher

## Answer

* The pf9-vault pod loses its connection to the Percona database. The cause of this connection failure has not yet been identified.
* With pf9-vault unavailable, the Sunpike pods cannot come up. This in turn blocks the Nodelet certificate generation phase when a rebooted node tries to rejoin the cluster.
* This behaviour stems from a limitation in the product's architecture and is fixed in SMCP-5.10+ releases. The change was tracked under JIRA **AIR-1097**. Correcting it also resolved **AIR-1091**.

### **Diagnosis**

Check whether the Sunpike pods in the management cluster are in the CrashLoopBackOff state.

{% tabs %}
{% tab title="Failed pods" %}

```shell
$ kubectl get pods -n aks | grep CrashLoopBackOff
pf9-dex-d8f48754c-kvntc                                         0/1     CrashLoopBackOff   5          3m13s
sunpike-apiserver-b9585b59f-4hj7z                               2/3     CrashLoopBackOff   4          3m20s
sunpike-conductor-6c45ccfd88-r4dtl                              4/5     CrashLoopBackOff   5          4m27s
sunpike-controller-manager-9d554dc74-cckxz                      1/2     CrashLoopBackOff   5          4m17s
sunpike-kube-controllers-b4fbdbcf8-vc9sl                        1/2     CrashLoopBackOff   5          4m6s
```

{% endtab %}
{% endtabs %}
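As a minimal sketch, the same check can match on the STATUS column instead of grepping the whole line, which avoids false hits when the string appears elsewhere. The helper name `crashlooping_pods` is hypothetical, and the namespace `aks` is simply the one used in the example above; it may differ in your deployment.

```shell
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Usage (namespace "aks" is taken from the example above and may differ):
#   kubectl get pods -n aks --no-headers | crashlooping_pods
```

An empty result means no pod in that namespace is currently crash-looping.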

Check the pf9-vault pod logs for database connection errors followed by Vault sealing itself:

{% tabs %}
{% tab title="pf9-vault pod logs" %}

```text
2023-05-26T15:01:27.743Z [ERROR] secrets.system.system_63b63fbf: error occurred during enable mount: path=pki/ error="path is already in use at pki/"
[mysql] 2023/05/26 15:16:49 packets.go:37: read tcp 10.20.2.234:45758->10.21.2.7:3306: read: connection timed out. 
2023-05-26T15:16:49.487Z [ERROR] expiration: error restoring leases: error="failed to scan for leases: list failed at path \"\": failed to execute statement: invalid connection"
2023-05-26T15:16:49.487Z [ERROR] core: shutting down
2023-05-26T15:16:49.487Z [INFO] core: marked as sealed
2023-05-26T15:16:49.487Z [INFO] core: pre-seal teardown starting
2023-05-26T15:16:49.487Z [INFO] rollback: stopping rollback manager
2023-05-26T15:16:49.488Z [INFO] core: pre-seal teardown complete
2023-05-26T15:16:49.488Z [INFO] core: stopping cluster listeners
2023-05-26T15:16:49.488Z [INFO] core.cluster-listener: forwarding rpc listeners stopped
2023-05-26T15:16:49.796Z [INFO] core.cluster-listener: rpc listeners successfully shut down
2023-05-26T15:16:49.796Z [INFO] core: cluster listeners successfully shut down
2023-05-26T15:16:49.796Z [INFO] core: vault is sealed
```

{% endtab %}
{% endtabs %}
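The signature in the log above, a database connection error followed by Vault sealing itself, can be checked for with a small filter. This is a sketch only: the helper name `vault_db_failure` is hypothetical, and the matched strings are copied from the sample log, so other releases may word them differently.

```shell
# Succeed (exit 0) if the log text on stdin shows both a database
# connection error and Vault sealing itself, as in the sample log.
vault_db_failure() {
  log=$(cat)
  printf '%s\n' "$log" | grep -Eq 'read: connection timed out|invalid connection' &&
    printf '%s\n' "$log" | grep -q 'core: vault is sealed'
}

# Usage (replace the placeholder with the actual pf9-vault pod name;
# namespace "aks" is taken from the earlier example and may differ):
#   kubectl logs -n aks <pf9-vault-pod-name> | vault_db_failure \
#     && echo "pf9-vault lost its database connection and sealed"
```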

## Workaround

Restart the pf9-vault pod. This should re-establish the connection between pf9-vault and the Percona database, which allows the Sunpike pods to become active. The nodeletd certificate generation phase should then succeed.
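One way to restart the pod is to delete it and let its owning controller recreate it. The sketch below rests on several assumptions that are not confirmed by this article: that the pod name begins with `pf9-vault`, that it runs in the `aks` namespace shown in the diagnosis example, and that it is managed by a controller that recreates deleted pods. The function names `vault_pods` and `restart_vault` are hypothetical.

```shell
# Pick pf9-vault pod names out of `kubectl get pods --no-headers`
# output on stdin. Assumes the pod name starts with "pf9-vault".
vault_pods() {
  awk '$1 ~ /^pf9-vault/ { print $1 }'
}

# Delete each pf9-vault pod so that its controller recreates it.
# The namespace defaults to "aks" (an assumption from the example above).
restart_vault() {
  ns=${1:-aks}
  kubectl get pods -n "$ns" --no-headers | vault_pods |
    while read -r pod; do
      kubectl delete pod -n "$ns" "$pod"
    done
}

# Usage:
#   restart_vault          # uses namespace "aks"
#   restart_vault my-ns    # uses a different namespace
```

After the restart, confirm with `kubectl get pods` that the pf9-vault and Sunpike pods have returned to the Running state before checking the node again.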

## Additional Information

After a node reboot, nodeletd restarts its phases and is expected to reuse the Kubernetes certificates already present on the node. In this case the certificates were missing on the node for an unidentified reason, so the node had to contact the pf9-vault service in the management plane to generate new ones.

Ideally, a rebooted node does not need to communicate with the management plane to rejoin the cluster.

