# Nodelet Phases Stuck At Master Node Due to CA Certificate Issue, Which In Turn Affected All worker n

## Problem

Multiple issues were identified during a Kubernetes cluster upgrade from **v1.24 to v1.25**.

* While upgrading the cluster, the nodelet phase got stuck at the etcd phase on the master node where the upgrade was started.

{% tabs %}
{% tab title="ETCD Logs" %}

```javascript
{"log":"{\"level\":\"warn\",\"ts\":\"2024-04-24T08:16:49.171922Z\",\"caller\":\"embed/config_logging.go:169\",\"msg\":\"rejected connection\",\"remote-addr\":\"10.96.8.51:58162\",\"server-name\":\"\",\"error\":\"tls: failed to verify certificate: x509: certificate signed by unknown authority\"}
```

{% endtab %}
{% endtabs %}

To address the issue, the PMK stack was restarted on all master nodes. As a side effect, all worker nodes transitioned to the NotReady state after the master node upgrade and the CA chain rotation that followed the stack restart.

{% tabs %}
{% tab title="Kubelet Logs" %}

```javascript
E0424 09:18:50.382151 1746680 kubelet.go:2424] "Error getting node" err="node \"kube-837943-zone1-worker28\" not found"
E0424 09:18:53.106797 1746680 kubelet_node_status.go:92] "Unable to register node with API server" err="Unauthorized" node="kube-837943-zone1-worker28"
```

{% endtab %}
{% endtabs %}

* The existing kubeconfig with the previous CA certificate becomes invalid after the cluster is upgraded.
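
A minimal sketch of how one might check whether a kubeconfig still embeds the cluster's current CA. It assumes the kubeconfig carries the CA inline in a `certificate-authority-data` field and that both sides hold a single PEM certificate; the function names and any file paths are illustrative and not part of PMK. The comparison is a plain text comparison with whitespace ignored, so treat a mismatch as a hint rather than proof.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names), not PMK tooling.

# Print the decoded CA embedded in the first cluster entry of a kubeconfig.
extract_kubeconfig_ca() {
  local kubeconfig="$1"
  awk '/certificate-authority-data:/ {print $2; exit}' "$kubeconfig" | base64 -d
}

# Return 0 if the embedded CA text equals the given CA file (whitespace ignored).
kubeconfig_ca_matches() {
  local kubeconfig="$1" ca_file="$2"
  local embedded current
  embedded="$(extract_kubeconfig_ca "$kubeconfig" | tr -d '[:space:]')"
  current="$(tr -d '[:space:]' < "$ca_file")"
  [ -n "$embedded" ] && [ "$embedded" = "$current" ]
}
```

For example, `kubeconfig_ca_matches ./kubeconfig ./current-ca.crt && echo "CA matches"`, where both paths are placeholders for files you have exported yourself.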

## Environment

* Platform9 Managed Kubernetes - v5.9
* Kubernetes Version 1.24+

## Cause

* The issue stemmed from the pf9-kube code, which did not use the entire CA chain when generating certificates after a certificate rotation. This was not detected during testing, primarily because the test procedure omitted restarting the management plane service (Qbert) pod after initiating the certificate rotation.
* After cluster CA rotation, the old CA certificate is not available in the CA chain.
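
As a rough illustration of the second point, the sketch below counts the PEM certificates in a CA bundle file. After a rotation that keeps the old CA in the chain, one would expect more than one certificate; a count of one suggests only a single CA is present. The function name is hypothetical, and this article does not state where the bundle lives on a node, so the path is left as a placeholder.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name), not PMK tooling.

# Print the number of PEM certificates found in the given bundle file.
count_pem_certs() {
  local bundle="$1"
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
  grep -c -- '-----BEGIN CERTIFICATE-----' "$bundle" || true
}
```

Usage, with a placeholder path: `count_pem_certs /path/to/ca-bundle.crt`.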

## Resolution

* The issue is resolved in **PMK version 5.10 with Kubernetes version 1.25 or later**.

## Workaround

* To unblock the upgrade, restart the nodelet phases on the unaffected master nodes first, and then restart the phases on the affected master node.

{% tabs %}
{% tab title="Phases restart" %}

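```bash
# Run as root on each master node, in the order described above.
systemctl stop pf9-hostagent pf9-nodeletd    # stop the host agent and nodelet services
/opt/pf9/nodelet/nodeletd phases stop        # stop the nodelet phases
systemctl start pf9-hostagent                # start the host agent
```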

{% endtab %}
{% endtabs %}

* To recover from the NotReady state, proceed with the worker node upgrades. Once upgraded to the required version, the nodes that were previously NotReady transition to the Ready state.
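
To confirm the recovery, one option is to list any nodes whose status is still not Ready. The sketch below is a small filter over the output of `kubectl get nodes --no-headers`; the function name is hypothetical, and it assumes the default column layout in which the node name is the first column and the status is the second.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name), not PMK tooling.

# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column does not start with "Ready".
list_not_ready_nodes() {
  awk '$2 !~ /^Ready/ {print $1}'
}
```

Usage: `kubectl get nodes --no-headers | list_not_ready_nodes`. Empty output means every node reports Ready (including `Ready,SchedulingDisabled`).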

