Nodelet Phases Stuck on Master Node Due to CA Certificate Issue, Leaving All Worker Nodes in NotReady State
Problem
Multiple issues were identified while performing a Kubernetes cluster upgrade from v1.24 to v1.25.
- During the upgrade, nodelet got stuck at the etcd phase on the master node where the upgrade was started, with the following error logged:
{"log":"{\"level\":\"warn\",\"ts\":\"2024-04-24T08:16:49.171922Z\",\"caller\":\"embed/config_logging.go:169\",\"msg\":\"rejected connection\",\"remote-addr\":\"10.96.8.51:58162\",\"server-name\":\"\",\"error\":\"tls: failed to verify certificate: x509: certificate signed by unknown authority\"}
- To address the issue, a PMK stack restart was performed on all master nodes. As a side effect, all worker nodes transitioned to the NotReady state after the master node upgrade and the CA chain rotation that followed the stack restart:
E0424 09:18:50.382151 1746680 kubelet.go:2424] "Error getting node" err="node \"kube-837943-zone1-worker28\" not found"
E0424 09:18:53.106797 1746680 kubelet_node_status.go:92] "Unable to register node with API server" err="Unauthorized" node="kube-837943-zone1-worker28"
- After the cluster is upgraded, the existing kubeconfig that carries the previous CA certificate becomes invalid.
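To confirm that a node is hitting the same symptoms, the two error signatures quoted above can be counted in a saved log. This is a minimal sketch: the function name `count_ca_symptoms` and the log path in the example are hypothetical, not part of any PMK tooling.

```shell
#!/usr/bin/env bash
# Count the two error signatures from this article in log lines read on stdin.
# count_ca_symptoms is an illustrative helper name, not a PMK command.
count_ca_symptoms() {
  awk '
    /x509: certificate signed by unknown authority/ { tls++ }
    /Unable to register node with API server/ && /Unauthorized/ { unauth++ }
    END { printf "unknown_authority=%d unauthorized=%d\n", tls, unauth }
  '
}

# Example with a placeholder path (adjust to wherever your logs are stored):
#   count_ca_symptoms < /path/to/saved.log
```

Non-zero counts for both signatures suggest the CA chain problem described in this article rather than an unrelated etcd or kubelet failure.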
Environment
- Platform9 Managed Kubernetes - v5.9
- Kubernetes Version 1.24+
Cause
- The issue stemmed from the pf9-kube code, which did not use the entire CA chain when generating certificates after a certificate rotation. This was not caught during testing, primarily because the test procedure omitted restarting the management plane service (Qbert) pod after initiating the certificate rotation.
- After the cluster CA rotation, the old CA certificate is not available in the CA chain.
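One way to check whether a given certificate is still trusted by the current CA chain is `openssl verify`. The sketch below wraps it in a helper; the function name `cert_trusted_by` and the file paths in the example are placeholders, since this article does not state where PMK stores these files.

```shell
#!/usr/bin/env bash
# Return 0 if the certificate verifies against the CA bundle, non-zero otherwise.
# Both paths are supplied by the caller; no PMK-specific locations are assumed.
cert_trusted_by() {
  local ca_file="$1" cert="$2"
  # Match the "<file>: OK" line rather than relying only on the exit status.
  openssl verify -CAfile "$ca_file" "$cert" 2>/dev/null | grep -q ': OK$'
}

# Example with placeholder paths:
#   cert_trusted_by /path/to/current-ca-chain.pem /path/to/node-client-cert.pem \
#     && echo "trusted" || echo "signed by a CA missing from the chain"
```

A certificate issued by the old CA would fail this check when the old CA certificate is absent from the bundle, which matches the "certificate signed by unknown authority" error shown in the Problem section.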
Resolution
- The issue is resolved in PMK version 5.10 with Kubernetes version 1.25 or later.
Workaround
- To unblock the upgrade, restart the nodelet phases on the unaffected master nodes first, and then restart the phases on the affected master node:
# systemctl stop pf9-hostagent pf9-nodeletd
# /opt/pf9/nodelet/nodeletd phases stop
# systemctl start pf9-hostagent
- To recover from the NotReady state, proceed with upgrading the worker nodes. Once a node has been upgraded to the required version, it transitions from NotReady to the Ready state.
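To track recovery while the worker upgrades proceed, the remaining NotReady nodes can be listed from `kubectl get nodes` output. This is a small sketch; the helper name `not_ready_nodes` is illustrative, and it assumes a kubeconfig that is valid for the upgraded cluster.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column starts with "NotReady".
# Expects the output of `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Example:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means no node is reporting NotReady any longer.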