Nodelet Phases Stuck At Master Node Due to CA Certificate Issue, Which In Turn Affected All worker n

Problem

Multiple issues were identified during a Kubernetes cluster upgrade from v1.24 to v1.25.

  • During the upgrade, the nodelet phases got stuck at the etcd phase on the master node where the upgrade was started. The etcd log showed rejected connections:

{"log":"{\"level\":\"warn\",\"ts\":\"2024-04-24T08:16:49.171922Z\",\"caller\":\"embed/config_logging.go:169\",\"msg\":\"rejected connection\",\"remote-addr\":\"10.96.8.51:58162\",\"server-name\":\"\",\"error\":\"tls: failed to verify certificate: x509: certificate signed by unknown authority\"}

To address this, a PMK stack restart was performed on all master nodes. As a side effect, all worker nodes transitioned to the NotReady state once the master node upgrade and CA chain rotation completed after the stack restart.

E0424 09:18:50.382151 1746680 kubelet.go:2424] "Error getting node" err="node \"kube-837943-zone1-worker28\" not found"
E0424 09:18:53.106797 1746680 kubelet_node_status.go:92] "Unable to register node with API server" err="Unauthorized" node="kube-837943-zone1-worker28"
  • The existing kubeconfig, which carries the previous CA certificate, becomes invalid after the cluster is upgraded.
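
The two symptoms above can be recognized in collected logs by their error strings. The following is a minimal sketch, not a Platform9 tool: the function name and the returned labels are hypothetical, and the matched substrings are taken only from the log excerpts shown in this article.

```python
from typing import Optional

# Substrings copied from the log excerpts in this article.
ETCD_CA_ERROR = "x509: certificate signed by unknown authority"
KUBELET_REGISTER_ERROR = "Unable to register node with API server"
KUBELET_UNAUTHORIZED = 'err="Unauthorized"'


def classify_log_line(line: str) -> Optional[str]:
    """Return a hypothetical symptom label for a log line, or None if no match."""
    if ETCD_CA_ERROR in line:
        # etcd rejected a TLS client whose certificate chain it does not trust.
        return "etcd-untrusted-ca"
    if KUBELET_REGISTER_ERROR in line and KUBELET_UNAUTHORIZED in line:
        # kubelet could not authenticate to the API server.
        return "kubelet-unauthorized"
    return None
```

Running each line of the etcd container log and the kubelet log through `classify_log_line` gives a quick indication of whether a node is hitting the CA problem described here or something unrelated.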

Environment

  • Platform9 Managed Kubernetes - v5.9

  • Kubernetes Version 1.24+

Cause

  • The issue stemmed from the pf9-kube code, which failed to use the entire CA chain when generating certificates after a certificate rotation. This was not detected during testing, primarily because the test procedure omitted the step of restarting the management plane service (Qbert) pod after initiating the certificate rotation.

  • After cluster CA rotation, the old CA certificate is not available in the CA chain.
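
Whether a given CA bundle still includes the old CA certificate can be checked by comparing certificate fingerprints. The sketch below assumes both the bundle and the old CA certificate are available as PEM text; the function names are hypothetical, and the file locations on a PMK node are not covered here.

```python
import base64
import hashlib
import re
from typing import List

# Matches the base64 body of every certificate block in a PEM bundle.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def cert_fingerprints(pem_text: str) -> List[str]:
    """Return the SHA-256 fingerprint of each certificate in a PEM bundle."""
    fingerprints = []
    for body in PEM_BLOCK.findall(pem_text):
        # Strip line breaks, then decode the base64 body to DER bytes.
        der = base64.b64decode("".join(body.split()))
        fingerprints.append(hashlib.sha256(der).hexdigest())
    return fingerprints


def bundle_contains(bundle_pem: str, ca_pem: str) -> bool:
    """True if every certificate in ca_pem is also present in bundle_pem."""
    wanted = cert_fingerprints(ca_pem)
    present = set(cert_fingerprints(bundle_pem))
    return bool(wanted) and all(fp in present for fp in wanted)
```

If `bundle_contains(chain, old_ca)` returns `False` after a rotation, clients that still present certificates signed by the old CA would be expected to fail verification, which matches the "signed by unknown authority" error above.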

Resolution

  • The issue is resolved in PMK version 5.10 with Kubernetes version 1.25 or later.

Workaround

  • To unblock the upgrade, restart the nodelet phases on the unaffected master nodes first, and then restart the phases on the affected master node.

  • To recover from the NotReady state, proceed with the worker node upgrades. After the upgrade, the nodes that were previously NotReady transition to the Ready state, because they have been upgraded to the required version.
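
Recovery progress can be tracked by listing the nodes that are still not Ready together with their kubelet versions. This is a minimal sketch that assumes the input is the parsed output of `kubectl get nodes -o json`; the function name is hypothetical.

```python
from typing import Dict


def not_ready_nodes(node_list: dict) -> Dict[str, str]:
    """Map each node that is not Ready to its reported kubelet version."""
    result = {}
    for node in node_list.get("items", []):
        status = node.get("status", {})
        # A node is Ready only if its "Ready" condition has status "True".
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            version = status.get("nodeInfo", {}).get("kubeletVersion", "unknown")
            result[node["metadata"]["name"]] = version
    return result
```

For example, save the output of `kubectl get nodes -o json` to a file, load it with `json.load`, and pass the resulting dictionary to `not_ready_nodes`. An empty result means every node reports Ready.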
