Nodelet Phases Stuck on Master Node Due to CA Certificate Issue, Leaving All Worker Nodes in NotReady State

Problem

Multiple issues were identified during a Kubernetes cluster upgrade from v1.24 to v1.25.

  • During the upgrade, nodelet got stuck at the etcd phase on the master node where the upgrade was started.
ETCD Logs

To address the issue, a PMK stack restart was performed on all master nodes. As a side effect, all worker nodes transitioned to the NotReady state following the master node upgrade and CA chain rotation that came after the stack restart.

Kubelet Logs
  • The existing kubeconfig with the previous CA certificate becomes invalid after the cluster is upgraded.

Environment

  • Platform9 Managed Kubernetes - v5.9
  • Kubernetes Version 1.24+

Cause

  • The issue stemmed from the pf9-kube code, which did not use the entire CA chain when generating certificates after a certificate rotation. This was not detected during testing, primarily because the test procedure omitted restarting the management plane service (Qbert) pod after initiating the certificate rotation.
  • After cluster CA rotation, the old CA certificate is not available in the CA chain.
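
The condition described above (a kubeconfig that still carries the previous CA while the chain no longer contains it) can be checked offline. The following is a minimal sketch, not part of pf9-kube or any Platform9 tooling. It assumes the kubeconfig embeds its CA as a base64-encoded `certificate-authority-data` value and that the CA chain is available as a PEM bundle; the function names are hypothetical.

```python
import base64
import hashlib
import re

# Matches the base64 body of each certificate in a PEM bundle.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def cert_fingerprints(pem_text: str) -> list[str]:
    """Return the SHA-256 fingerprint of every certificate in a PEM bundle."""
    fingerprints = []
    for body in PEM_CERT.findall(pem_text):
        # Strip line breaks, decode to DER bytes, then hash the DER encoding.
        der = base64.b64decode("".join(body.split()))
        fingerprints.append(hashlib.sha256(der).hexdigest())
    return fingerprints


def kubeconfig_ca_in_chain(kubeconfig_text: str, chain_pem: str) -> bool:
    """Return True if every CA embedded in the kubeconfig is also in the chain.

    Assumption: the CA is stored inline as certificate-authority-data
    (base64 of a PEM file), not referenced by file path.
    """
    encoded = re.findall(r"certificate-authority-data:\s*(\S+)", kubeconfig_text)
    embedded = []
    for value in encoded:
        pem = base64.b64decode(value).decode("ascii")
        embedded.extend(cert_fingerprints(pem))
    chain = set(cert_fingerprints(chain_pem))
    # No embedded CA at all is treated as "not verifiable", hence False.
    return bool(embedded) and all(fp in chain for fp in embedded)
```

A result of False for a kubeconfig issued before the rotation would be consistent with the old CA certificate having dropped out of the chain.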

Resolution

  • The issue has been resolved in PMK version 5.10 with Kubernetes version 1.25 or later.

Workaround

  • To unblock the upgrade, restart the nodelet phases on the unaffected master nodes first, and then restart the phases on the affected master node.
Phases restart
  • To recover from the NotReady state, proceed with the upgrades of the worker nodes. After the upgrade, the nodes that were previously NotReady will transition to the Ready state, as they have been upgraded to the required version.
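
To confirm that the worker nodes have returned to the Ready state after their upgrade, the node list can be inspected programmatically. The sketch below is illustrative only and assumes the input is the JSON produced by `kubectl get nodes -o json`; the function name is hypothetical.

```python
import json


def not_ready_nodes(node_list_json: str) -> list[str]:
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in json.loads(node_list_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, "False", and "Unknown" all count as NotReady.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names
```

Passing the saved output of `kubectl get nodes -o json` to `not_ready_nodes` before and after the worker upgrades should show the list shrinking to empty once all nodes are Ready.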