Node Unable to Join the Cluster After Reboot While pf9-vault Pod Is Down in LTS2 Setup
Problem
In an LTS2 workload cluster, after a worker or master node is rebooted, the node is unable to rejoin the cluster. On the affected node, nodeletd is stuck in the first phase, Certificate Generation. There are failed pods in the Management Cluster, including the Sunpike and pf9-vault pods.
The nodeletd phase is stuck in certificate generation because the node's communication with the pf9-vault service in the management plane, which is required to generate the certificates, is failing.
INDEX NUMBER   PHASE                                          STATUS
1              Generate certs / Send signing request to CA    failed
2              Prepare configuration
3              Configure Container Runtime
4              Start Container Runtime
5              Network configuration
6              Configure CNI plugin
7              Miscellaneous scripts and checks
8              Configure and start kubelet
9              Configure and start kube-proxy
10             Wait for k8s services and network to be up
11             Apply and validate node taints
12             Apply dynamic kubelet configuration
13             Uncordon node
14             Drain all pods (stop only operation)
15             Configure and start monitoring

Platform9 Kubernetes stack is not running
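If this behaviour is seen, the state of the pf9-vault and Sunpike pods in the Management Cluster can be checked first. A minimal check, assuming the management plane pods run in the same namespace as shown in the Diagnosis section below:
$ kubectl get pods -n aks | grep -E 'vault|sunpike'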
Environment
- Platform9 Edge Cloud - v5.6 and higher
Answer
- It was observed that communication between pf9-vault and the Percona database is failing; the root cause is yet to be identified.
- This prevents the Sunpike pods from coming up, which in turn blocks the Nodelet certificate generation phase after a reboot, when the node tries to rejoin the cluster.
- The above behaviour was due to a limitation in the product's architecture. It is fixed in SMCP-5.10+ releases. The change was tracked under JIRA AIR-1097; AIR-1091 was also fixed as a result of correcting this behaviour.
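The database errors described above are visible in the pf9-vault pod logs (sample output is shown in the Diagnosis section below). A quick way to pull them, assuming the pod runs in the same namespace as the Sunpike pods; replace the placeholder with the actual pod name from kubectl get pods -n aks:
$ kubectl logs -n aks <pf9-vault-pod> --tail=100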
Diagnosis
Check the status of the PF9 Sunpike pods to see whether they are in CrashLoopBackOff state:
$ kubectl get pods -n aks|grep CrashLoopBackOff
pf9-dex-d8f48754c-kvntc 0/1 CrashLoopBackOff 5 3m13s
sunpike-apiserver-b9585b59f-4hj7z 2/3 CrashLoopBackOff 4 3m20s
sunpike-conductor-6c45ccfd88-r4dtl 4/5 CrashLoopBackOff 5 4m27s
sunpike-controller-manager-9d554dc74-cckxz 1/2 CrashLoopBackOff 5 4m17s
sunpike-kube-controllers-b4fbdbcf8-vc9sl 1/2 CrashLoopBackOff 5 4m6s
Check the pf9-vault pod log:
2023-05-26T15:01:27.743Z [ERROR] secrets.system.system_63b63fbf: error occurred during enable mount: path=pki/ error="path is already in use at pki/"
[mysql] 2023/05/26 15:16:49 packets.go:37: read tcp 10.20.2.234:45758->10.21.2.7:3306: read: connection timed out.
2023-05-26T15:16:49.487Z [ERROR] expiration: error restoring leases: error="failed to scan for leases: list failed at path \"\": failed to execute statement: invalid connection"
2023-05-26T15:16:49.487Z [ERROR] core: shutting down
2023-05-26T15:16:49.487Z [INFO] core: marked as sealed
2023-05-26T15:16:49.487Z [INFO] core: pre-seal teardown starting
2023-05-26T15:16:49.487Z [INFO] rollback: stopping rollback manager
2023-05-26T15:16:49.488Z [INFO] core: pre-seal teardown complete
2023-05-26T15:16:49.488Z [INFO] core: stopping cluster listeners
2023-05-26T15:16:49.488Z [INFO] core.cluster-listener: forwarding rpc listeners stopped
2023-05-26T15:16:49.796Z [INFO] core.cluster-listener: rpc listeners successfully shut down
2023-05-26T15:16:49.796Z [INFO] core: cluster listeners successfully shut down
2023-05-26T15:16:49.796Z [INFO] core: vault is sealed
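The log shows pf9-vault timing out against the Percona database at 10.21.2.7:3306 and then sealing itself. As an optional extra check, the database endpoint taken from your own vault log can be probed from a management plane host (this assumes nc is available there; substitute the IP and port seen in your log):
$ nc -zvw5 10.21.2.7 3306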
Workaround
Restarting the pf9-vault pod should re-initiate the connection between pf9-vault and the Percona DB, allowing the Sunpike pods to become active again, after which certificate generation should succeed in the nodeletd certificate generation phase.
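A minimal sketch of the restart, assuming the pod name and namespace match the Diagnosis output above (the pod is recreated automatically by its controller):
$ kubectl delete pod -n aks <pf9-vault-pod>
$ kubectl get pods -n aks | grep -E 'vault|sunpike'
Once the Sunpike pods are back in Running state, nodeletd on the affected node should get past the certificate generation phase on its next retry.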
Additional Information
After a node reboot, when the nodeletd phases restart, the node is expected to reuse the existing Kubernetes certificates already present on the node. For some reason the certificates were missing on this node, which is why it had to reach the pf9-vault service in the management plane for certificate generation.
In the ideal case, a node reboot does not require any management plane communication for the node to rejoin the cluster.
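To check whether the node still has its certificates before it contacts the management plane, the certificate directory used by the Platform9 Kubernetes stack on the node can be listed; the path below is an assumption for a typical pf9-kube install and may differ between releases:
$ sudo ls -l /etc/pf9/kube.d/certs/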