Node Unable to Join the Cluster After Reboot While pf9-vault Pod Is Down in LTS2 Setup
Problem
In an LTS2 workload cluster, after a worker or master node is rebooted, the node is unable to rejoin the cluster. On the affected node, nodeletd is stuck in the first phase, Certificate Generation. There are failed pods in the Management Cluster, including the Sunpike and pf9-vault pods.
The nodeletd phase is stuck in certificate generation because the node's communication with the pf9-vault service in the management plane, which is required to generate the certificates, is failing.
INDEX NUMBER   PHASE                                          STATUS
1              Generate certs / Send signing request to CA    failed
2              Prepare configuration
3              Configure Container Runtime
4              Start Container Runtime
5              Network configuration
6              Configure CNI plugin
7              Miscellaneous scripts and checks
8              Configure and start kubelet
9              Configure and start kube-proxy
10             Wait for k8s services and network to be up
11             Apply and validate node taints
12             Apply dynamic kubelet configuration
13             Uncordon node
14             Drain all pods (stop only operation)
15             Configure and start monitoring

Platform9 Kubernetes stack is not running
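If this behaviour is seen, the state of the pf9-vault and Sunpike pods in the Management Cluster can be checked first. A minimal check, assuming the management plane pods run in the same namespace as shown in the Diagnosis section below:
$ kubectl get pods -n aks | grep -E 'vault|sunpike'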
Environment
- Platform9 Edge Cloud - v5.6 and higher
Answer
- It was observed that communication between pf9-vault and the Percona database is failing; the root cause is yet to be identified.
- This prevents the Sunpike pods from coming up, which in turn blocks the Nodelet certificate generation phase after a reboot, when the node tries to rejoin the cluster.
- The above behaviour was due to a limitation in the product's architecture. It is fixed in SMCP-5.10+ releases. The change was tracked under JIRA AIR-1097; AIR-1091 was also fixed as a result of correcting this behaviour.
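The database errors described above are visible in the pf9-vault pod logs (sample output is shown in the Diagnosis section below). A quick way to pull them, assuming the pod runs in the same namespace as the Sunpike pods; replace the placeholder with the actual pod name from kubectl get pods -n aks:
$ kubectl logs -n aks <pf9-vault-pod> --tail=100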
Diagnosis
Check the status of the PF9 Sunpike pods to see whether they are in CrashLoopBackOff state:
$ kubectl get pods -n aks|grep CrashLoopBackOff
pf9-dex-d8f48754c-kvntc 0/1 CrashLoopBackOff 5 3m13s
sunpike-apiserver-b9585b59f-4hj7z 2/3 CrashLoopBackOff 4 3m20s
sunpike-conductor-6c45ccfd88-r4dtl 4/5 CrashLoopBackOff 5 4m27s
sunpike-controller-manager-9d554dc74-cckxz 1/2 CrashLoopBackOff 5 4m17s
sunpike-kube-controllers-b4fbdbcf8-vc9sl 1/2 CrashLoopBackOff 5 4m6s
Check the pf9-vault pod log:
2023-05-26T15:01:27.743Z [ERROR] secrets.system.system_63b63fbf: error occurred during enable mount: path=pki/ error="path is already in use at pki/"
[mysql] 2023/05/26 15:16:49 packets.go:37: read tcp 10.20.2.234:45758->10.21.2.7:3306: read: connection timed out.
2023-05-26T15:16:49.487Z [ERROR] expiration: error restoring leases: error="failed to scan for leases: list failed at path \"\": failed to execute statement: invalid connection"
2023-05-26T15:16:49.487Z [ERROR] core: shutting down
2023-05-26T15:16:49.487Z [INFO] core: marked as sealed
2023-05-26T15:16:49.487Z [INFO] core: pre-seal teardown starting
2023-05-26T15:16:49.487Z [INFO] rollback: stopping rollback manager
2023-05-26T15:16:49.488Z [INFO] core: pre-seal teardown complete
2023-05-26T15:16:49.488Z [INFO] core: stopping cluster listeners
2023-05-26T15:16:49.488Z [INFO] core.cluster-listener: forwarding rpc listeners stopped
2023-05-26T15:16:49.796Z [INFO] core.cluster-listener: rpc listeners successfully shut down
2023-05-26T15:16:49.796Z [INFO] core: cluster listeners successfully shut down
2023-05-26T15:16:49.796Z [INFO] core: vault is sealed
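The log shows pf9-vault timing out against the Percona database at 10.21.2.7:3306 and then sealing itself. As an optional extra check, the database endpoint taken from your own vault log can be probed from a management plane host (this assumes nc is available there; substitute the IP and port seen in your log):
$ nc -zvw5 10.21.2.7 3306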
Workaround
Restarting the pf9-vault pod should re-initiate the connection between pf9-vault and the Percona DB, allowing the Sunpike pods to become active again, after which certificate generation should succeed in the nodeletd certificate generation phase.
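A minimal sketch of the restart, assuming the pod name and namespace match the Diagnosis output above (the pod is recreated automatically by its controller):
$ kubectl delete pod -n aks <pf9-vault-pod>
$ kubectl get pods -n aks | grep -E 'vault|sunpike'
Once the Sunpike pods are back in Running state, nodeletd on the affected node should get past the certificate generation phase on its next retry.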
Additional Information
After a node reboot, when the nodeletd phases restart, the node is expected to reuse the existing Kubernetes certificates already present on the node. For some reason the certificates were missing on this node, which is why it had to reach the pf9-vault service in the management plane for certificate generation.
In the ideal case, a node reboot does not require any management plane communication for the node to rejoin the cluster.
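To check whether the node still has its certificates before it contacts the management plane, the certificate directory used by the Platform9 Kubernetes stack on the node can be listed; the path below is an assumption for a typical pf9-kube install and may differ between releases:
$ sudo ls -l /etc/pf9/kube.d/certs/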