Highly Available Management Plane Recovery Guide
Introduction
This guide walks through recovery scenarios for the SMCP Management Station in various deployment configurations.
Single Node Cluster
There is no high availability in a single-node cluster. If a single-node management plane goes down, the workload clusters will lose connectivity to it.
It is recommended to always run at least a 3-node cluster in production environments.
A node is down in a 3-node cluster
No manual intervention is required in this scenario.
Kubernetes marks the node as NotReady and, after 5 minutes, starts rescheduling the pods that were running on the failed node onto the other active nodes. During this 5-minute window, you may notice some service disruption on the management plane if services such as keystone and/or k8sniff were running on the failed node. The workload clusters will be unaffected.
Once the services have all been rescheduled onto other nodes, the management plane should become operational again.
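No action is needed, but if you want to watch the recovery progress, the standard kubectl views are sufficient; for example:
kubectl get nodes
kubectl get pods -A -o wide -w
The failed node shows up as NotReady, and after the eviction timeout you should see its pods being recreated on the remaining nodes.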
Please make sure that the remaining active nodes have enough free CPU, memory, and disk to run the pods from the failed node.
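A quick way to sanity-check the remaining capacity is shown below. Note that kubectl top requires the metrics-server add-on; if it is not installed, kubectl describe still reports the requests already allocated on each node (replace <node-name> accordingly):
kubectl top nodes
kubectl describe node <node-name> | grep -A 8 'Allocated resources'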
Two nodes are down in a 3-node cluster
The management plane will not be accessible, because the database services require a minimum of 2 nodes to be operational to maintain quorum. Even if one or both of the failed nodes are recovered and brought back up at a later point in time, manual intervention might be required to restore Percona.
Percona Recovery:
In most cases, Percona automatically handles the recovery process when a cluster quorum failure occurs, that is, when two nodes in the cluster fail. This is achieved by setting the autoRecovery feature to true. However, there may be instances where manual intervention is required to ensure a successful recovery.
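You can verify whether automatic recovery is enabled by inspecting the PerconaXtraDBCluster custom resource. The resource name below (percona-db-pxc-db) is inferred from the pod names and may differ in your deployment; list the resources first if you are unsure:
kubectl get pxc -n percona
kubectl get pxc percona-db-pxc-db -n percona -o jsonpath='{.spec.pxc.autoRecovery}{"\n"}'
A value of true means the operator will attempt the crash recovery described below on its own.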
The Percona pods in the cluster are named percona-db-pxc-db-pxc-<0,1,2>, and each of them contains a container called pxc.
If the second node is brought back online but the Percona pods remain stuck in the Terminating state, you can forcibly delete all the Percona pods using the following command:
kubectl delete pods -n percona percona-db-pxc-db-pxc-0 percona-db-pxc-db-pxc-1 percona-db-pxc-db-pxc-2 --force
This action will not cause any data loss, as it does not affect the Persistent Volume Claims (PVCs) associated with the pods.
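If you want to reassure yourself that the data survives the force delete, list the PVCs in the namespace before and after deleting the pods; they should remain Bound throughout. The datadir-* naming mentioned in the comment follows the operator's default convention and may differ slightly in your deployment:
kubectl get pvc -n percona
# Expect PVCs such as datadir-percona-db-pxc-db-pxc-0/1/2 to stay Bound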
Now you should start seeing the Percona pods come up. However, in some cases, the Percona pods might report a FULL_PXC_CLUSTER_CRASH error after booting and fail to recover automatically.
An easy way to check for this is by looking at the logs of the percona-db-pxc pods (replace <0,1,2> with the right number):
kubectl logs -n percona percona-db-pxc-db-pxc-<0,1,2> -c pxc -f
Look for an error mentioning FULL_PXC_CLUSTER_CRASH.
This error means the Percona cluster has crashed and cannot determine which of its pods holds the latest data. To recover, you need to obtain the sequence number for each Percona pod and identify the pod with the highest sequence number, which indicates that it holds the latest data. The sequence number is reported in the same crash log output; for example, a pod's crash log might report a sequence number of 41366. Check all the Percona pods and identify the one with the highest sequence number.
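As a convenience, a small loop along these lines can pull the latest sequence-number line from each pod's logs. The grep pattern 'seqno' is an assumption about how the crash report labels the value; the exact wording varies between Percona versions, so adjust it to match the output you actually see:
for i in 0 1 2; do
  echo "percona-db-pxc-db-pxc-$i:"
  # print the most recent log line that mentions the sequence number
  kubectl logs -n percona percona-db-pxc-db-pxc-$i -c pxc | grep -i 'seqno' | tail -n 1
done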
Then, execute the following command on the pod with the highest sequence number (assuming the percona-db-pxc-db-pxc-1 pod in this example):
kubectl -n percona exec percona-db-pxc-db-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
This command restarts the mysqld process within that pod, allowing the mysqld cluster to eventually re-form. If everything is successful, the pod status should look like this:
root@test-pf9-du-host-airgap-c7-mm-2499804-940-3 ~# kubectl get pods -n percona
NAME                                            READY   STATUS        RESTARTS   AGE
percona-db-pxc-db-haproxy-0                     2/2     Terminating   0          5h46m
percona-db-pxc-db-haproxy-1                     2/2     Running       7          5h45m
percona-db-pxc-db-haproxy-2                     2/2     Running       8          5h44m
percona-db-pxc-db-pxc-0                         3/3     Running       1          105m
percona-db-pxc-db-pxc-1                         3/3     Running       0          104m
percona-db-pxc-db-pxc-2                         0/3     Pending       0          104m
percona-operator-pxc-operator-c547b4cd5-74kzs   1/1     Running       9          5h47m
Wait for all the containers in the percona-db-pxc-db-haproxy-* pods to be up and running (Ready); the recovery is then successful.
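To keep an eye on the pods while the cluster re-forms, you can watch them directly, or block until they report Ready. The label selector in the kubectl wait example is an assumption based on the default labels applied by the Percona operator's Helm chart; verify it with kubectl get pods -n percona --show-labels before relying on it:
kubectl get pods -n percona -w
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/instance=percona-db-pxc-db -n percona --timeout=15m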