Highly Available Management Plane Recovery Guide
Introduction
This guide walks through recovery scenarios for the SMCP Management Station in various deployment configurations.
Single Node Cluster
There is no high availability in a single-node cluster. If a single-node management plane goes down, the workload clusters will lose connectivity to it.
It is recommended to always run at least a 3-node cluster in production environments.
Node Failure in a 3-Node Cluster
In the event of a node failure within a 3-node cluster, no manual intervention is required.
Kubernetes automatically detects the node as NotReady. However, failover may take more than 5 minutes before Kubernetes begins to reschedule the pods onto the remaining active nodes, because pods by default tolerate a NotReady or unreachable node for 300 seconds before they are evicted.
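To confirm that the failed node has been marked NotReady and to watch pods being rescheduled, you can run, for example:
kubectl get nodes
kubectl get pods -A -o wide -w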
While pods are being rescheduled and until services are running again, you may experience some disruption on the management plane if critical services such as Keystone or K8sniff were running on the failed node. The workload clusters themselves remain unaffected.
Once Kubernetes has rescheduled all affected services onto the remaining active nodes, the management plane becomes operational again. The duration of this process can vary and may exceed 5 minutes, depending on factors such as available capacity on the remaining nodes.
Make sure the remaining active nodes have enough free CPU, memory, and disk to run the pods from the failed node.
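To verify available headroom, you can check node-level usage and allocations, for example (kubectl top requires metrics-server to be installed):
kubectl top nodes
kubectl describe node <node-name>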
Two Nodes Down in a 3-Node Cluster
The management plane will not be accessible because the database services require a minimum of 2 operational nodes to maintain quorum. Even if one or both of the failed nodes are recovered and brought back up later, manual intervention may be required to restore Percona.
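To assess the state of the Percona cluster once nodes start coming back, you can inspect the pods and, if the Percona XtraDB Cluster operator's CRD is installed (its short resource name is typically pxc, though this may vary by operator version), the cluster resource itself, for example:
kubectl get pods -n percona
kubectl get pxc -n percona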
Percona Recovery:
In most cases, Percona automatically handles the recovery process when a cluster quorum failure occurs, meaning when two nodes in the cluster fail. This is achieved by setting the autoRecovery feature to true. However, there may be instances where manual intervention is required to ensure a successful recovery.
The Percona pods in the cluster are named percona-db-pxc-db-pxc-<0,1,2>, and each pod contains a container named pxc.
If the second node is brought back online but the Percona pods remain stuck in the Terminating state, you can forcibly delete all the Percona pods using the following command:
kubectl delete pods -n percona percona-db-pxc-db-pxc-0 percona-db-pxc-db-pxc-1 percona-db-pxc-db-pxc-2 --force
This action will not cause any data loss as it does not affect the Persistent Volume Claims (PVCs) associated with the pods.
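You can confirm that the PVCs remain Bound and watch the pods being recreated with, for example:
kubectl get pvc -n percona
kubectl get pods -n percona -w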
The Percona pods should now start coming back up. However, in some cases the Percona pods may report a FULL_PXC_CLUSTER_CRASH error after booting and fail to recover automatically.
An easy way to check for this is to inspect the logs of the percona-db-pxc pods (replace <0,1,2> with the right number):
kubectl logs -n percona percona-db-pxc-db-pxc-<0,1,2> -c pxc -f
Look for an error mentioning FULL_PXC_CLUSTER_CRASH in the log output.
This means the Percona cluster has crashed and cannot determine which of its pods holds the latest data. To recover, you need to obtain the sequence number for each Percona pod and identify the pod with the highest sequence number, which indicates it holds the latest data. The sequence number is reported in the same crash log; for example, a pod might report a sequence number of 41366. Check all the Percona pods and identify the one with the highest sequence number.
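As a convenience, you can pull the most recently logged sequence number from each pod in one pass; the grep pattern below assumes the crash log line contains the word seqno, which may differ between Percona versions, so adjust it to match your logs:
for i in 0 1 2; do
  echo "percona-db-pxc-db-pxc-$i:"
  kubectl logs -n percona percona-db-pxc-db-pxc-$i -c pxc | grep -i seqno | tail -n 1
done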
Then, execute the following command on the pod with the highest sequence number (assuming the percona-db-pxc-db-pxc-1 pod in this example):
kubectl -n percona exec percona-db-pxc-db-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
This command triggers a restart of the mysqld process within that pod, allowing the mysqld cluster to eventually re-form. If everything is successful, the pod status should look like this:
root@test-pf9-du-host-airgap-c7-mm-2499804-940-3 ~ # kubectl get pods -n percona
NAME                                            READY   STATUS        RESTARTS   AGE
percona-db-pxc-db-haproxy-0                     2/2     Terminating   0          5h46m
percona-db-pxc-db-haproxy-1                     2/2     Running       7          5h45m
percona-db-pxc-db-haproxy-2                     2/2     Running       8          5h44m
percona-db-pxc-db-pxc-0                         3/3     Running       1          105m
percona-db-pxc-db-pxc-1                         3/3     Running       0          104m
percona-db-pxc-db-pxc-2                         0/3     Pending       0          104m
percona-operator-pxc-operator-c547b4cd5-74kzs   1/1     Running       9          5h47m
Wait for all the containers in the percona-db-pxc-db-haproxy-* pods to be up and running (Ready); at that point, the recovery is complete.
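Instead of polling manually, you can block until the haproxy pods report Ready, for example:
kubectl wait --for=condition=Ready pod percona-db-pxc-db-haproxy-0 percona-db-pxc-db-haproxy-1 percona-db-pxc-db-haproxy-2 -n percona --timeout=600s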