Highly Available Management Plane Recovery Guide

Introduction

This guide walks through recovery scenarios for the SMCP Management Station in various deployment configurations.

Single Node Cluster

A single-node cluster provides no high availability. If a single-node management plane goes down, the workload clusters lose connectivity to it.

It is recommended to always run at least a 3-node cluster in production environments.

Node Failure in a 3-Node Cluster

In the event of a node failure within a 3-node cluster, no manual intervention is required.

Kubernetes automatically detects the failed node as NotReady. However, it may take more than 5 minutes before Kubernetes begins to reschedule the pods onto the remaining active nodes.
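
The delay comes from the default tolerations Kubernetes attaches to every pod for not-ready and unreachable nodes (tolerationSeconds: 300, i.e. 5 minutes). A quick way to observe this (<pod-name> is any management plane pod):

```bash
# The failed node is reported as NotReady
kubectl get nodes

# The default eviction tolerations explain the ~5 minute delay
kubectl get pod <pod-name> -o yaml | grep -B2 tolerationSeconds
```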

While pods are being rescheduled and until services are running again, you may experience some disruption on the management plane if critical services such as Keystone and/or K8sniff were running on the failed node. The workload clusters themselves remain unaffected.

Once Kubernetes has rescheduled all affected services onto the remaining active nodes, the management plane becomes operational again. The duration of this process varies and may exceed 5 minutes.

Make sure the remaining active nodes have enough free CPU, memory, and disk to run the pods from the failed node.
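
For example (kubectl top requires the metrics-server; <node-name> is a placeholder):

```bash
# Current CPU and memory usage per node
kubectl top nodes

# Allocatable capacity vs. currently requested resources
kubectl describe node <node-name> | grep -A 8 'Allocated resources'
```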

Two Nodes Down in a 3-Node Cluster

The management plane will not be accessible because the database services require a minimum of 2 operational nodes to maintain quorum. Even if one or both of the failed nodes are recovered and brought back up later, manual intervention may be required to restore Percona.

Percona Recovery:

In most cases, Percona handles the recovery automatically when cluster quorum is lost, i.e. when two nodes in the cluster fail. This is achieved by setting the autoRecovery feature to true. However, there may be instances where manual intervention is required for a successful recovery.
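
You can verify the setting on the PerconaXtraDBCluster custom resource; the resource name percona-db-pxc-db below is inferred from the pod names and may differ in your deployment:

```bash
# Prints "true" when automatic crash recovery is enabled
kubectl get pxc percona-db-pxc-db -o jsonpath='{.spec.pxc.autoRecovery}'
```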

The Percona pods in the cluster are named percona-db-pxc-db-pxc-<0,1,2>, and each pod contains a container called pxc.
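
To list them (add -n <namespace> if they do not run in the current namespace):

```bash
kubectl get pods | grep percona-db-pxc-db-pxc
```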

If the second node is brought back online but the Percona pods remain stuck in the Terminating state, you can forcibly delete all the Percona pods.

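A sketch of that command, assuming the pod names above (add -n <namespace> if needed):

```bash
# Force-delete the stuck pods; their PVCs are left untouched
kubectl delete pod percona-db-pxc-db-pxc-0 percona-db-pxc-db-pxc-1 percona-db-pxc-db-pxc-2 \
  --grace-period=0 --force
```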

This action will not cause any data loss as it does not affect the Persistent Volume Claims (PVCs) associated with the pods.

The Percona pods should now start up. In some cases, however, the pods complain about a FULL_PXC_CLUSTER_CRASH error after booting and do not recover automatically.

An easy way to check for this is to inspect the logs of the percona-db-pxc pods (replace <0,1,2> with the right number).

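A sketch, using the pod and container names above:

```bash
kubectl logs percona-db-pxc-db-pxc-<0,1,2> -c pxc
```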

Look for a full-cluster-crash error in the log output.

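The exact wording varies by operator version; abridged, it looks something like this:

```
You have the situation of a full PXC cluster crash. In order to restore
your PXC cluster, please check the log from this node to find out if
this node has the most recent data.
```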

This means the Percona cluster has crashed and cannot determine which of its pods holds the latest data. To recover, you need to obtain the sequence number of each Percona pod and identify the pod with the highest sequence number, which indicates the latest data. The sequence number appears in the same log output as the crash report.

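For example, the GALERA saved state section of the log might look like this (the uuid is a placeholder):

```
# GALERA saved state
version: 2.1
uuid:    <cluster-uuid>
seqno:   41366
safe_to_bootstrap: 0
```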

The sequence number for this pod is 41366. Check all the Percona pods and identify the one with the highest sequence number.

Then, trigger the recovery on the pod with the highest sequence number (assumed to be the percona-db-pxc-db-pxc-1 pod in this example).

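A sketch of that command; sending SIGUSR1 to PID 1 (the container entrypoint) is what triggers the restart:

```bash
kubectl exec percona-db-pxc-db-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
```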

This command restarts the mysqld process within that pod, allowing the mysqld cluster to re-form. If everything is successful, all Percona and HAProxy pods return to Running.

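An illustrative kubectl get pods excerpt (READY counts depend on configured sidecars; ages will differ):

```
NAME                          READY   STATUS    RESTARTS   AGE
percona-db-pxc-db-haproxy-0   2/2     Running   0          6m
percona-db-pxc-db-haproxy-1   2/2     Running   0          6m
percona-db-pxc-db-haproxy-2   2/2     Running   0          6m
percona-db-pxc-db-pxc-0       1/1     Running   0          6m
percona-db-pxc-db-pxc-1       1/1     Running   0          6m
percona-db-pxc-db-pxc-2       1/1     Running   0          6m
```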

Wait for all the containers in the percona-db-pxc-db-haproxy-* pods to be up and running (Ready); the recovery is then complete.
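
For example:

```bash
# Watch until every haproxy pod reports all containers Ready
kubectl get pods -w | grep percona-db-pxc-db-haproxy
```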
