Restoring ETCD Backup to Recover Cluster From Loss of Quorum

Problem

The etcd cluster has lost quorum because too few master nodes remain online to form a majority (for example, two of three master nodes went offline).

Environment

  • Platform9 Managed Kubernetes - All Versions
  • ETCD
  • Docker or Containerd

Cause

Loss of quorum can result from master nodes going offline or from loss of connectivity between the master nodes, either of which leaves the cluster in an unhealthy state.
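
To make the quorum condition concrete: etcd needs a majority of its members, floor(n/2) + 1, to be reachable before it accepts writes. The sketch below computes the quorum size and the number of member failures a cluster can tolerate. The function names are illustrative and not part of any Platform9 tooling.

```shell
# Illustrative helpers (hypothetical names): majority arithmetic for an etcd cluster.

# Members that must be reachable for the cluster to have quorum.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail before quorum is lost.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Usage:
#   quorum_size 3       # prints 2
#   fault_tolerance 3   # prints 1
```

A three-master cluster therefore survives the loss of one master; losing two leaves a single member, which is below the quorum of two.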

Example Error

Resolution

Restoring etcd is an intricate procedure. The idea is to bring the master count down to one and restore etcd from a backup using etcdctl. Some manual changes may then be needed so that etcd starts up as a new cluster. Once the cluster is back up, the master count is increased one node at a time by attaching nodes.

We encourage our enterprise users to create multi-master setups for HA and redundancy.

The product currently has no API or process for recovering or replacing a single master node within a cluster.

Restoring a multi-master cluster from an etcd backup is a complicated process. The preferred approach is to detach the affected master node via the UI or an API call and then attach a new master node.

We recommend creating a support request with Platform9 for assistance, if needed, before initiating any backup-restore operation for a quorum-loss scenario in a multi-master cluster.

Follow the steps below to bring the cluster back to a healthy state.

  1. SSH to one of the master nodes and run the commands below as the root user to set the environment variables (for the Docker runtime).
  2. Check whether a recent etcd backup is available at the default path /etc/pf9/etcd-backup (the path may differ if it was changed). If no backup is available, run the command below for your container runtime (Docker or containerd) to capture an etcd DB snapshot on the same node.
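
As a rough sketch of the check in this step, the helper below prints the newest file in the backup directory. The function name is hypothetical, and the default path is the one mentioned above. The commented `etcdctl snapshot save` line assumes an `etcdctl` v3 binary is reachable from the host, and every endpoint and certificate value in it is a placeholder; because PMK runs etcd under Docker or containerd, the actual invocation on a node will differ.

```shell
# Hypothetical helper: print the most recently modified file in a backup directory.
latest_backup() {
  local dir="${1:-/etc/pf9/etcd-backup}" newest="" f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  # Returns non-zero (and prints nothing) when the directory holds no files.
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Usage:
#   latest_backup                       # newest file under /etc/pf9/etcd-backup
#   latest_backup /custom/backup/path   # if the backup path was changed
#
# If nothing is found, a snapshot can be taken with etcdctl v3 (all values are placeholders):
#   ETCDCTL_API=3 etcdctl --endpoints=<ENDPOINT> --cacert=<CA> --cert=<CERT> --key=<KEY> \
#     snapshot save </path/to/backupfilename>
```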
  3. Once the snapshot has been captured successfully, or a recent backup has been found on one of the master nodes as described in the previous step, stop the PMK stack on ALL nodes that are part of the cluster.
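
A hedged sketch of this step: the service names `pf9-hostagent` and `pf9-nodeletd` come from this article, while the `nodeletd phases stop` path is an assumption about a typical PMK install and should be verified on the node. The `run` wrapper is a hypothetical helper that prints each command when `DRY_RUN=1`, so the sequence can be reviewed before it is executed on every node.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Stop the PMK stack on the local node.
stop_pmk_stack() {
  # Service names as mentioned in this article.
  run systemctl stop pf9-hostagent pf9-nodeletd
  # Assumption: nodeletd is installed at this path and supports "phases stop".
  run /opt/pf9/nodelet/nodeletd phases stop
}

# Usage (as root, on every node in the cluster):
#   DRY_RUN=1 stop_pmk_stack   # review the commands
#   stop_pmk_stack             # execute them
```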
  4. Move the etcd directory to a different path on ALL nodes that are part of the cluster.
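
One way to move a directory aside without overwriting an earlier copy is to append a timestamp to its name. The sketch below is generic: `move_aside` is a hypothetical helper, and the etcd directory is passed as an argument because its location depends on the installation.

```shell
# Hypothetical helper: rename a directory to <dir>.bak-<UTC timestamp> and print the new path.
move_aside() {
  local src="${1%/}" dst
  [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
  dst="${src}.bak-$(date -u +%Y%m%dT%H%M%SZ)"
  # Refuse to overwrite a copy made earlier in the same second.
  [ -e "$dst" ] && { echo "target already exists: $dst" >&2; return 1; }
  mv -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Usage (as root, on every node in the cluster):
#   move_aside <ETCD_DIRECTORY>
```

Keeping the old directory rather than deleting it leaves a fallback in case the restore has to be repeated.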
  5. Restore the etcd state by running the command below, for your container runtime (Docker or containerd), on the master node where the snapshot was created or the backup file is present. The command takes the following values:
    1. NODE_IP - IP of the attached node. It can be found in the "kubectl get nodes" output.
    2. NODE_UUID - The value of host_id in the file /etc/pf9/host_id.conf on the node.
    3. </path/to/backupfilename> - The backup file at the configured etcd backup storage path. Refer to step 2 for more information.
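
The three inputs listed above map naturally onto the flags of `etcdctl snapshot restore`. The sketch below is an illustration under stated assumptions rather than the exact command used by PMK: the helper names are hypothetical, `host_id.conf` is assumed to contain a `host_id = <value>` line, the peer URL is assumed to be `https://<NODE_IP>:2380` (2380 being etcd's default peer port), and the data directory is a placeholder argument. The second helper only prints the command so that it can be reviewed and adapted for Docker or containerd.

```shell
# Hypothetical helper: read host_id from a "key = value" style file.
node_uuid() {
  awk -F= '/^[ \t]*host_id[ \t]*=/ {
    gsub(/[ \t\r]/, "", $2); print $2; exit
  }' "${1:-/etc/pf9/host_id.conf}"
}

# Hypothetical helper: print (not run) a single-member restore command.
#   $1 = NODE_IP, $2 = NODE_UUID, $3 = backup file, $4 = target data directory
restore_cmd() {
  printf 'ETCDCTL_API=3 etcdctl snapshot restore %s' "$3"
  printf ' --name %s' "$2"
  printf ' --initial-cluster %s=https://%s:2380' "$2" "$1"
  printf ' --initial-advertise-peer-urls https://%s:2380' "$1"
  printf ' --data-dir %s\n' "$4"
}

# Usage:
#   restore_cmd <NODE_IP> "$(node_uuid)" </path/to/backupfilename> <ETCD_DATA_DIR>
```

Listing only one member in `--initial-cluster` is what makes the restored node start as a new single-member cluster, which matches the goal described at the start of this section.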
  6. Start the PMK stack on the same node.
  7. Once the PMK stack is up, check the etcd member list using the command for your container runtime (Docker or containerd). There should be just one member in the cluster.
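
A quick way to confirm the single-member state is to count the lines printed by `etcdctl member list`, which prints one line per member in its default (simple) output format. `count_members` is a hypothetical helper, and the endpoint and certificate values in the usage comment are placeholders.

```shell
# Hypothetical helper: count non-empty lines on stdin (one line per etcd member).
count_members() {
  # grep exits non-zero when the count is 0; "|| true" keeps the helper's status clean.
  grep -c . || true
}

# Usage (etcdctl v3; all values are placeholders):
#   ETCDCTL_API=3 etcdctl --endpoints=<ENDPOINT> --cacert=<CA> --cert=<CERT> --key=<KEY> \
#     member list | count_members
# After the restore, the expected result is 1.
```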
  8. Detach the other master nodes from the cluster using the API. A token can be generated by referring to the Keystone Identity API.
  • If the master nodes are hard offline or unreachable, deauthorize them as well using the API.
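
Both API calls need a Keystone token. As a minimal sketch based on the standard Keystone v3 password-authentication request (`POST /v3/auth/tokens`, with the token returned in the `X-Subject-Token` response header), the helper below builds the request body. The helper name is hypothetical, the `default` domain ID is an assumption, and the Keystone base URL in the usage comment is a placeholder for your own management plane. Values are inserted verbatim, so they must not contain double quotes or backslashes.

```shell
# Hypothetical helper: print a Keystone v3 password-auth request body.
#   $1 = username, $2 = password, $3 = project (tenant) name
keystone_auth_body() {
  cat <<EOF
{"auth": {
  "identity": {"methods": ["password"],
    "password": {"user": {"name": "$1", "domain": {"id": "default"}, "password": "$2"}}},
  "scope": {"project": {"name": "$3", "domain": {"id": "default"}}}
}}
EOF
}

# Usage (placeholders throughout); prints the token from the response header:
#   keystone_auth_body "$USER_NAME" "$PASSWORD" "$PROJECT" |
#     curl -si -H 'Content-Type: application/json' -d @- <KEYSTONE_URL>/v3/auth/tokens |
#     awk 'tolower($1) == "x-subject-token:" {print $2}' | tr -d '\r'
```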
  9. From kubectl's perspective, the detached master nodes will appear in the "NotReady" state. Delete these nodes from the cluster.
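
The selection of "NotReady" nodes can be sketched as a small filter over `kubectl get nodes --no-headers`, whose second column is the node status. `not_ready_nodes` is a hypothetical helper; it matches every NotReady node, including workers, so review the list before deleting anything.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin and
# print the names of nodes whose STATUS column starts with NotReady.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes     # review the list first
#   kubectl delete node <NODE_NAME>                      # then delete each detached master
```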
  10. Then start the pf9-hostagent service on the active master node. This also takes care of starting the pf9-nodeletd service.
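
Because pf9-nodeletd is started indirectly, it can take a moment to come up. A minimal sketch of this step starts pf9-hostagent with `systemctl` and then polls until pf9-nodeletd reports active. `wait_for` is a hypothetical retry helper, and the attempt count and delay in the usage comment are arbitrary.

```shell
# Hypothetical helper: retry a command until it succeeds.
#   $1 = maximum attempts, $2 = seconds between attempts, rest = command to run
wait_for() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0
    i=$(( i + 1 ))
    # Sleep only if another attempt remains.
    [ "$i" -le "$tries" ] && sleep "$delay"
  done
  return 1
}

# Usage (as root, on the active master node):
#   systemctl start pf9-hostagent
#   wait_for 30 10 systemctl is-active --quiet pf9-nodeletd
```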
  • At this point, the cluster should be back up and running with a single master node. Verify this before continuing.
  11. Scale the detached master nodes back up from the Platform9 UI. You can perform this action by selecting the cluster in the "Infrastructure" tab, as shown in the image below.

Once the nodes are scaled back up, they should have the PMK stack running on them, which ensures that the etcd members sync with each other.


If the other master nodes were also deauthorized, new nodes must be authorized before they can be attached. Reference: Authorize-node
