Restoring ETCD Backup to Recover Cluster From Loss of Quorum

Problem

The ETCD cluster lost quorum because more than half of the master nodes went offline.

Environment

  • Platform9 Managed Kubernetes - All Versions

  • ETCD

  • Docker or Containerd

Cause

Loss of quorum can result from master nodes going offline or from a loss of connectivity between the master nodes, either of which leaves the cluster in an unhealthy state. In that state, etcdctl commands fail with errors such as the following:

# /opt/pf9/pf9-kube/bin/etcdctl member list
{"level":"warn","ts":"2023-01-14T10:06:19.730Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-x40e9b2f-bdd3-4ac5-8b2a-4026a9df34cd/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
...
...
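The quorum arithmetic behind this failure can be sketched in a few lines of shell. Etcd needs a majority of its members, floor(N/2) + 1, to be healthy before it will serve requests. The function names below are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: quorum_size and has_quorum are hypothetical helper names.

# quorum_size N: minimum number of healthy members an N-member etcd cluster needs.
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# has_quorum TOTAL HEALTHY: succeeds when HEALTHY members still form a majority.
has_quorum() {
  local total=$1 healthy=$2
  [ "$healthy" -ge "$(quorum_size "$total")" ]
}

# Show how many member failures common cluster sizes can tolerate.
for n in 1 3 5; do
  q=$(quorum_size "$n")
  echo "members=$n quorum=$q tolerated_failures=$(( n - q ))"
done
```

A three-master cluster therefore survives the loss of one master but not two, which is why the procedure below first reduces the cluster to a single member (quorum of 1) before scaling back up.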

Resolution

Restoring etcd is an intricate procedure, but the idea is to bring the master count down to 1 and restore etcd from a backup using etcdctl. Once that is done, we might need to make some manual changes so that it starts up as a new etcd cluster. Once things are back up, we increase the master count one node at a time by attaching nodes.

We encourage our enterprise users to create multi-master setups for HA and redundancy.

Follow the steps below to bring the cluster back to a healthy state.

  1. SSH to one of the master nodes and, as the root user, run the commands below to set the environment variables.

  2. Check whether a recent ETCD backup is available at the default path /etc/pf9/etcd-backup. The path may differ if it was changed. If no backup is available, run the command below to capture an etcd DB snapshot on the same node.

  3. Once the snapshot is captured successfully, or a recent backup is found on one of the master nodes as described in the previous step, stop the PMK stack on ALL nodes that are part of the cluster.

  4. Move the etcd directory to another path on ALL nodes that are part of the cluster.

  5. Restore the ETCD state by running the command below on the master node where the snapshot was created or where the backup file is present.

    1. NODE_IP - IP of the attached node. It can be found in the "kubectl get nodes" output.

    2. NODE_UUID - The value of host_id found in the file /etc/pf9/host_id.conf on the node.

    3. </path/to/backupfilename> - File present at the configured ETCD backup storage path. Refer to step 2 for more information.

  6. Start the PMK stack on the same node.

  7. Once the PMK stack is up, check the ETCD member list. There should be just one member in the cluster.

  8. Detach the other master nodes from the cluster. A token can be generated by referring to the Keystone Identity API.

  • If the master nodes are hard offline or unreachable, deauthorize those nodes as well.

  9. From kubectl's perspective, the detached master nodes will appear in the "NotReady" state. Delete these nodes from the cluster.

  10. Then start the pf9-hostagent service on the active master node. This also takes care of starting the pf9-nodeletd service.

  • At this point, the cluster should be back up and running with a single master node. Verify that this is the case.

  11. Scale the detached master nodes back up from the Platform9 UI. You can perform this action by selecting the cluster in the "Infrastructure" tab, as shown in the image below.

Once the nodes are scaled back up, the PMK stack should be running on them, which ensures that the ETCD members sync with each other.
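The backup lookup, restore, and member-count checks from the procedure above can be sketched in shell. This is a minimal sketch under stated assumptions, not the documented Platform9 commands: the helper names, the BACKUP_DIR and ETCDCTL_BIN variables, and the DATA_DIR argument are hypothetical, and the peer URL assumes TLS on etcd's default peer port 2380. The flags shown are standard etcdctl v3 `snapshot restore` options. The script only prints the restore command so it can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and placeholder values are hypothetical.

# latest_backup DIR: print the most recently modified regular file in DIR.
# Returns non-zero when DIR contains no regular files.
latest_backup() {
  local dir=$1 newest="" f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# build_restore_cmd BACKUP NODE_IP NODE_UUID DATA_DIR: print (do not run) an
# etcdctl command that restores BACKUP as a new single-member cluster.
build_restore_cmd() {
  local backup=$1 node_ip=$2 node_uuid=$3 data_dir=$4
  # Assumption: peers use TLS on the default etcd peer port 2380.
  local peer_url="https://${node_ip}:2380"
  printf '%s snapshot restore %s --name=%s --initial-cluster=%s=%s --initial-advertise-peer-urls=%s --data-dir=%s\n' \
    "${ETCDCTL_BIN:-/opt/pf9/pf9-kube/bin/etcdctl}" \
    "$backup" "$node_uuid" "$node_uuid" "$peer_url" "$peer_url" "$data_dir"
}

# member_count: count non-empty lines read from stdin. The default
# `etcdctl member list` output prints one member per line.
member_count() {
  grep -c . || true
}

# Dry run: print the restore command for the newest backup, if one exists.
backup_dir=${BACKUP_DIR:-/etc/pf9/etcd-backup}
if backup=$(latest_backup "$backup_dir"); then
  build_restore_cmd "$backup" "<NODE_IP>" "<NODE_UUID>" "<DATA_DIR>"
else
  echo "no backup found in $backup_dir" >&2
fi
```

After the PMK stack is started again on the restored node, piping the member list into `member_count` should report 1 before the other masters are reattached.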
