How to Restore Cluster from ETCD Backup on New Master Nodes in Case of Loss of All Existing Master N

Problem

How do you restore an etcd backup on new master nodes when all existing master nodes in the cluster have been lost?

Environment

  • Platform9 Managed Kubernetes - All Versions

  • Docker or Containerd

  • ETCD

Procedure

  1. Use Force Remove in the Management Plane UI to deauthorize and remove the old master nodes from the Management Plane. Note: For PMK v5.3 and below, contact Platform9 Support to Force Remove the nodes from the Management Plane.

  2. Create new master nodes with the same IP addresses and hostnames as the removed master nodes.

  3. Onboard the nodes to the Management Plane, either with the Platform9 CLI or by downloading and installing the Platform9 HostAgent manually.

  4. Once the nodes come online, first add the node that replaces the previously active master (leader) to the cluster using the Management Plane API.

IMPORTANT

Get TOKEN and PROJECT_ID by following the steps in Keystone Identity. NODE_UUID and CLUSTER_ID can be found in the Management Plane UI by enabling the UUID column.
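
The API request for this step does not appear in the text, so the following is only a minimal sketch. It assumes the Qbert attach endpoint (`/qbert/v3/<PROJECT_ID>/clusters/<CLUSTER_ID>/attach`) and an `isMaster` field in the payload; confirm both against the Platform9 API reference before use. `DU_FQDN` (the Management Plane hostname) and the `attach_master` helper are hypothetical names. By default the helper only prints the request; set `DRY_RUN=0` to send it.

```shell
#!/usr/bin/env bash
# Sketch: attach a node to the cluster as a master.
# ASSUMPTIONS (not confirmed by this article): endpoint path and JSON payload shape.

attach_master() {
  local du_fqdn="$1" token="$2" project_id="$3" cluster_id="$4" node_uuid="$5"
  local url="https://${du_fqdn}/qbert/v3/${project_id}/clusters/${cluster_id}/attach"
  local body="[{\"uuid\": \"${node_uuid}\", \"isMaster\": true}]"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: show what would be sent, without contacting the Management Plane.
    printf 'POST %s\n%s\n' "$url" "$body"
  else
    curl --fail --silent --show-error --request POST \
      --url "$url" \
      --header "X-Auth-Token: ${token}" \
      --header "Content-Type: application/json" \
      --data "$body"
  fi
}

# Example with placeholders:
#   attach_master "$DU_FQDN" "$TOKEN" "$PROJECT_ID" "$CLUSTER_ID" "$NODE_UUID"
```

Attach only one master at this stage; the remaining masters are added after the restore.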

  5. Once the master node is attached, the cluster appears healthy with a single master node.

  6. Copy the etcd backup and the etcdctl binary to the master node.

  7. Stop the PMK stack and move the current etcd directory to a different path on the master node.
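
Stopping the stack and moving the etcd directory could look roughly like the sketch below. The service names (`pf9-hostagent`, `pf9-nodeletd`), the `nodeletd phases stop` invocation, and the data directory path are assumptions that are not stated in this article; verify them on the node first. The `run` and `stop_stack_and_move_etcd` helpers are hypothetical, and commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# ASSUMPTIONS: service names, nodeletd path, and etcd data directory; verify on your node.
PMK_SERVICES="${PMK_SERVICES:-pf9-hostagent pf9-nodeletd}"
NODELETD_BIN="${NODELETD_BIN:-/opt/pf9/nodelet/nodeletd}"
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/var/opt/pf9/kube/etcd/data}"

# Print the command in dry-run mode (default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

stop_stack_and_move_etcd() {
  local moved_dir="${ETCD_DATA_DIR}.pre-restore"
  # Word splitting of the service list is intentional here.
  run systemctl stop $PMK_SERVICES
  run "$NODELETD_BIN" phases stop
  # Keep the old data directory instead of deleting it, so it can be put back.
  run mv "$ETCD_DATA_DIR" "$moved_dir"
}
```

Moving the directory rather than removing it leaves a way back if the restore fails.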

  8. Restore the etcd database from the backup on the master node.
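
A restore with `etcdctl snapshot restore` might be sketched as follows. The flags used are standard etcdctl v3 options, but the member name, peer port (2380, the etcd default), and data directory are placeholders: they must match what the PMK stack expects for this node, which this article does not spell out. The `run` and `restore_etcd` helpers are hypothetical, and the command is only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Print the command in dry-run mode (default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Restore a snapshot into a fresh data directory as a single-member cluster.
# ASSUMPTION: the peer URL uses https on port 2380.
restore_etcd() {
  local snapshot="$1" member_name="$2" node_ip="$3" data_dir="$4"
  local peer_url="https://${node_ip}:2380"
  run env ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" \
    --name "$member_name" \
    --data-dir "$data_dir" \
    --initial-cluster "${member_name}=${peer_url}" \
    --initial-advertise-peer-urls "$peer_url"
}

# Example with placeholders:
#   restore_etcd /path/to/backup.db "$MEMBER_NAME" "$NODE_IP" "$ETCD_DATA_DIR"
```

The target data directory must not already exist, which is why the current etcd directory is moved away beforehand.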

  9. Once the restore is complete, start the PMK stack.

  10. Check the etcd cluster health after the restore.

  • For clusters running pf9-kube v1.19

  • For clusters running pf9-kube v1.20
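
The version-specific commands are not shown here. As a generic sketch, the health check comes down to `etcdctl endpoint health` and `etcdctl member list` run with the cluster's client certificates. The endpoint URL and certificate paths are placeholders to fill in for your pf9-kube version, and where etcdctl runs (on the host or inside the etcd container) may differ between versions. The `run` and `check_etcd_health` helpers are hypothetical, and commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Print the command in dry-run mode (default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Query endpoint health and list members over TLS.
check_etcd_health() {
  local endpoint="$1" cacert="$2" cert="$3" key="$4"
  local -a tls=(--endpoints "$endpoint" --cacert "$cacert" --cert "$cert" --key "$key")
  run env ETCDCTL_API=3 etcdctl "${tls[@]}" endpoint health
  run env ETCDCTL_API=3 etcdctl "${tls[@]}" member list --write-out table
}

# Example with placeholders:
#   check_etcd_health "$ETCD_ENDPOINT" "$CA_CERT" "$CLIENT_CERT" "$CLIENT_KEY"
```

Right after the restore, the member list should contain only the single restored master.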

  11. Delete the stale master nodes that are in the NotReady state using kubectl.
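
One way to find and delete the stale nodes is sketched below, using `kubectl get nodes --no-headers` and `kubectl delete node`. The `filter_not_ready`, `delete_stale_nodes`, and `run` helpers are hypothetical. The filter matches every node whose status starts with NotReady, workers included, so review the dry-run output before setting `DRY_RUN=0`.

```shell
#!/usr/bin/env bash
# Print the command in dry-run mode (default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column starts with NotReady.
filter_not_ready() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

delete_stale_nodes() {
  kubectl get nodes --no-headers | filter_not_ready | while read -r node; do
    run kubectl delete node "$node"
  done
}
```

Deleting a node object only removes it from the Kubernetes API; it does not touch the etcd data that was just restored.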

  12. Scale up the master nodes one at a time from the Management Plane UI, verifying the etcd cluster health as in Step 10 after each addition.

Note

Restoring from an etcd backup is a complicated process. If you are not familiar with it, we recommend involving Platform9 Support before initiating the restore operation.
