How-To Restore From ETCD Backup on a Multi-Master Cluster

Problem

How do you restore from an ETCD backup on a multi-master cluster?

Environment

  • Platform9 Managed Kubernetes - All Versions
  • ETCD
  • Docker or Containerd

Procedure

Restoring from an ETCD backup is a complicated process. If you are not familiar with it, we recommend involving Platform9 Support for assistance before initiating the restore operation.

The process below assumes a cluster consisting of 3 Master nodes.

  1. Access the Master node where you want to perform the restore operation.
  2. Copy the snapshot file to the Master node where you will run the restore operation.
  • The ETCD Backup Storage Path and ETCD Backup Interval parameters can be configured at the time of cluster creation and can also be changed later from the Platform9 UI by editing the cluster details. The default backup storage path is /etc/pf9/etcd-backup.

The latest backup file is stored on any one of the attached Master nodes, so you will need to check the configured storage path on all of the Master nodes to find the latest snapshot.

Alternatively, you can identify the Master node holding the latest backup by running the command below. Its output lists the etcd-backup pods along with the IPs of the Master nodes on which each one saved the etcd snapshot after job completion. Check the configured backup path on the Master node with the most recent etcd-backup pod in the Completed state.

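The original command block was not preserved here. A minimal sketch using kubectl, assuming the etcd-backup jobs run in the kube-system namespace (adjust the namespace and pod-name filter for your deployment):

```shell
# List etcd-backup pods together with the node IPs they ran on;
# the kube-system namespace is an assumption.
kubectl get pods -n kube-system -o wide | grep etcd-backup
```

Check the configured backup path on the node whose pod most recently reached the Completed state.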
  3. Scale down the total number of Master nodes to 1 by detaching the Master nodes from the cluster using the Platform9 UI. You can perform this action by selecting the cluster in the "Infrastructure" tab as shown in the image below.
  • Verify the etcd member count once the cluster has been scaled down to 1 Master node. For pf9-kube v1.19 and below (etcd v3.3.22), run the command on the available Master node to get the etcd cluster health.
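The original command block is missing; a sketch for etcd v3.3, assuming the etcd container is named "etcd" and etcdctl is on the container's PATH (both assumptions — verify for your deployment):

```shell
# etcd v3.3 (pf9-kube v1.19 and below): check cluster health from
# inside the etcd container. Container name "etcd" is an assumption.
docker exec etcd etcdctl cluster-health
```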
  • For pf9-kube v1.20 and above (etcd v3.4.14), run the command on the available Master node to get etcd member and endpoint information.
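The original command block is missing; a sketch for etcd v3.4 on either runtime. The container name "etcd" and the absence of explicit TLS flags are assumptions — add --cacert/--cert/--key if your etcd requires client certificates:

```shell
# etcd v3.4 (pf9-kube v1.20 and above): list members and endpoint status.
# Docker runtime:
docker exec etcd etcdctl member list -w table
docker exec etcd etcdctl endpoint status -w table

# Containerd runtime (via crictl):
crictl exec "$(crictl ps --name etcd -q)" etcdctl member list -w table
```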
  4. Clear any stale Master nodes from the kubectl perspective using the command below, if required.
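The original command block is missing; a minimal sketch, where <stale-node-name> is a placeholder for each detached Master still listed by kubectl:

```shell
# Identify nodes that no longer belong to the cluster, then remove them.
kubectl get nodes
kubectl delete node <stale-node-name>
```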
  5. Stop the PMK stack on the Master node where the restore will run. The command below stops all Kubernetes-related services on the attached Master node.
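The original command block is missing. A sketch assuming the PMK services are managed by systemd; the unit names pf9-hostagent and pf9-nodeletd are taken from this article — verify the exact service names and stop procedure for your pf9-kube release:

```shell
# Stop the host agent first so it does not restart the stack,
# then stop the node services. Unit names are assumptions.
sudo systemctl stop pf9-hostagent
sudo systemctl stop pf9-nodeletd
```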
  6. Move the etcd directory to another path on the Master node.
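The original command block is missing. A sketch in which the etcd data directory path /var/opt/pf9/kube/etcd is an assumption — confirm the path used by your deployment before moving it:

```shell
# Move (rather than delete) the old etcd directory so it can be
# recovered if the restore fails. The source path is an assumption.
sudo mv /var/opt/pf9/kube/etcd "/var/opt/pf9/kube/etcd.bak-$(date +%F)"
```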
  7. Restore the etcd snapshot using the command below.
  • NODE_IP - IP of the attached node. This can be found in the "kubectl get nodes" output.
  • NODE_UUID - corresponds to the value of host_id in the file /etc/pf9/host_id.conf on the node.
  • </path/to/backupfilename> - file present at the configured ETCD Backup Storage Path. Refer to step 2 for more information.
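The original command block is missing. A hedged sketch of a standard etcd v3 single-member snapshot restore; the peer port 2380, the https scheme, and the data-dir path are assumptions — match them to your cluster's etcd configuration:

```shell
# Restore the snapshot into a fresh data directory for a one-member
# cluster; <NODE_UUID>, <NODE_IP>, and the backup path are placeholders.
ETCDCTL_API=3 etcdctl snapshot restore </path/to/backupfilename> \
  --name "<NODE_UUID>" \
  --initial-cluster "<NODE_UUID>=https://<NODE_IP>:2380" \
  --initial-advertise-peer-urls "https://<NODE_IP>:2380" \
  --data-dir /var/opt/pf9/kube/etcd/data
```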
  8. Start the PMK stack on the Master node.
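The original command block is missing. A sketch assuming the same systemd unit names as in the stop step, with pf9-hostagent left stopped until the later step that starts it explicitly — verify the service names for your release:

```shell
# Bring the Kubernetes stack back up; the unit name is an assumption.
sudo systemctl start pf9-nodeletd
```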
  9. Verify the etcd member count once the stack is up and running. Note: there should be just one member in the etcd cluster.
  • For pf9-kube v1.19 and below (etcd v3.3.22), run the command on the available Master node to get the etcd cluster health.
  • For pf9-kube v1.20 and above, run the command on the available Master node to get etcd member and endpoint information.
  10. From the kubectl perspective, the previously detached Master nodes will show in a NotReady state. They may initially show as "Ready", but eventually they should reflect as "NotReady". To delete the stale Master nodes, run the command below.
  11. Once the cluster is in a healthy state, start the pf9-hostagent service on the Master node. This will eventually start the pf9-nodeletd service.
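The original command block is missing; a minimal sketch, assuming pf9-hostagent is a systemd unit as the article's wording suggests:

```shell
# Start the Platform9 host agent; it will in turn bring up pf9-nodeletd.
sudo systemctl start pf9-hostagent
sudo systemctl status pf9-hostagent
```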
  12. Make sure there is no etcd data directory on the nodes that will be used to scale the Master nodes back up.

This is required if you plan to reuse the same Master nodes that were detached from the cluster in step 3.

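The original command block is missing. A sketch in which the etcd data directory path is an assumption — confirm it for your deployment before deleting anything:

```shell
# Run on each node being re-attached as a Master: remove any leftover
# etcd data so the node joins as a fresh member. Path is an assumption.
sudo rm -rf /var/opt/pf9/kube/etcd
```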
  13. Scale the Master nodes back up from the Platform9 UI. You can perform this action by selecting the cluster in the "Infrastructure" tab as shown in the image below.