# How to Restore Cluster from ETCD Backup on New Master Nodes in Case of Loss of All Existing Master N

## Problem

How do you restore an etcd backup onto new master nodes when all existing master nodes in the cluster have been lost?

## Environment

* Platform9 Managed Kubernetes - All Versions
* Docker or Containerd
* ETCD

## Procedure

{% hint style="warning" %}
**IMPORTANT**

Make sure you have access to the etcd backup file. On a working cluster, the default backup storage path is */etc/pf9/etcd-backup*. The worker nodes must also be unaffected and in a running state.
{% endhint %}
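
Before starting, it can help to confirm that the backup file is readable and not empty on the machine where it is stored. The following is a minimal sketch only: the function name `check_backup_file` and the example file name are hypothetical and are not part of the Platform9 tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given etcd backup file exists,
# is readable, and is not empty.
check_backup_file() {
    local backup_file="$1"
    if [ ! -r "$backup_file" ]; then
        echo "backup file not readable: $backup_file" >&2
        return 1
    fi
    if [ ! -s "$backup_file" ]; then
        echo "backup file is empty: $backup_file" >&2
        return 1
    fi
    echo "backup file looks usable: $backup_file"
}

# Example (the file name is an assumption; substitute your own backup file):
# check_backup_file /etc/pf9/etcd-backup/etcd_backup.db
```

This check only confirms that a file is present. It does not validate the snapshot contents.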

1. Use *Force Remove* in the Management Plane UI to deauthorize and remove the old master nodes from the Management Plane. **Note**: For PMK v5.3 and below, contact Platform9 Support to *Force Remove* the nodes from the Management Plane.
2. Create new master nodes with the **same IP addresses and hostnames** as the removed master nodes.
3. Onboard the nodes to the Management Plane, either with the [Platform9 CLI](https://platform9.com/learn/platform9-cli) or by downloading and installing the Platform9 HostAgent manually.
4. Once the nodes come online, first attach the node that replaces the previously active master (leader) to the cluster using the API call below.

{% hint style="info" %}
**IMPORTANT**

Get the *TOKEN* and *PROJECT\_ID* by following the steps in [Keystone Identity](https://platform9.com/docs/kubernetes/keystone-identity-api). The *NODE\_UUID* and *CLUSTER\_ID* can be found in the Management Plane UI by enabling the UUID column.
{% endhint %}

{% tabs %}
{% tab title="API" %}

```python
# curl -k -X POST -H "X-Auth-Token: $TOKEN" -H "Content-type: application/json" -d '[
 {
     "uuid": "<NODE_UUID>",
     "isMaster": 1
 }]' "https://<DU-FQDN>/qbert/v3/<PROJECT_ID>/clusters/<CLUSTER_ID>/attach"
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Example" %}

```python
# TOKEN=gAAAAABhnQ4ykxkfDvEd5QyhZaqTvXjiysB2xt9hY7QgjpSi_mWdCUMEgbS0w-5DAOVLRTtZeHRrCql9-aWfOFidF-1mjx0yUeIuKBel2wWQ91VsOoPK4Knz9KU-59u73vqhWaD_RibU0-eReaCx7Et1OsJ6aybFcL9ug-Kp_BHTpOxrlqGIeCI

# curl -k -X POST -H "X-Auth-Token: $TOKEN" -H "Content-type: application/json" -d '[
{
    "uuid": "992e976b-0527-4359-bc28-00c697b368eb",
    "isMaster": 1
}]' "https://ajohn.platform9.net/qbert/v3/9dc12eee0b794b9a8e8f8dc90a88f7ec/clusters/601a4510-262d-4551-bf05-395667708957/attach"
```

{% endtab %}
{% endtabs %}
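
The attach request above is assembled from four values: the DU FQDN, the project ID, the cluster ID, and the node UUID. As a sketch of how those pieces fit together, the hypothetical helpers below (`attach_url` and `attach_payload`, names chosen here for illustration) build the URL and JSON body as strings so they can be inspected before anything is sent.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that assemble the pieces of the attach request.
# They only build strings; nothing is sent to the Management Plane.

# Build the Qbert attach URL from the DU FQDN, project ID, and cluster ID.
attach_url() {
    local du_fqdn="$1" project_id="$2" cluster_id="$3"
    printf 'https://%s/qbert/v3/%s/clusters/%s/attach\n' \
        "$du_fqdn" "$project_id" "$cluster_id"
}

# Build the JSON body that attaches one node as a master.
attach_payload() {
    local node_uuid="$1"
    printf '[{"uuid": "%s", "isMaster": 1}]\n' "$node_uuid"
}

# Example usage with curl, mirroring the documented command
# (DU_FQDN, PROJECT_ID, CLUSTER_ID, and NODE_UUID are assumed to be set):
# curl -k -X POST -H "X-Auth-Token: $TOKEN" -H "Content-type: application/json" \
#     -d "$(attach_payload "$NODE_UUID")" \
#     "$(attach_url "$DU_FQDN" "$PROJECT_ID" "$CLUSTER_ID")"
```

Printing the URL and payload first makes it easier to spot a swapped project or cluster ID before the request is made.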

5. Once the master node is attached, the cluster should report as healthy with a single master node.

{% tabs %}
{% tab title="Example v1.19" %}

```python
# kubectl get nodes
NAME             STATUS   ROLES    AGE     VERSION
10.128.145.204   Ready    master   4m36s   v1.19.6
10.128.145.248   Ready    worker   4m15s   v1.19.6
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Example v1.20" %}

```python
# kubectl get nodes
NAME             STATUS   ROLES    AGE     VERSION
10.128.146.162   Ready    master   6m10s   v1.20.11
```

{% endtab %}
{% endtabs %}

6. Copy the etcd backup file to the master node, and copy the *etcdctl* binary onto the host.

{% tabs %}
{% tab title="Docker" %}

```python
# docker cp etcd:/usr/local/bin/etcdctl /opt/pf9/pf9-kube/bin
# export PATH=$PATH:/opt/pf9/pf9-kube/bin
```

{% endtab %}
{% endtabs %}

7. Stop the PMK stack and move the contents of the current etcd directory to another path on the master node.

{% tabs %}
{% tab title="Command" %}

```python
# systemctl stop pf9-{hostagent,nodeletd}
# /opt/pf9/nodelet/nodeletd phases stop
# mkdir /tmp/etcd_dir_backup
# mv /var/opt/pf9/kube/etcd/* /tmp/etcd_dir_backup
```

{% endtab %}
{% endtabs %}
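
Note that a shell glob such as `/var/opt/pf9/kube/etcd/*` does not match hidden entries by default, and a fixed destination directory can collide with an earlier attempt. The sketch below is one possible variation, not the documented procedure: the function name `set_aside_etcd_dir` and the timestamped directory name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move every top-level entry (including hidden ones)
# out of the etcd directory into a new timestamped directory, then print
# the path of that directory.
set_aside_etcd_dir() {
    local etcd_dir="$1" backup_root="$2" dest
    [ -d "$etcd_dir" ] || { echo "no such directory: $etcd_dir" >&2; return 1; }
    dest="$backup_root/etcd_dir_backup_$(date +%Y%m%d%H%M%S)"
    mkdir -p "$dest" || return 1
    # -mindepth 1 skips the directory itself; -maxdepth 1 stays at top level.
    find "$etcd_dir" -mindepth 1 -maxdepth 1 -exec mv {} "$dest"/ \;
    # Fail if anything was left behind in the etcd directory.
    [ -z "$(ls -A "$etcd_dir")" ] || { echo "entries remain in $etcd_dir" >&2; return 1; }
    echo "$dest"
}

# Example, to be run only after the PMK stack has been stopped:
# set_aside_etcd_dir /var/opt/pf9/kube/etcd /tmp
```

Keeping the old directory contents intact gives a fallback if the restore needs to be retried.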

8. Restore the etcd database from the backup on the master node.

{% tabs %}
{% tab title="Docker" %}

```python
# ETCDCTL_API=3 etcdctl snapshot restore <ETCD_BACKUP.db> --data-dir /var/opt/pf9/kube/etcd/data --initial-advertise-peer-urls="https://<MASTER_IP>:2380" --initial-cluster="<MASTER_NODE_UUID>=https://<MASTER_IP>:2380" --name="<MASTER_NODE_UUID>"
```

{% endtab %}
{% tab language="bash" title="Containerd" %}

```bash
# /opt/pf9/pf9-kube/bin/etcdctl snapshot restore /path/to/snapshot_file --data-dir /var/opt/pf9/kube/etcd/data --initial-advertise-peer-urls="https://<MASTER_IP>:2380" --initial-cluster="<MASTER_NODE_UUID>=https://<MASTER_IP>:2380" --name="<MASTER_NODE_UUID>"
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Example" %}

```python
# ETCDCTL_API=3 etcdctl snapshot restore /home/centos/etcd_backup.db --data-dir /var/opt/pf9/kube/etcd/data --initial-advertise-peer-urls="https://10.128.145.204:2380" --initial-cluster="992e976b-0527-4359-bc28-00c697b368eb=https://10.128.145.204:2380" --name="992e976b-0527-4359-bc28-00c697b368eb"

2021-11-23 16:08:28.229042 I | etcdserver/membership: added member 343efe1ee1440e81 [https://10.128.145.204:2380] to cluster 90c381d304967556
```

{% endtab %}
{% endtabs %}
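
The restore command repeats the master IP and node UUID several times, so a typo in one place is easy to miss. As a sketch, the hypothetical helper `build_restore_cmd` below (a name introduced here for illustration) prints the Docker-variant command from three inputs so it can be reviewed before being run. It executes nothing itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the single-master restore command so it can
# be reviewed before being run. Nothing is executed here.
build_restore_cmd() {
    local backup_file="$1" master_ip="$2" master_uuid="$3"
    # The peer URL uses port 2380, as in the documented command.
    local peer_url="https://${master_ip}:2380"
    printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --data-dir /var/opt/pf9/kube/etcd/data --initial-advertise-peer-urls="%s" --initial-cluster="%s=%s" --name="%s"\n' \
        "$backup_file" "$peer_url" "$master_uuid" "$peer_url" "$master_uuid"
}

# Example:
# build_restore_cmd /home/centos/etcd_backup.db 10.128.145.204 992e976b-0527-4359-bc28-00c697b368eb
```

Deriving every occurrence from the same two variables keeps `--initial-cluster`, `--initial-advertise-peer-urls`, and `--name` consistent with each other.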

9. Once the restore is complete, start the PMK stack.

{% tabs %}
{% tab title="Command" %}

```python
# systemctl start pf9-hostagent
```

{% endtab %}
{% endtabs %}

10. Check the etcd cluster health after the restore.

* For clusters running **pf9-kube v1.19**

{% tabs %}
{% tab title="Docker" %}

```python
# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key cluster-health

# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key member list
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Sample" %}

```python
# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key cluster-health
member 343efe1ee1440e81 is healthy: got healthy result from https://10.128.145.204:4001
cluster is healthy

# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key member list
343efe1ee1440e81: name=992e976b-0527-4359-bc28-00c697b368eb peerURLs=https://10.128.145.204:2380 clientURLs=https://10.128.145.204:4001 isLeader=true
```

{% endtab %}
{% endtabs %}

* For clusters running **pf9-kube v1.20**

{% tabs %}
{% tab title="Docker" %}

```python
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl member list
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl endpoint health --write-out=table
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl endpoint status --cluster --write-out=table --endpoints=http://127.0.0.1:2379 --cacert="/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt" --cert="/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt" --key="/etc/pf9/kube.d/certs/etcdctl/etcd/request.key"
```

{% endtab %}
{% tab language="bash" title="Containerd" %}

```bash
# /opt/pf9/pf9-kube/bin/etcdctl --cacert=/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert=/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key=/etc/pf9/kube.d/certs/etcdctl/etcd/request.key member list -w=table

# /opt/pf9/pf9-kube/bin/etcdctl --cacert=/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert=/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key=/etc/pf9/kube.d/certs/etcdctl/etcd/request.key endpoint health --cluster -w=table
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Example" %}

```python
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl member list 
b60fda6661d77461 started  70ee9439-06e8-4ad3-8ae2-b2fd5360cbd6  https://10.128.146.162:2380 https://10.128.146.162:4001       false 


# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl endpoint health --write-out=table 
+-----------------------------+--------+-------------+-------+
|          ENDPOINT           | HEALTH |    TOOK     | ERROR |
+-----------------------------+--------+-------------+-------+
| https://10.128.146.162:4001 |   true | 22.107927ms |       |
+-----------------------------+--------+-------------+-------+
```

{% endtab %}
{% endtabs %}
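
When this check is repeated after each master is added, reading the table by eye becomes tedious. The sketch below shows one way to scan the `endpoint health --write-out=table` output for endpoints that are not healthy. It assumes the table layout matches the sample above (ENDPOINT in the first column, HEALTH in the second), and the function name `check_endpoint_health_table` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `etcdctl endpoint health --write-out=table`
# output on stdin. Prints each endpoint whose HEALTH column is not "true"
# and returns non-zero if any such endpoint exists or no rows were seen.
check_endpoint_health_table() {
    awk -F'|' '
        # Rows start with "|"; border lines start with "+" and are skipped.
        /^\|/ {
            gsub(/ /, "", $2); gsub(/ /, "", $3)
            if ($2 == "ENDPOINT") next      # skip the header row
            rows++
            if ($3 != "true") { print $2; bad++ }
        }
        END { exit (bad > 0 || rows == 0) ? 1 : 0 }
    '
}

# Example:
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl endpoint health --write-out=table \
#     | check_endpoint_health_table || echo "etcd is not fully healthy"
```

Treating an empty table as a failure avoids reporting success when the command itself produced no endpoint rows.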

11. Delete the stale master nodes that are in the NotReady state using *kubectl*.

{% tabs %}
{% tab title="Example" %}

```python
# kubectl get nodes
NAME             STATUS     ROLES    AGE    VERSION
10.128.144.242   NotReady   master   136m   v1.19.6
10.128.145.204   Ready      master   136m   v1.19.6
10.128.145.248   Ready      worker   136m   v1.19.6
10.128.145.253   NotReady   master   136m   v1.19.6

# kubectl delete node 10.128.144.242 10.128.145.253
node "10.128.144.242" deleted
node "10.128.145.253" deleted
```

{% endtab %}
{% endtabs %}
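
On a larger cluster it may be easier to extract the stale masters from the `kubectl get nodes` output than to pick them out by hand. The following sketch assumes the column layout shown above (NAME, STATUS, ROLES) and a ROLES value of exactly `master`. The function name `list_notready_masters` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the names of nodes whose STATUS is NotReady and whose ROLES is master.
list_notready_masters() {
    # NR > 1 skips the header line.
    awk 'NR > 1 && $2 == "NotReady" && $3 == "master" { print $1 }'
}

# Example: review the list first, then delete the listed nodes by name.
# kubectl get nodes | list_notready_masters
```

The helper only prints names. Keeping the delete as a separate manual step leaves room to confirm that no healthy node is removed by mistake.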

12. Scale up the master nodes one at a time from the Management Plane UI, verifying the etcd cluster health as in Step 10 after each addition.

{% hint style="info" %}
**Note**

Restoring from an etcd backup is a complicated process. If you are not familiar with it, we recommend involving Platform9 Support before initiating the restore operation.
{% endhint %}

