# How to Safely Shut Down and Restart a Kubernetes Cluster

## Problem

Cluster administrators must follow specific steps to safely shut down and restart a Kubernetes cluster. This process is essential to avoid data loss and ensure a smooth recovery of cluster services.

## Environment

* Platform9 Managed Kubernetes - All Versions

## Answer

## 1. Shutting Down the Cluster

**Step 1:** **Pre-shutdown Checks**

{% hint style="info" %}
**Info**

**Document the Process:** Record the shutdown and restart process, including any issues encountered and how they were resolved. This documentation will be invaluable for future maintenance or recurring issues.
{% endhint %}

* **Check Cluster Status:** Before proceeding with the shutdown, ensure that the cluster is in a healthy state. Use `kubectl get nodes` and `kubectl get pods --all-namespaces` to check for any existing issues.
* **Notify Stakeholders:** Inform all stakeholders about the planned maintenance to avoid unexpected disruptions.
* **Validate Backup Integrity:** Confirm the integrity and accessibility of the etcd backup and any other critical data backups.
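
The checks above can be scripted so the pre-shutdown state is captured for comparison after restart. The sketch below assumes `kubectl` access to the cluster; the function name and output file names are illustrative:

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: capture cluster state before shutdown so it can be compared
# after restart. Assumes kubectl is configured; file names are illustrative.
capture_pre_shutdown_state() {
  kubectl get nodes -o wide | tee pre-shutdown-nodes.txt
  kubectl get pods --all-namespaces -o wide | tee pre-shutdown-pods.txt
}
# Example: capture_pre_shutdown_state
```

{% endtab %}
{% endtabs %}

Keep the resulting files alongside your maintenance notes; they make it easy to confirm that every node and workload returned after the restart.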

**Step 2:** **Backup etcd Data**

The most crucial step is to back up the etcd data. This backup is essential for recovery if issues arise during the cluster restart. Save an etcd snapshot with the following command, then move the backup file to a safe location outside the cluster:

{% tabs %}
{% tab title="Bash" %}

```bash
/opt/pf9/pf9-kube/bin/etcdctl snapshot save </path/to/backup.db> \
  --cacert=/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt \
  --cert=/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt \
  --key=/etc/pf9/kube.d/certs/etcdctl/etcd/request.key \
  --endpoints=https://<NODE_IP>:4001
```

{% endtab %}
{% endtabs %}
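
After saving the snapshot, sanity-check it with `etcdctl snapshot status`, which reports the file's hash, revision, total key count, and size. The sketch below wraps the same bundled `etcdctl` binary; the `ETCDCTL` variable and function name are illustrative:

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: verify the snapshot file right after saving it.
# ETCDCTL defaults to the binary bundled with Platform9 Managed Kubernetes.
ETCDCTL=${ETCDCTL:-/opt/pf9/pf9-kube/bin/etcdctl}
verify_snapshot() {
  # "snapshot status" prints the hash, revision, total keys, and size
  "$ETCDCTL" snapshot status "$1" --write-out=table
}
# Example: verify_snapshot /path/to/backup.db
```

{% endtab %}
{% endtabs %}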

**Step 3:** **Scale Down Workloads**

Gracefully terminate all pods in each Deployment or StatefulSet by scaling it down to zero. Do not scale down critical system pods, such as DNS or CNI pods, as they are needed to maintain cluster functionality.

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl scale deployment <deployment-name> -n <namespace> --replicas=0
kubectl scale statefulset <statefulset-name> -n <namespace> --replicas=0
```

{% endtab %}
{% endtabs %}
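
To make scaling back up easier later, record each workload's replica count before zeroing it. The sketch below does this for Deployments in one namespace; the function name and the `replicas-<namespace>.txt` file format (`name count` per line) are assumptions of this sketch, and StatefulSets can be handled the same way:

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: save each Deployment's replica count to a file, then scale it to zero.
scale_down_deployments() {
  ns=$1
  kubectl get deployments -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' \
    > "replicas-$ns.txt"
  # The replica count stays in the file so it can be restored after restart
  while read -r name replicas; do
    kubectl scale deployment "$name" -n "$ns" --replicas=0
  done < "replicas-$ns.txt"
}
# Example: scale_down_deployments my-namespace
```

{% endtab %}
{% endtabs %}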

**Step 4:** **Cordon Worker Nodes**

Make all nodes unschedulable to prevent new pods from being assigned to them.

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl cordon <node-name>
```

{% endtab %}
{% endtabs %}
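
On larger clusters, cordoning node by node is tedious. A sketch that loops over every node (filter the list, for example with `grep`, if only worker nodes should be cordoned; the function name is illustrative):

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: cordon every node in one pass.
cordon_all_nodes() {
  kubectl get nodes -o name | while read -r node; do
    kubectl cordon "${node#node/}"   # strip the "node/" prefix
  done
}
# Example: cordon_all_nodes
```

{% endtab %}
{% endtabs %}

The same loop with `kubectl uncordon` makes the nodes schedulable again after the restart.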

**Step 5:** **Shut Down Worker Nodes**

Proceed to shut down the worker nodes. This step can be performed in parallel across all nodes.

**Step 6: Stop Services on Master Nodes**

On each master node, stop the following services:

{% tabs %}
{% tab title="Bash" %}

```bash
sudo systemctl stop pf9-hostagent pf9-comms pf9-nodeletd pf9-kubelet
```

{% endtab %}
{% endtabs %}

**Step 7: Stop Nodelet Phases on Master Nodes**

Execute the following command to stop nodelet phases. For the last master node, include the `--force` option:

{% tabs %}
{% tab title="Bash" %}

```bash
/opt/pf9/nodelet/nodeletd phases stop
/opt/pf9/nodelet/nodeletd phases stop --force  # For the last master node
```

{% endtab %}
{% endtabs %}

**Step 8: Shut Down Master Nodes**

Shut down the master nodes one at a time, ensuring stability throughout the process.

***

## 2. Starting the Cluster

**Step 1: Boot Up Nodes**

Start by booting up the master nodes one at a time, followed by the worker nodes. Allow sufficient time for the operating system and container runtime to boot up on each node.

**Step 2: Verify Services**

Confirm that all Platform9 services are running using the command below. All services must be listed as active and running:

{% tabs %}
{% tab title="Bash" %}

```bash
systemctl list-units --all | grep 'pf9'
```

{% endtab %}
{% endtabs %}

**Reference Output:**

{% tabs %}
{% tab title="Bash" %}

```bash
$ systemctl list-units --all | grep 'pf9'
  pf9-comms.service           loaded    active   running   Platform9 Communications Service
  pf9-hostagent.service       loaded    active   running   Platform9 Host Agent Service
  pf9-kubelet.service         loaded    active   running   Platform9 Kubelet Agent
  pf9-node-exporter.service   loaded    active   running   Platform9 node exporter
  pf9-nodeletd.service        loaded    active   running   Platform9 Kubernetes Management Agent Service
  pf9-prometheus.service      loaded    active   running   Platform9 prometheus
  pf9-sidekick.service        loaded    active   running   Platform9 Sidekick Service
```

{% endtab %}
{% endtabs %}
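
Rather than scanning the list by eye, the check can be automated. The sketch below reports any `pf9` unit that is not active (the function name is illustrative):

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: fail fast if any pf9 unit is loaded but not active.
check_pf9_services() {
  failed=0
  for svc in $(systemctl list-units --all --plain --no-legend 'pf9-*' | awk '{print $1}'); do
    systemctl is-active --quiet "$svc" || { echo "$svc is not active" >&2; failed=1; }
  done
  return "$failed"
}
# Example: check_pf9_services && echo "all pf9 services are running"
```

{% endtab %}
{% endtabs %}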

**Step 3: Verify Cluster Health**

Use `kubectl get nodes` to check if all nodes are marked as "Ready." This step confirms the cluster's health and operational status.

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl get nodes
```

{% endtab %}
{% endtabs %}
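
Nodes can take a few minutes to register after boot, so it may help to poll rather than check once. A sketch (the retry count, sleep interval, and function name are illustrative; `Ready,SchedulingDisabled` still counts as Ready at this stage since nodes are uncordoned in the next step):

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: poll until every node's STATUS column starts with "Ready".
wait_for_nodes_ready() {
  attempts=${1:-60}
  for _ in $(seq 1 "$attempts"); do
    not_ready=$(kubectl get nodes --no-headers | awk '$2 !~ /^Ready/' | wc -l)
    if [ "$not_ready" -eq 0 ]; then
      echo "all nodes Ready"
      return 0
    fi
    sleep 10
  done
  echo "timed out waiting for nodes to become Ready" >&2
  return 1
}
# Example: wait_for_nodes_ready
```

{% endtab %}
{% endtabs %}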

**Step 4: Uncordon Nodes**

All nodes will initially be in a `SchedulingDisabled` state. Uncordon each node to make them schedulable again.

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl uncordon <node-name>
```

{% endtab %}
{% endtabs %}

**Step 5: Scale Up Workloads**

Finally, scale up workloads to their original state or as needed.

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl scale deployment <deployment-name> -n <namespace> --replicas=<desired_number_of_replicas>
kubectl scale statefulset <statefulset-name> -n <namespace> --replicas=<desired_number_of_replicas>
```

{% endtab %}
{% endtabs %}
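
If the replica counts were recorded to a file before the shutdown, they can be restored in one pass. The sketch below assumes a file of `name count` lines per namespace; the file format and function name are assumptions of this sketch:

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: restore Deployment replica counts from a "name count" file.
scale_up_deployments() {
  ns=$1
  file=$2
  while read -r name replicas; do
    kubectl scale deployment "$name" -n "$ns" --replicas="$replicas"
  done < "$file"
}
# Example: scale_up_deployments my-namespace replicas-my-namespace.txt
```

{% endtab %}
{% endtabs %}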

**Step 6: Validate Cluster Functionality**

* **System Services:** Verify that all system services, including DNS and networking, are functioning correctly. Try running test pods or services.
* **Application Workloads:** Confirm that all scaled-up deployments and stateful sets are running without errors. Check logs and status with `kubectl logs <pod-name>` and `kubectl describe pod <pod-name>`.
* **External Connectivity:** If the cluster exposes services externally, validate that ingress resources and load balancers are correctly forwarding traffic.
* **Persistent Storage:** Ensure that all persistent volumes are attached and accessible by the relevant pods. This step is crucial for stateful applications.
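
A quick way to exercise cluster DNS end to end is a throwaway test pod. The pod name and image tag below are illustrative:

{% tabs %}
{% tab title="Bash" %}

```bash
# Sketch: DNS smoke test. Resolving kubernetes.default from inside a pod
# confirms that cluster DNS and the pod network are working.
dns_smoke_test() {
  kubectl run dns-check --rm -i --restart=Never \
    --image=busybox:1.36 -- nslookup kubernetes.default
}
# Example: dns_smoke_test
```

{% endtab %}
{% endtabs %}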

**Step 7: Monitoring and Logs (Optional)**

* **Monitor System Logs:** Keep an eye on system logs for any errors or warnings that might indicate problems using commands like `journalctl -u <service-name>`.
* **Cluster Resource Monitoring:** Utilize [monitoring tools](https://platform9.com/docs/kubernetes/in-cluster-monitoring) (like [Grafana](https://platform9.com/docs/kubernetes/grafana)) to observe the cluster's performance and resource usage, ensuring it returns to normal operational parameters.

**Step 8: Restore Cluster from etcd Backup** **(Optional)**

* If the cluster encounters irrecoverable issues or data inconsistencies after the restart, restoring from an etcd backup may be necessary. Treat this as a last resort, as it rolls the cluster state back to the point of the last backup.
* Please follow the official Platform9 documentation for [restoring from an etcd backup](https://platform9.com/kb/kubernetes/restore-etcd-backup-on-multi-master-cluster).

{% hint style="info" %}
**Info**

**Review and Update:** Based on the shutdown and restart experience, review and update the cluster maintenance procedures to include any new insights or corrective actions that could improve future processes.
{% endhint %}

## Additional Information

If you encounter any issues during the shutdown or restart process, please [contact Support](https://support.platform9.com/hc/en-us/requests/new?ticket_form_id=360000924873) for assistance. Providing details of the steps taken and the issues encountered will help expedite the resolution process.
