How to Safely Shutdown and Restart a Kubernetes Cluster
Problem
Cluster administrators must follow specific steps to safely shut down and restart a Kubernetes cluster. This process is essential to avoid data loss and ensure a smooth recovery of cluster services.
Environment
- Platform9 Managed Kubernetes - All Versions
Answer
1. Shutting Down the Cluster
Step 1: Pre-shutdown Checks
- Document the Process: Record the shutdown and restart process, including any issues encountered and how they were resolved. This documentation will be invaluable for future maintenance or recurring issues.
- Check Cluster Status: Before proceeding with the shutdown, ensure that the cluster is in a healthy state. Use kubectl get nodes and kubectl get pods --all-namespaces to check for any existing issues; see the example commands after this list.
- Notify Stakeholders: Inform all stakeholders about the planned maintenance to avoid unexpected disruptions.
- Validate Backup Integrity: Confirm the integrity and accessibility of the etcd backup and any other critical data backups.
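For example, a quick pre-shutdown health sweep might look like the following; these commands are only illustrative and can be adjusted to your environment:
kubectl get nodes -o wide
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -n 20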
Step 2: Backup etcd Data
The first and most crucial step is to back up the etcd data. This backup is essential for recovery in case issues arise during the cluster restart process. Use the following command to save an etcd snapshot, and be sure to move the backup file to a safe location outside the cluster:
/opt/pf9/pf9-kube/bin/etcdctl snapshot save </path/to/backup.db> --cacert=/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert=/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key=/etc/pf9/kube.d/certs/etcdctl/etcd/request.key --endpoints=https://<NODE_IP>:4001
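Before moving the file off the node, you can optionally confirm the snapshot is readable; a minimal check, assuming the bundled etcdctl supports the standard snapshot status subcommand, with the backup path as a placeholder:
/opt/pf9/pf9-kube/bin/etcdctl snapshot status </path/to/backup.db> --write-out=table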
Step 3: Scale Down Workloads
Gracefully terminate all pods within each deployment or StatefulSet by scaling them down to zero. Avoid scaling down critical system pods, such as DNS or CNI, to maintain cluster functionality.
kubectl scale deployment <deployment-name> -n <namespace> --replicas=0
kubectl scale statefulset <statefulset-name> -n <namespace> --replicas=0
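It can also help to record the current replica counts before scaling down, so the original state can be restored later during startup (Step 5 of the start procedure). A minimal sketch; the output file names are only examples:
kubectl get deployments --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' > deployment-replicas.txt
kubectl get statefulsets --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' > statefulset-replicas.txt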
Step 4: Cordon Worker Nodes
Make all nodes unschedulable to prevent new pods from being assigned to them.
kubectl cordon <node-name>
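To cordon every node in one pass, a simple loop such as the following can be used; filter the node list first if you only want to cordon workers:
for node in $(kubectl get nodes -o name); do kubectl cordon "$node"; done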
Step 5: Shut Down Worker Nodes
Proceed to shut down the worker nodes. This step can be performed in parallel across all nodes.
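For example, each worker can be powered off from its own shell or over SSH; the hostname below is a placeholder:
ssh <worker-node> 'sudo shutdown -h now'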
Step 6: Stop Services on Master Nodes
On each master node, stop the following services:
sudo systemctl stop pf9-hostagent pf9-comms pf9-nodeletd pf9-kubelet
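To confirm the services are no longer running, their state can be checked, for example:
systemctl is-active pf9-hostagent pf9-comms pf9-nodeletd pf9-kubelet
Each unit should report inactive once it has stopped.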
Step 7: Stop Nodelet Phases on Master Nodes
Execute the following command to stop nodelet phases. For the last master node, include the --force option:
/opt/pf9/nodelet/nodeletd phases stop
/opt/pf9/nodelet/nodeletd phases stop --force # For the last master node
Step 8: Shut Down Master Nodes
Shut down the master nodes one at a time, ensuring stability throughout the process.
2. Starting the Cluster
Step 1: Boot Up Nodes
Start by booting up the master nodes one at a time, followed by the worker nodes. Allow sufficient time for the operating system and container runtime to come up on each node.
Step 2: Verify Services
Confirm that all Platform9 services are running using the command below. All services must be listed as active and running:
systemctl list-units --all | grep 'pf9'
Reference Output:
$ systemctl list-units --all | grep 'pf9'
pf9-comms.service loaded active running Platform9 Communications Service
pf9-hostagent.service loaded active running Platform9 Host Agent Service
pf9-kubelet.service loaded active running Platform9 Kubelet Agent
pf9-node-exporter.service loaded active running Platform9 node exporter
pf9-nodeletd.service loaded active running Platform9 Kubernetes Management Agent Service
pf9-prometheus.service loaded active running Platform9 prometheus
pf9-sidekick.service loaded active running Platform9 Sidekick Service
Step 3: Verify Cluster Health
Use kubectl get nodes to check whether all nodes are marked as "Ready." This step confirms the cluster's health and operational status.
kubectl get nodes
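If you prefer to block until every node reports Ready, kubectl wait can be used; the timeout below is only an example:
kubectl wait --for=condition=Ready node --all --timeout=15m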
Step 4: Uncordon Nodes
All nodes will initially be in a SchedulingDisabled state. Uncordon each node to make them schedulable again.
kubectl uncordon <node-name>
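As with cordoning, all nodes can be uncordoned in one pass with a simple loop, for example:
for node in $(kubectl get nodes -o name); do kubectl uncordon "$node"; done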
Step 5: Scale Up Workloads
Finally, scale up workloads to their original state or as needed.
kubectl scale deployment <deployment-name> -n <namespace> --replicas=<desired_number_of_replicas>
kubectl scale statefulset <statefulset-name> -n <namespace> --replicas=<desired_number_of_replicas>
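If replica counts were recorded during shutdown (as in the illustrative sketch in Step 3 of the shutdown procedure, which used the example files deployment-replicas.txt and statefulset-replicas.txt), they can be restored with a loop such as:
while read -r ns name replicas; do kubectl scale deployment "$name" -n "$ns" --replicas="$replicas"; done < deployment-replicas.txt
while read -r ns name replicas; do kubectl scale statefulset "$name" -n "$ns" --replicas="$replicas"; done < statefulset-replicas.txt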
Step 6: Validate Cluster Functionality
- System Services: Verify that all system services, including DNS and networking, are functioning correctly. Try running test pods or services; see the example after this list.
- Application Workloads: Confirm that all scaled-up Deployments and StatefulSets are running without errors. Check logs and status with kubectl logs <pod-name> and kubectl describe pod <pod-name>.
- External Connectivity: If the cluster exposes services externally, validate that ingress resources and load balancers are correctly forwarding traffic.
- Persistent Storage: Ensure that all persistent volumes are attached and accessible by the relevant pods. This step is crucial for stateful applications.
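As a quick functional check, a throwaway pod can be used to test in-cluster DNS, followed by a review of volumes and claims; the pod name and image below are only examples:
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
kubectl get pv
kubectl get pvc --all-namespaces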
Step 7: Monitoring and Logs (Optional)
- Monitor System Logs: Keep an eye on system logs for any errors or warnings that might indicate problems, using commands like journalctl -u <service-name>; see the example after this list.
- Cluster Resource Monitoring: Utilize monitoring tools (such as Grafana) to observe the cluster's performance and resource usage, ensuring it returns to normal operational parameters.
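For example, to review recent logs from one of the Platform9 services (the unit name and time window are illustrative):
journalctl -u pf9-kubelet --since "1 hour ago" --no-pager | tail -n 50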
Step 8: Restore Cluster from etcd Backup (Optional)
- If, after restarting the cluster, you encounter irrecoverable issues or data inconsistencies, restoring from an etcd backup may be necessary. This step should be considered a last resort, as it involves rolling back the cluster state to the point of the last backup.
- Please follow the Platform9 official documentation for restoring from an etcd backup.
Review and Update: Based on the shutdown and restart experience, review and update the cluster maintenance procedures to include any new insights or corrective actions that could improve future processes.
Additional Information
If you encounter any issues during the shutdown or restart process, please contact our Support team for assistance. Providing details of the steps taken and the issues encountered will help expedite the resolution process.