How to Safely Shutdown and Restart a Kubernetes Cluster
Problem
Cluster administrators must follow specific steps to safely shut down and restart a Kubernetes cluster. This process is essential to avoid data loss and ensure a smooth recovery of cluster services.
Environment
- Platform9 Managed Kubernetes - All Versions
Answer
1. Shutting Down the Cluster
Step 1: Pre-shutdown Checks
- Document the Process: Record the shutdown and restart process, including any issues encountered and how they were resolved. This documentation will be invaluable for future maintenance or recurring issues.
- Check Cluster Status: Before proceeding with the shutdown, ensure that the cluster is in a healthy state. Use kubectl get nodes and kubectl get pods --all-namespaces to check for any existing issues; see the example commands after this list.
- Notify Stakeholders: Inform all stakeholders about the planned maintenance to avoid unexpected disruptions.
- Validate Backup Integrity: Confirm the integrity and accessibility of the etcd backup and any other critical data backups.
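For example, a quick pre-shutdown health sweep might look like the following; these commands are only illustrative and can be adjusted to your environment:
kubectl get nodes -o wide
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -n 20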
Step 2: Backup etcd Data
The first and most crucial step is to back up the etcd data. This backup is essential for recovery in case issues arise during the cluster restart process. Use the following command to save an etcd snapshot, and be sure to move the backup file to a safe location outside the cluster:
/opt/pf9/pf9-kube/bin/etcdctl snapshot save </path/to/backup.db> --cacert=/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert=/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key=/etc/pf9/kube.d/certs/etcdctl/etcd/request.key --endpoints=https://<NODE_IP>:4001
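Before moving the file off the node, you can optionally confirm the snapshot is readable; a minimal check, assuming the bundled etcdctl supports the standard snapshot status subcommand, with the backup path as a placeholder:
/opt/pf9/pf9-kube/bin/etcdctl snapshot status </path/to/backup.db> --write-out=table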
Step 3: Scale Down Workloads
Gracefully terminate all pods within each deployment or StatefulSet by scaling them down to zero. Avoid scaling down critical system pods, such as DNS or CNI, to maintain cluster functionality.
kubectl scale deployment <deployment-name> -n <namespace> --replicas=0
kubectl scale statefulset <statefulset-name> -n <namespace> --replicas=0
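It can also help to record the current replica counts before scaling down, so the original state can be restored later during startup (Step 5 of the start procedure). A minimal sketch; the output file names are only examples:
kubectl get deployments --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' > deployment-replicas.txt
kubectl get statefulsets --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' > statefulset-replicas.txt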
Step 4: Cordon Worker Nodes
Make all nodes unschedulable to prevent new pods from being assigned to them.
kubectl cordon <node-name>
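To cordon every node in one pass, a simple loop such as the following can be used; filter the node list first if you only want to cordon workers:
for node in $(kubectl get nodes -o name); do kubectl cordon "$node"; done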
Step 5: Shut Down Worker Nodes
Proceed to shut down the worker nodes. This step can be performed in parallel across all nodes.
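For example, each worker can be powered off from its own shell or over SSH; the hostname below is a placeholder:
ssh <worker-node> 'sudo shutdown -h now'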
Step 6: Stop Services on Master Nodes
On each master node, stop the following services:
sudo systemctl stop pf9-hostagent pf9-comms pf9-nodeletd pf9-kubelet
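To confirm the services are no longer running, their state can be checked, for example:
systemctl is-active pf9-hostagent pf9-comms pf9-nodeletd pf9-kubelet
Each unit should report inactive once it has stopped.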
Step 7: Stop Nodelet Phases on Master Nodes
Execute the following command to stop nodelet phases. For the last master node, include the --force option:
/opt/pf9/nodelet/nodeletd phases stop
/opt/pf9/nodelet/nodeletd phases stop --force # For the last master node
Step 8: Shut Down Master Nodes
Shut down the master nodes one at a time, ensuring stability throughout the process.
2. Starting the Cluster
Step 1: Boot Up Nodes
Start by booting up the master nodes one at a time, followed by the worker nodes. Allow sufficient time for the operating system and container runtime to come up on each node.
Step 2: Verify Services
Confirm that all Platform9 services are running using the command below. All services must be listed as active and running:
systemctl list-units --all | grep 'pf9'
Reference Output:
$ systemctl list-units --all | grep 'pf9'
pf9-comms.service loaded active running Platform9 Communications Service
pf9-hostagent.service loaded active running Platform9 Host Agent Service
pf9-kubelet.service loaded active running Platform9 Kubelet Agent
pf9-node-exporter.service loaded active running Platform9 node exporter
pf9-nodeletd.service loaded active running Platform9 Kubernetes Management Agent Service
pf9-prometheus.service loaded active running Platform9 prometheus
pf9-sidekick.service loaded active running Platform9 Sidekick Service
Step 3: Verify Cluster Health
Use kubectl get nodes to check whether all nodes are marked as "Ready." This step confirms the cluster's health and operational status.
kubectl get nodes
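If you prefer to block until every node reports Ready, kubectl wait can be used; the timeout below is only an example:
kubectl wait --for=condition=Ready node --all --timeout=15m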
Step 4: Uncordon Nodes
All nodes will initially be in a SchedulingDisabled state. Uncordon each node to make them schedulable again.
kubectl uncordon <node-name>
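As with cordoning, all nodes can be uncordoned in one pass with a simple loop, for example:
for node in $(kubectl get nodes -o name); do kubectl uncordon "$node"; done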
Step 5: Scale Up Workloads
Finally, scale up workloads to their original state or as needed.
kubectl scale deployment <deployment-name> -n <namespace> --replicas=<desired_number_of_replicas>
kubectl scale statefulset <statefulset-name> -n <namespace> --replicas=<desired_number_of_replicas>
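If replica counts were recorded during shutdown (as in the illustrative sketch in Step 3 of the shutdown procedure, which used the example files deployment-replicas.txt and statefulset-replicas.txt), they can be restored with a loop such as:
while read -r ns name replicas; do kubectl scale deployment "$name" -n "$ns" --replicas="$replicas"; done < deployment-replicas.txt
while read -r ns name replicas; do kubectl scale statefulset "$name" -n "$ns" --replicas="$replicas"; done < statefulset-replicas.txt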
Step 6: Validate Cluster Functionality
- System Services: Verify that all system services, including DNS and networking, are functioning correctly. Try running test pods or services; see the example after this list.
- Application Workloads: Confirm that all scaled-up Deployments and StatefulSets are running without errors. Check logs and status with kubectl logs <pod-name> and kubectl describe pod <pod-name>.
- External Connectivity: If the cluster exposes services externally, validate that ingress resources and load balancers are correctly forwarding traffic.
- Persistent Storage: Ensure that all persistent volumes are attached and accessible by the relevant pods. This step is crucial for stateful applications.
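As a quick functional check, a throwaway pod can be used to test in-cluster DNS, followed by a review of volumes and claims; the pod name and image below are only examples:
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
kubectl get pv
kubectl get pvc --all-namespaces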
Step 7: Monitoring and Logs (Optional)
- Monitor System Logs: Keep an eye on system logs for any errors or warnings that might indicate problems, using commands like journalctl -u <service-name>; see the example after this list.
- Cluster Resource Monitoring: Utilize monitoring tools (such as Grafana) to observe the cluster's performance and resource usage, ensuring it returns to normal operational parameters.
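For example, to review recent logs from one of the Platform9 services (the unit name and time window are illustrative):
journalctl -u pf9-kubelet --since "1 hour ago" --no-pager | tail -n 50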
Step 8: Restore Cluster from etcd Backup (Optional)
- If, after restarting the cluster, you encounter irrecoverable issues or data inconsistencies, restoring from an etcd backup may be necessary. This step should be considered a last resort, as it involves rolling back the cluster state to the point of the last backup.
- Please follow the Platform9 official documentation for restoring from an etcd backup.
Review and Update: Based on the shutdown and restart experience, review and update the cluster maintenance procedures to include any new insights or corrective actions that could improve future processes.
Additional Information
If you encounter any issues during the shutdown or restart process, please contact our Support team for assistance. Providing details of the steps taken and the issues encountered will help expedite the resolution process.