How to Safely Shut Down and Restart a Kubernetes Cluster

Problem

Cluster administrators must follow specific steps to safely shut down and restart a Kubernetes cluster. This process is essential to avoid data loss and ensure a smooth recovery of cluster services.

Environment

  • Platform9 Managed Kubernetes - All Versions

Answer

1. Shutting Down the Cluster

Step 1: Pre-shutdown Checks

  • Check Cluster Status: Before proceeding with the shutdown, ensure that the cluster is in a healthy state. Use kubectl get nodes and kubectl get pods --all-namespaces to check for any existing issues.
  • Notify Stakeholders: Inform all stakeholders about the planned maintenance to avoid unexpected disruptions.
  • Validate Backup Integrity: Confirm the integrity and accessibility of the etcd backup and any other critical data backups.
  • Document the Process: Record the shutdown and restart process as you go, including any issues encountered and how they were resolved. This documentation will be invaluable for future maintenance or recurring issues.

Step 2: Backup etcd Data

The first and most crucial step is to back up the etcd data. This backup is essential for recovery if issues arise during the cluster restart. Save an etcd snapshot with the following command, then move the backup file to a safe location outside the cluster:

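A typical snapshot command is shown below. The endpoint, certificate paths, and backup destination are illustrative and must be adjusted to match the etcd configuration on your master nodes:

```bash
# Save a snapshot of etcd (v3 API). Certificate and endpoint values
# are placeholders -- substitute the paths used on your master node.
ETCDCTL_API=3 etcdctl snapshot save /var/backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/etcd/ca.crt \
  --cert=/path/to/etcd/server.crt \
  --key=/path/to/etcd/server.key

# Copy the snapshot off the node, e.g. to a dedicated backup host.
scp /var/backup/etcd-snapshot.db backup-host:/backups/
```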

Step 3: Scale Down Workloads

Gracefully terminate all pods within each deployment or StatefulSet by scaling them down to zero. Avoid scaling down critical system pods, such as DNS or CNI, to maintain cluster functionality.

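For example, record the current replica counts and then scale each application namespace to zero (the namespace placeholder is yours to fill in; do not scale down kube-system or other namespaces hosting DNS, CNI, or similar system pods):

```bash
# Record current replica counts so workloads can be restored later.
kubectl get deployments,statefulsets --all-namespaces -o wide > replica-counts.txt

# Scale all deployments and statefulsets in a namespace to zero.
# Repeat for each application namespace.
kubectl scale deployment --all --replicas=0 -n <namespace>
kubectl scale statefulset --all --replicas=0 -n <namespace>
```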

Step 4: Cordon Worker Nodes

Make all nodes unschedulable to prevent new pods from being assigned to them.

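This can be done per node with kubectl cordon <node-name>, or for all nodes at once:

```bash
# Mark every node unschedulable. "kubectl get nodes -o name" emits
# node/<name> entries, which xargs feeds to "kubectl cordon".
kubectl get nodes -o name | xargs -n1 kubectl cordon
```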

Step 5: Shut Down Worker Nodes

Proceed to shut down the worker nodes. This step can be performed in parallel across all nodes.
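For example, run the following on each worker node, or drive it over SSH from an administrative host (the hostnames below are placeholders):

```bash
# On each worker node; requires root privileges.
sudo shutdown -h now

# Or, in parallel from an admin host (hostnames are examples):
for node in worker-1 worker-2 worker-3; do
  ssh "$node" 'sudo shutdown -h now' &
done
wait
```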

Step 6: Stop Services on Master Nodes

On each master node, stop the following services:

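The exact Platform9 service names vary by version; the unit names below (pf9-hostagent, pf9-comms) are examples only. List the pf9-* units present on the node first, then stop them:

```bash
# Discover which Platform9 units exist on this master node.
systemctl list-units 'pf9-*' --all

# Stop them; the service names shown here are examples only.
sudo systemctl stop pf9-hostagent pf9-comms
```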

Step 7: Stop Nodelet Phases on Master Nodes

Execute the following command to stop nodelet phases. For the last master node, include the --force option:

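A common form of this command on Platform9 nodes is shown below; the binary path and subcommand may differ by version, so verify them against the official Platform9 documentation before running:

```bash
# Stop nodelet phases on this master node (path is the usual
# Platform9 location -- verify on your nodes).
sudo /opt/pf9/nodelet/nodeletd phases stop

# On the last remaining master node, include --force:
sudo /opt/pf9/nodelet/nodeletd phases stop --force
```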

Step 8: Shut Down Master Nodes

Shut down the master nodes one at a time, ensuring stability throughout the process.

2. Starting the Cluster

Step 1: Boot Up Nodes

Start by booting up the master nodes one at a time, followed by the worker nodes. Allow sufficient time for the operating system and container runtime to boot up on each node.

Step 2: Verify Services

Confirm that all Platform9 services are running using the command below. All services must be listed as active and running:

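For example, list the pf9-* systemd units and check their state (unit names vary by version; pf9-hostagent is an example):

```bash
# All pf9-* units should report active (running).
systemctl list-units 'pf9-*' --state=running

# Or inspect an individual service (name is an example):
systemctl status pf9-hostagent
```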


Step 3: Verify Cluster Health

Use kubectl get nodes to check if all nodes are marked as "Ready." This step confirms the cluster's health and operational status.

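Every node should report STATUS "Ready" (node names and versions below are illustrative):

```bash
kubectl get nodes
# Example shape of healthy output:
# NAME       STATUS   ROLES    AGE   VERSION
# master-1   Ready    master   90d   v1.27.x
# worker-1   Ready    worker   90d   v1.27.x
```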

Step 4: Uncordon Nodes

All nodes will initially be in a SchedulingDisabled state. Uncordon each node to make them schedulable again.

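As with cordoning, this can be done per node with kubectl uncordon <node-name>, or for all nodes at once:

```bash
# Make every node schedulable again.
kubectl get nodes -o name | xargs -n1 kubectl uncordon
```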

Step 5: Scale Up Workloads

Finally, scale up workloads to their original state or as needed.

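Using the replica counts noted before the shutdown, restore each workload (the names and counts below are placeholders):

```bash
# Restore each workload to its pre-shutdown replica count.
kubectl scale deployment <deployment-name> --replicas=<original-count> -n <namespace>
kubectl scale statefulset <statefulset-name> --replicas=<original-count> -n <namespace>

# Confirm that pods return to the Running state.
kubectl get pods --all-namespaces
```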

Step 6: Validate Cluster Functionality

  • System Services: Verify that all system services, including DNS and networking, are functioning correctly. Try running test pods or services.
  • Application Workloads: Confirm that all scaled-up deployments and stateful sets are running without errors. Check logs and status with kubectl logs <pod-name> and kubectl describe pod <pod-name>.
  • External Connectivity: If the cluster exposes services externally, validate that ingress resources and load balancers are correctly forwarding traffic.
  • Persistent Storage: Ensure that all persistent volumes are attached and accessible by the relevant pods. This step is crucial for stateful applications.

Step 7: Monitoring and Logs (Optional)

  • Monitor System Logs: Watch system logs for errors or warnings that might indicate problems, using commands like journalctl -u <service-name>.
  • Cluster Resource Monitoring: Utilize monitoring tools (like Grafana) to observe the cluster's performance and resource usage, ensuring it returns to normal operational parameters.

Step 8: Restore Cluster from etcd Backup (Optional)

  • If, after restarting the cluster, you encounter irrecoverable issues or data inconsistencies, restoring from an etcd backup may be necessary. This should be considered a last resort, as it rolls the cluster state back to the point of the last backup.
  • Please follow the official Platform9 documentation for restoring from an etcd backup.

Review and Update: Based on the shutdown and restart experience, review and update the cluster maintenance procedures to include any new insights or corrective actions that could improve future processes.

Additional Information

If you encounter any issues during the shutdown or restart process, please contact Platform9 Support for assistance. Providing details of the steps taken and the issues encountered will help expedite resolution.
