Cluster Operations Fail With Error - etcdserver: mvcc: database space exceeded

Problem

The below errors are observed while performing any operation on the cluster.

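For example, a write operation through kubectl may be rejected with an error similar to the following (illustrative output; the exact command and resource will vary):

Error from server: etcdserver: mvcc: database space exceeded
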
Requests fail with a no-space warning:

Logs

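The ETCD member logs report the exceeded quota, and the cluster raises the NOSPACE alarm. An illustrative log excerpt, followed by the alarm check (member IDs will differ):

etcdserver: mvcc: database space exceeded
# ETCDCTL_API=3 etcdctl alarm list
memberID:<member-id> alarm:NOSPACE
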
Environment

  • Platform9 Managed Kubernetes - All Versions
  • ETCD

Cause

The ETCD database size on the master node reached 2.1 GB, so ETCD stopped serving new write requests due to database size saturation.

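The current database size can be confirmed with etcdctl from inside the ETCD container; a minimal sketch (add --endpoints and the TLS flags --cacert/--cert/--key if your deployment requires them):

# ETCDCTL_API=3 etcdctl endpoint status --write-out=table

The DB SIZE column in the output reports the backend database size.
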
Resolution

Follow the steps below to increase the ETCD space to accommodate new write requests. Make sure you have an ETCD backup before proceeding with the steps.

Scale down the master nodes to a single master and perform the steps mentioned below.

If scaling down the master nodes is not possible, stop the PMK stack, update the ETCD DB size and disable the alarm on all master nodes, and later start the PMK stack on all the master nodes.

Once the PMK stack is up, perform Step 5 and Step 6 in a rolling fashion, starting with the non-leader master nodes and finishing on the active (leader) master node. It is important to note that the defrag action is blocking; the member will not respond until the defrag is complete. For this reason, defrag should be a rolling action.

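To identify which member is currently the leader before starting the rolling defrag, the cluster-wide endpoint status can be checked (sketch; --endpoints and TLS flags omitted):

# ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table

The IS LEADER column identifies the active member.
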
Note: In this example, the ETCD database space is increased to 6 GB.

  1. Stop the PMK stack.
Command
  2. For PMK v5.1 and K8s v1.20 and above, append the ETCD_QUOTA_BACKEND_BYTES and ETCD_SNAPSHOT_COUNT values to the /etc/pf9/kube_override.env file.
Sample

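Assuming the standard KEY=VALUE env-file format and the 6 GB quota (6442450944 bytes) used in this example, the appended lines would look like:

ETCD_QUOTA_BACKEND_BYTES=6442450944
ETCD_SNAPSHOT_COUNT=10000
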
Note: The variable ETCD_QUOTA_BACKEND_BYTES is used to increase the database space quota, and ETCD_SNAPSHOT_COUNT to reduce the number of committed transactions retained in memory before a snapshot is taken.

For PMK v5.1 and K8s v1.19 and below, modify the ensure_etcd_running() function in the /opt/pf9/pf9-kube/master_utils.sh file by adding the environment variables "-e ETCD_QUOTA_BACKEND_BYTES=6442450944" and "-e ETCD_SNAPSHOT_COUNT=10000".

These changes won't persist post cluster upgrade.

  3. Start the PMK stack by running the command:
Command
  4. Once the PMK stack is up, start the pf9-hostagent service.
Command

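Assuming the node uses systemd, the service can be started with:

# systemctl start pf9-hostagent
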
Once the ETCD container is up, the new variable values can be confirmed using: # docker inspect etcd | egrep -i "snapshot|quota"

  5. Run the compaction on the ETCD database by logging into the ETCD container.

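A minimal compaction sketch, run from inside the ETCD container (for example, docker exec -it etcd sh on Docker nodes, or crictl exec -it $(crictl ps --name etcd -q) sh on containerd nodes). The commands assume etcdctl API v3 and omit any --endpoints/TLS flags your deployment may require; they read the current revision and then compact away all older revisions:

# rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# ETCDCTL_API=3 etcdctl compact $rev
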
Ensure the ETCD storage is on an SSD that provides lower write latency, rather than an HDD.

  6. Perform defragmentation on ETCD to reclaim space.
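Defragmentation is then run against the local member from inside the same ETCD container (sketch; run on one member at a time, since the member does not respond while the defrag is in progress):

# ETCDCTL_API=3 etcdctl defrag
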
  7. Disable the space exceeded alarms for the etcd container.
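Once space has been reclaimed and the quota increased, the NOSPACE alarm can be cleared and verified from inside the ETCD container (sketch):

# ETCDCTL_API=3 etcdctl alarm disarm
# ETCDCTL_API=3 etcdctl alarm list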

Additional Information

If the ETCD DB fills up again within days of compaction and defragmentation, enabling ETCD_AUTO_COMPACTION is a better solution. Reference: etcd documentation

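Auto-compaction maps to the etcd environment variables ETCD_AUTO_COMPACTION_MODE and ETCD_AUTO_COMPACTION_RETENTION. Assuming these can be appended to /etc/pf9/kube_override.env in the same way as the quota settings above, an illustrative configuration (the 1h retention is an example value only) would be:

ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h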