Cluster Operations Fail With Error - etcdserver: mvcc: database space exceeded

Problem

The following error appears in the logs when performing any operation on the cluster:

etcdserver: mvcc: database space exceeded

Requests fail with a "no space" error:

etcdserver: failed to apply request,took 2.429µs,request header:[ID:1920634987875929770 ] txn:[compare:<target:MOD key:"/registry/services/endpoints/kube-system/kube-controller-manager" mod_revision:287319046 ] success:[request_put:<key:"/registry/services/endpoints/kube-system/kube-controller-manager" value_size:473 ]> failure:[]>,resp ,err is etcdserver: no space

Environment

  • Platform9 Managed Kubernetes - All Versions

  • ETCD

Cause

The ETCD database on the master node reached 2.1 GB, hitting the ETCD backend storage quota. ETCD then raised a NOSPACE alarm and stopped accepting new write requests.
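The 2.1 GB figure lines up with etcd's built-in default backend quota of 2 GiB once it is expressed in decimal gigabytes. The sketch below shows that arithmetic and then, if `etcdctl` is available, prints the current database size and any active alarms. It assumes it is run inside the etcd container and that any endpoint or TLS settings your deployment needs are supplied separately. The helper name `to_decimal_gb` is illustrative only.

```shell
#!/bin/sh
# etcd's built-in default backend quota: 2 GiB.
default_quota_bytes=$((2 * 1024 * 1024 * 1024))

# Express a byte count in decimal gigabytes (1 GB = 10^9 bytes).
to_decimal_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f GB\n", b / 1e9 }'
}

echo "default quota: $default_quota_bytes bytes = $(to_decimal_gb "$default_quota_bytes")"

# Assumption: etcdctl v3 is on the PATH, for example inside the etcd container
# (docker exec -it etcd sh). Endpoint and TLS settings are deployment-specific;
# etcdctl can read them from ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT
# and ETCDCTL_KEY.
if command -v etcdctl >/dev/null 2>&1; then
  ETCDCTL_API=3 etcdctl endpoint status --write-out=table   # see the DB SIZE column
  ETCDCTL_API=3 etcdctl alarm list                          # lists NOSPACE if the quota was hit
fi
```

The first line printed is `default quota: 2147483648 bytes = 2.1 GB`.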

Resolution

Follow the steps below to increase the ETCD database quota so that new write requests can be accepted. Make sure you have an ETCD backup before proceeding.

Note: In this example, the ETCD database space is increased to 6 GB.

  1. Stop the PMK stack.

  1. For PMK v5.1 and K8s v1.20 and above, append the ETCD_QUOTA_BACKEND_BYTES and ETCD_SNAPSHOT_COUNT values to the /etc/pf9/kube_override.env file.

Note: ETCD_QUOTA_BACKEND_BYTES raises the storage quota of the ETCD backend database. ETCD_SNAPSHOT_COUNT sets the number of committed transactions that trigger a snapshot; lowering it reduces the amount of Raft log held in memory.
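As a minimal sketch, the 6 GB quota used in this article corresponds to 6 × 1024³ = 6442450944 bytes, and the snapshot count used is 10000. The snippet below derives the byte value and prints the two override lines. It assumes that kube_override.env accepts plain KEY=VALUE lines, which this article does not show explicitly. The function name `override_lines` is illustrative only.

```shell
#!/bin/sh
# Derive the quota in bytes from a size in GiB.
quota_gib=6
ETCD_QUOTA_BACKEND_BYTES=$((quota_gib * 1024 * 1024 * 1024))
ETCD_SNAPSHOT_COUNT=10000

# Print the lines to append to the override file.
# Assumption: the file takes plain KEY=VALUE entries.
override_lines() {
  printf 'ETCD_QUOTA_BACKEND_BYTES=%s\n' "$ETCD_QUOTA_BACKEND_BYTES"
  printf 'ETCD_SNAPSHOT_COUNT=%s\n' "$ETCD_SNAPSHOT_COUNT"
}

override_lines

# To apply on a master node, the output could be appended like this:
#   override_lines | sudo tee -a /etc/pf9/kube_override.env
```

Printing the lines first, instead of writing to the file directly, makes it easy to review the values before changing the node configuration.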

Note

For PMK v5.1 and K8s v1.19 and below, modify the ensure_etcd_running() function in the /opt/pf9/pf9-kube/master_utils.sh file by adding the environment variables "-e ETCD_QUOTA_BACKEND_BYTES=6442450944" and "-e ETCD_SNAPSHOT_COUNT=10000".

These changes do not persist after a cluster upgrade.

  1. Start the PMK stack by running command:

  1. Once the PMK stack is up, start the pf9-hostagent service.

Info

Once the ETCD container is up, confirm the new variable values with:

# docker inspect etcd | egrep -i "snapshot|quota"

  1. Log in to the ETCD container and run compaction on the ETCD database.

  1. Perform defragmentation on ETCD to reclaim space.

  1. Disable the space exceeded alarms for the etcd container.
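The compaction, defragmentation, and alarm steps above can be sketched with the standard etcdctl v3 commands. This is a sketch under stated assumptions, not the exact PMK procedure. It assumes it is run inside the etcd container (for example via `docker exec -it etcd sh`) and that any endpoint and TLS settings your deployment needs are supplied separately. The helper name `current_revision` is illustrative only.

```shell
#!/bin/sh
# Assumption: etcdctl v3 is on the PATH. Endpoint and TLS settings are
# deployment-specific; etcdctl can read them from ETCDCTL_ENDPOINTS,
# ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY.
export ETCDCTL_API=3

# Extract the current revision from `endpoint status` JSON read on stdin.
current_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*'
}

if command -v etcdctl >/dev/null 2>&1; then
  rev=$(etcdctl endpoint status --write-out=json | current_revision)
  if [ -n "$rev" ]; then
    etcdctl compact "$rev" &&   # discard key history older than the current revision
    etcdctl defrag &&           # return the freed space to the filesystem
    etcdctl alarm disarm &&     # clear the NOSPACE alarm
    etcdctl alarm list          # prints nothing once no alarms remain
    etcdctl endpoint status --write-out=table   # DB SIZE should now be smaller
  else
    echo "could not determine the current revision" >&2
  fi
else
  echo "etcdctl not found; run these commands inside the etcd container" >&2
fi
```

Defragmentation acts only on the endpoints it is pointed at, so on a multi-master cluster it would need to be repeated for each ETCD member.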

Additional Information

If the ETCD database fills up again within days of compaction and defragmentation, enabling ETCD auto-compaction (ETCD_AUTO_COMPACTION_RETENTION, optionally with ETCD_AUTO_COMPACTION_MODE) is a better long-term solution. Reference: etcd documentation
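As a hedged illustration, etcd exposes auto-compaction through the `--auto-compaction-mode` and `--auto-compaction-retention` flags, which map to the environment variables below. The one-hour retention is an illustrative value, not a recommendation from this article. Whether these variables reach the etcd container through /etc/pf9/kube_override.env in the same way as ETCD_QUOTA_BACKEND_BYTES is an assumption that should be verified for your PMK version.

```shell
# Sketch: etcd auto-compaction settings expressed as environment variables.
# Assumption: they are passed to the etcd container the same way as
# ETCD_QUOTA_BACKEND_BYTES; this article does not confirm that.
ETCD_AUTO_COMPACTION_MODE=periodic   # compact on a time basis
ETCD_AUTO_COMPACTION_RETENTION=1     # illustrative: keep about 1 hour of history
```

Auto-compaction only discards old revisions. Space already allocated on disk is still reclaimed by defragmentation.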
