Cluster Operations Fail With Error - etcdserver: mvcc: database space exceeded
Problem
The following log entries appear while performing any operation on the cluster.
Requests are failing with no space warning:
etcdserver: failed to apply request,took 2.429µs,request header:[ID:1920634987875929770 ] txn:[compare:<target:MOD key:"/registry/services/endpoints/kube-system/kube-controller-manager" mod_revision:287319046 ] success:[request_put:<key:"/registry/services/endpoints/kube-system/kube-controller-manager" value_size:473 ]> failure:[]>,resp ,err is etcdserver: no space
Environment
- Platform9 Managed Kubernetes - All Versions
- ETCD
Cause
The ETCD database on the master node reached 2.1 GB and exceeded its backend storage quota (2 GB by default), so ETCD raised a no-space alarm and stopped accepting new write requests.
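To see how close a member is to its quota, the database size can be read from the endpoint status. The following is a minimal sketch, not part of the original procedure: the function name db_usage_percent is hypothetical, and the 2147483648-byte default assumes the quota has not already been overridden.

```shell
# Sketch: report ETCD database usage as a percentage of the backend quota.
# Reads the JSON printed by `etcdctl endpoint status --write-out=json` on stdin.
# Usage (inside the etcd container):
#   ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | db_usage_percent
db_usage_percent() {
    # First argument: quota in bytes. The 2 GiB default is an assumption.
    quota_bytes="${1:-2147483648}"
    # Take the first dbSize value found in the JSON.
    db_size=$(grep -o '"dbSize":[0-9]*' | head -n 1 | cut -d: -f2)
    [ -n "$db_size" ] || { echo "dbSize not found in input" >&2; return 1; }
    echo $(( db_size * 100 / quota_bytes ))
}
```

A value at or above 100 would be consistent with the no-space error shown in the Problem section.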
Resolution
Follow the steps below to increase the ETCD space so that it can accommodate new write requests. Make sure you have an ETCD backup before proceeding.
Scale down the master nodes to a single master and perform the steps below.
If scaling down the master nodes is not possible, stop the PMK stack, update the ETCD DB size, and disarm the alarm on all master nodes, then start the PMK stack on all master nodes.
Once the PMK stack is up, perform Step 5 and Step 6 (compaction and defragmentation) in a rolling fashion, starting with the non-leader master nodes and finishing with the active master node. Note that the defrag action is blocking: the member does not respond until the defrag is complete. For this reason, defrag should be a rolling action.
Note: In this example, the ETCD database space is increased to 6 GB.
- Stop the PMK stack.
systemctl stop pf9-hostagent pf9-nodeletd
/opt/pf9/nodelet/nodeletd phases stop
- For PMK v5.1 and K8s v1.20 and above, append the ETCD_QUOTA_BACKEND_BYTES and ETCD_SNAPSHOT_COUNT values to the /etc/pf9/kube_override.env file.
echo "export ETCD_SNAPSHOT_COUNT=10000" >> /etc/pf9/kube_override.env
echo "export ETCD_QUOTA_BACKEND_BYTES=6442450944" >> /etc/pf9/kube_override.env
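Note: ETCD_QUOTA_BACKEND_BYTES raises the ETCD backend database size quota (6442450944 bytes is 6 GiB), and ETCD_SNAPSHOT_COUNT sets the number of committed transactions that trigger a snapshot to disk, which limits how many entries are held in memory.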
For PMK v5.1 and K8s v1.19 and below, modify the ensure_etcd_running() function in the /opt/pf9/pf9-kube/master_utils.sh file by adding the environment variables "-e ETCD_QUOTA_BACKEND_BYTES=6442450944" and "-e ETCD_SNAPSHOT_COUNT=10000".
These changes do not persist after a cluster upgrade.
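For the kube_override.env approach, running the echo commands more than once appends duplicate lines. The sketch below shows one way to make the update repeatable and to compute the quota instead of typing it by hand. The helper name set_override is hypothetical and is not part of PMK.

```shell
# Sketch: set "export KEY=VALUE" in an env file without duplicating lines on re-runs.
set_override() {
    file="$1"; key="$2"; value="$3"
    touch "$file"
    tmp=$(mktemp)
    # Drop any earlier definition of the key, then append the new one.
    grep -v "^export ${key}=" "$file" > "$tmp" || true
    echo "export ${key}=${value}" >> "$tmp"
    # Write back through the original file so its ownership and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# 6 GiB expressed in bytes.
QUOTA_BYTES=$((6 * 1024 * 1024 * 1024))

# Usage on a master node (file path taken from the step above):
#   set_override /etc/pf9/kube_override.env ETCD_SNAPSHOT_COUNT 10000
#   set_override /etc/pf9/kube_override.env ETCD_QUOTA_BACKEND_BYTES "$QUOTA_BYTES"
```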
- Start the PMK stack by running command:
/opt/pf9/nodelet/nodeletd phases start
- Once the PMK stack is up, start the pf9-hostagent service.
# systemctl start pf9-hostagent
Once the ETCD container is up, the new variable values can be confirmed using:
# docker inspect etcd | egrep -i "snapshot|quota"
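The egrep check above is read by eye. If an exact match is preferred, a small filter along the following lines might help. The helper name env_has is hypothetical, and the docker inspect format string in the usage comment assumes the variables appear in the container's Config.Env list.

```shell
# Sketch: succeed only if stdin contains the exact line KEY=VALUE.
env_has() {
    grep -qxF "$1=$2"
}

# Usage on a master node:
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' etcd \
#       | env_has ETCD_QUOTA_BACKEND_BYTES 6442450944 && echo "quota applied"
```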
- Run the compaction on ETCD database by logging into the ETCD container.
docker exec -it etcd /bin/sh
# rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# ETCDCTL_API=3 etcdctl compact $rev
compacted revision 163674
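If the revision lookup returns nothing, the compact command above runs with an empty argument. A hedged sketch of the same extraction with a sanity check is shown below. The function name current_revision is hypothetical, and the sketch assumes grep, head, and cut are available in the container shell.

```shell
# Sketch: extract the current revision from `endpoint status` JSON read on stdin.
current_revision() {
    rev=$(grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2)
    # Refuse to continue with an empty or non-numeric value.
    case "$rev" in
        ''|*[!0-9]*) echo "could not determine revision" >&2; return 1 ;;
    esac
    echo "$rev"
}

# Usage inside the etcd container:
#   rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | current_revision) \
#       && ETCDCTL_API=3 etcdctl compact "$rev"
```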
Ensure the ETCD storage uses an SSD rather than an HDD, since an SSD provides lower write latency.
- Perform defragmentation on ETCD to reclaim space.
# ETCDCTL_API=3 etcdctl defrag
- Disarm the space exceeded alarm for the etcd container.
# ETCDCTL_API=3 etcdctl alarm disarm
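On a multi-master cluster, the rolling order for defragmentation (non-leaders first, leader last) could be sketched as follows. The function names and the input format are hypothetical: the sketch assumes the operator supplies one "ENDPOINT IS_LEADER" pair per line, for example copied from the IS LEADER column of `etcdctl endpoint status --write-out=table`, and that etcdctl can reach each endpoint without further options.

```shell
# Sketch: print endpoints so that non-leaders come first and the leader last.
# Input on stdin: one "ENDPOINT IS_LEADER" pair per line, IS_LEADER being true or false.
rolling_order() {
    awk '$2 == "false" { print $1 }
         $2 == "true"  { leader = leader $1 "\n" }
         END           { printf "%s", leader }'
}

# Sketch: defragment one member at a time in that order, stopping on the first failure.
rolling_defrag() {
    for ep in $(rolling_order); do
        echo "defragmenting $ep"
        ETCDCTL_API=3 etcdctl --endpoints="$ep" defrag || return 1
    done
}

# Usage (endpoint names are placeholders):
#   printf 'master-1:2379 false\nmaster-2:2379 true\nmaster-3:2379 false\n' | rolling_defrag
```

Because defrag blocks the member until it finishes, handling one endpoint per iteration keeps the remaining members available.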
Additional Information
If the ETCD DB fills up again within days of compaction and defragmentation, enabling ETCD_AUTO_COMPACTION is a better solution. Reference: etcd documentation
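As an illustration only, etcd exposes auto compaction through the --auto-compaction-mode and --auto-compaction-retention flags, which map to the environment variables below. The values are example choices, and whether PMK passes these variables from kube_override.env through to the etcd container is an assumption that should be verified before relying on it.

```shell
# Example values (assumptions, not PMK defaults):
# compact periodically and keep roughly one hour of revision history.
export ETCD_AUTO_COMPACTION_MODE=periodic
export ETCD_AUTO_COMPACTION_RETENTION=1
```

Auto compaction removes old revisions but does not return the freed space to the file system, so defragmentation may still be needed from time to time.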