Cluster Operations Fail With Error - etcdserver: mvcc: database space exceeded

Problem

The following log messages appear whenever any operation is performed on the cluster.

Requests are failing with no space warning:

Logs

etcdserver: failed to apply request,took 2.429µs,request header:[ID:1920634987875929770 ] txn:[compare:<target:MOD key:"/registry/services/endpoints/kube-system/kube-controller-manager" mod_revision:287319046 ] success:[request_put:<key:"/registry/services/endpoints/kube-system/kube-controller-manager" value_size:473 ]> failure:[]>,resp ,err is etcdserver: no space

Environment

Platform9 Managed Kubernetes - All Versions
ETCD

Cause

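The ETCD database on the master node reached 2.1 GB, which is consistent with etcd's default backend quota of about 2 GB. Once the quota is exceeded, ETCD raises a no-space alarm and stops accepting new write requests until space is reclaimed and the alarm is cleared.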
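To make the saturation condition concrete, here is a minimal sketch that compares a database size against a quota. The helper name `quota_usage_pct` is hypothetical, and the byte values are illustrative; on a live cluster the size could be read from the `dbSize` field of `etcdctl endpoint status --write-out="json"`.

```shell
#!/bin/sh
# Hypothetical helper: percentage of the etcd backend quota in use.
# $1 = current database size in bytes, $2 = quota in bytes.
quota_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Illustrative values: a 2 GiB database measured against a 6 GiB quota.
quota_usage_pct 2147483648 6442450944
```

A result close to or above 100 indicates that writes are about to be, or already are, rejected.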

Resolution

Follow the steps below to raise the ETCD storage quota so that new write requests can be accepted. Make sure you have an ETCD backup before proceeding.

Scale down the master nodes to a single master and perform the steps mentioned below.

If the master nodes cannot be scaled down, stop the PMK stack, update the etcd DB size, and disable the alarm on all master nodes, then start the PMK stack on all the master nodes.

Once the PMK stack is up, perform Step 5 and Step 6 in a rolling fashion, starting with the non-leader master nodes and finishing with the active master node. Note that the defrag action is blocking: the member will not respond until the defrag is complete. For this reason, defrag should be a rolling action.
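The rolling order described above (non-leaders first, the active master last) can be sketched as a small filter. This is only an illustration: the function name `defrag_order` and the two-column "endpoint is_leader" input format are assumptions, prepared by hand from whatever leader information is available, and the addresses are placeholders.

```shell
#!/bin/sh
# Reads "endpoint is_leader" pairs on stdin (a hypothetical, operator-prepared
# format) and prints the endpoints with every non-leader before the leader.
defrag_order() {
    awk '
        $2 == "false" { print $1 }            # non-leaders go first
        $2 == "true"  { leader[++n] = $1 }    # hold the leader back
        END { for (i = 1; i <= n; i++) print leader[i] }
    '
}

# Illustrative endpoints only.
printf '%s\n' \
    "10.0.0.1:2379 true" \
    "10.0.0.2:2379 false" \
    "10.0.0.3:2379 false" | defrag_order
```

Working through the printed list one member at a time keeps at most one member unresponsive at any moment.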

Note: In this example, the ETCD database space is increased to 6 GB.
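The quota is expressed in bytes, so the 6 GB figure has to be converted before it is used. A minimal sketch of that arithmetic follows; the helper name `gib_to_bytes` is hypothetical, and it assumes binary units (1 GiB = 1024^3 bytes).

```shell
#!/bin/sh
# Hypothetical helper: convert a whole number of GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 6   # prints 6442450944
```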

Stop the PMK stack.

Command

systemctl stop pf9-hostagent pf9-nodeletd
/opt/pf9/nodelet/nodeletd phases stop

For PMK v5.1 and K8s v1.20 and above, append the ETCD_QUOTA_BACKEND_BYTES and ETCD_SNAPSHOT_COUNT values to the /etc/pf9/kube_override.env file.

Sample

echo "export ETCD_SNAPSHOT_COUNT=10000" >> /etc/pf9/kube_override.env
echo "export ETCD_QUOTA_BACKEND_BYTES=6442450944" >> /etc/pf9/kube_override.env
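Because `>>` appends unconditionally, running the sample twice leaves duplicate lines in the override file. One possible way to avoid that is sketched below. The helper `set_override` is hypothetical and not part of PMK, and the demonstration writes to a scratch file rather than /etc/pf9/kube_override.env.

```shell
#!/bin/sh
# Hypothetical helper: set "export NAME=VALUE" in FILE exactly once,
# replacing any earlier export of the same variable.
set_override() {
    file=$1 name=$2 value=$3
    touch "$file"
    # Keep every line except a previous export of NAME; grep exits 1
    # when nothing is left to print, which is not an error here.
    grep -v "^export ${name}=" "$file" > "${file}.tmp" || true
    echo "export ${name}=${value}" >> "${file}.tmp"
    # Write back in place so the original file keeps its owner and mode.
    cat "${file}.tmp" > "$file"
    rm -f "${file}.tmp"
}

# Demonstration against a scratch file.
demo=$(mktemp)
set_override "$demo" ETCD_SNAPSHOT_COUNT 10000
set_override "$demo" ETCD_QUOTA_BACKEND_BYTES 6442450944
set_override "$demo" ETCD_QUOTA_BACKEND_BYTES 6442450944   # rerun: no duplicate
cat "$demo"
rm -f "$demo"
```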

Note: ETCD_QUOTA_BACKEND_BYTES raises the backend storage quota, and ETCD_SNAPSHOT_COUNT sets how many committed transactions trigger a snapshot, which reduces the number of entries held in memory between snapshots.

For PMK v5.1 and K8s v1.19 and below, modify the ensure_etcd_running() function in the /opt/pf9/pf9-kube/master_utils.sh file by adding the environment variables "-e ETCD_QUOTA_BACKEND_BYTES=6442450944" and "-e ETCD_SNAPSHOT_COUNT=10000".

These changes do not persist after a cluster upgrade.

Start the PMK stack by running command:

Command

/opt/pf9/nodelet/nodeletd phases start

Once the PMK stack is up, start the pf9-hostagent service.

Command

# systemctl start pf9-hostagent

Once the ETCD container is up, the new variable values can be confirmed using: # docker inspect etcd | egrep -i "snapshot|quota"

Run the compaction on ETCD database by logging into the ETCD container.

Docker

docker exec -it etcd /bin/sh
# rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# ETCDCTL_API=3 etcdctl compact $rev
compacted revision 163674
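The revision lookup in the compaction step is a two-stage text match on the JSON status output. The sketch below isolates that parsing so it can be tried without a running cluster. The function name `extract_revision` is hypothetical, and the JSON is an abbreviated, illustrative sample rather than captured output.

```shell
#!/bin/sh
# Hypothetical helper: pull the first "revision" number out of the JSON
# produced by `etcdctl endpoint status --write-out="json"` (read on stdin).
extract_revision() {
    grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Abbreviated, illustrative JSON; real output carries more fields.
sample='[{"Endpoint":":2379","Status":{"header":{"revision":163674,"raft_term":2},"dbSize":2147483648}}]'
printf '%s\n' "$sample" | extract_revision
```

Taking only the first match keeps the result a single number even if the status output lists more than one endpoint.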

Ensure that the ETCD storage is on SSD, which provides lower write latency than HDD.

Perform defragmentation on ETCD to reclaim space.

Docker

# ETCDCTL_API=3 etcdctl defrag

Disarm the space-exceeded alarm on the etcd container.

Docker

# ETCDCTL_API=3 etcdctl alarm disarm
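After disarming, it can be worth confirming that no space alarm remains. The sketch below checks text in the form printed by `etcdctl alarm list`. The function name `has_nospace_alarm` is hypothetical, and the member ID is a made-up sample value.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the alarm listing on stdin still
# contains a NOSPACE alarm.
has_nospace_alarm() {
    grep -q 'alarm:NOSPACE'
}

# Sample line in the shape printed by `etcdctl alarm list`.
if printf '%s\n' "memberID:8630161756594109333 alarm:NOSPACE" | has_nospace_alarm; then
    echo "NOSPACE alarm still raised"
fi

# An empty listing means no alarm is active.
if ! printf '' | has_nospace_alarm; then
    echo "no NOSPACE alarm"
fi
```

Inside the etcd container, the same check could be fed from `ETCDCTL_API=3 etcdctl alarm list`.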

Additional Information

If the ETCD database fills up again within days of compaction and defragmentation, enabling ETCD_AUTO_COMPACTION is a better solution. Reference: etcd documentation
