Troubleshooting Cluster Issues

Cluster Creation

Public Cloud Provider

Make sure the account you provided to PMK when creating the cloud provider has all the required privileges. See the AWS and Azure prerequisites in the Getting Started section for more details.

Cluster Creation Fails for BareOS

Navigate to the Infrastructure -> Clusters tab.
Click on the cluster name. This takes you to the cluster details page.
Click on the “Node Health” tab.

Here you should see a detailed breakdown of which nodes failed to install and which specific steps failed. Next, see Troubleshooting Node Issues.

Etcd

Heartbeat/Election Timeout Interval

Bash
    
2021-02-04 18:36:31.380207 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 124.999498ms, to 92d6e239c543436)
2021-02-04 18:36:31.380220 W | etcdserver: server is likely overloaded
2021-02-04 18:36:31.382208 W | etcdserver: read-only range request "key:\"/registry/mutatingwebhookconfigurations/vault-agent-injector-cfg\" " with result "range_response_count:1 size:2723" took too long (264.355727ms) to execute
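To gauge how often these warnings occur, the log can be summarized by message type. The following is a minimal sketch, not part of PMK: the function name `summarize_etcd_warnings` is hypothetical, it only matches the three message texts shown in the excerpt above, and it reads log text on standard input because how you obtain the etcd log depends on your setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the three etcd warning types shown above.
# Reads etcd log text on stdin and prints one summary line.
summarize_etcd_warnings() {
    awk '
        /failed to send out heartbeat on time/ { late++ }
        /server is likely overloaded/          { overloaded++ }
        /took too long/                        { slow++ }
        END {
            printf "heartbeat_late=%d overloaded=%d slow_requests=%d\n",
                   late, overloaded, slow
        }
    '
}
```

For example, `summarize_etcd_warnings < etcd.log` summarizes a saved copy of the log, where `etcd.log` is a placeholder file name.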

ETCD_HEARTBEAT_INTERVAL - This is the frequency with which the leader will notify followers that it is still the leader.

ETCD_ELECTION_TIMEOUT - This timeout is how long a follower node will go without hearing a heartbeat before attempting to become a leader itself.

By default, etcd uses a 100ms heartbeat interval and a 1000ms election timeout.

Bash
    
 
# cat /etc/pf9/kube.env | grep -i etcd
export ETCD_HEARTBEAT_INTERVAL="1000"
export ETCD_ELECTION_TIMEOUT="10000"
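As a quick sanity check on these two settings, the values can be read back from the file and compared. This is a minimal sketch under stated assumptions: the function name `check_etcd_timeouts` is hypothetical, the 10x threshold simply mirrors the ratio of the defaults quoted above (100ms and 1000ms) rather than an official PMK rule, and the parsing assumes each setting appears as an `export NAME="value"` line with no other digits on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the two etcd timing settings from a kube.env-style
# file and check that the election timeout is at least 10x the heartbeat.
# The 10x ratio is an assumption taken from the default values (100ms/1000ms).
check_etcd_timeouts() {
    local env_file="${1:-/etc/pf9/kube.env}"
    local heartbeat election

    # Keep only the digits of the last matching "export NAME=..." line.
    heartbeat=$(grep -E '^export ETCD_HEARTBEAT_INTERVAL=' "$env_file" | tail -n 1 | tr -cd '0-9')
    election=$(grep -E '^export ETCD_ELECTION_TIMEOUT=' "$env_file" | tail -n 1 | tr -cd '0-9')

    if [ -z "$heartbeat" ] || [ -z "$election" ]; then
        echo "ERROR: heartbeat or election timeout not set in $env_file"
        return 2
    fi

    if [ "$election" -lt $((heartbeat * 10)) ]; then
        echo "WARN: election=${election}ms is less than 10x heartbeat=${heartbeat}ms"
        return 1
    fi
    echo "OK: heartbeat=${heartbeat}ms election=${election}ms"
}
```

Calling `check_etcd_timeouts` with no argument reads `/etc/pf9/kube.env`; passing a path checks a copy of the file instead.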

Database Size Exceeded

Bash
    
etcdserver: failed to apply request,took 2.429µs,request header:<ID:1920634987875929770 > txn:<compare:<target:MOD key:"/registry/services/endpoints/kube-system/kube-controller-manager" mod_revision:287319046 > success:<request_put:<key:"/registry/services/endpoints/kube-system/kube-controller-manager" value_size:473 >> failure:<>>,resp ,err is etcdserver: no space

Stop the pf9-hostagent and pf9-nodeletd services on the master node(s).

Bash
    
 
sudo systemctl stop pf9-{hostagent,nodeletd}

Issue a stop for the Nodelet phases.

Bash
    
 
/opt/pf9/nodelet/nodeletd phases stop

In /opt/pf9/pf9-kube/master_utils.sh, modify the function ensure_etcd_running() to add the following environment variable.

/opt/pf9/pf9-kube/master_utils.sh
    
 
--volume ${ETCD_DATA_DIR}:/var/etcd/data \
        -e ETCD_DEBUG=${DEBUG}
        -e ETCD_QUOTA_BACKEND_BYTES=<size_in_bytes>"
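Because the quota is given in bytes, it can help to compute the number rather than type it by hand. The helper below is a minimal sketch: `gib_to_bytes` is a hypothetical name, it only accepts a whole number of GiB, and the right size for your cluster is something you need to decide for your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a whole number of GiB into bytes for use as
# the <size_in_bytes> value of ETCD_QUOTA_BACKEND_BYTES.
gib_to_bytes() {
    local gib="$1"
    case "$gib" in
        # Reject empty input, non-digits, and leading zeros (parsed as octal).
        ''|*[!0-9]*|0?*)
            echo "usage: gib_to_bytes <whole number of GiB, no leading zeros>" >&2
            return 1
            ;;
    esac
    echo $((gib * 1024 * 1024 * 1024))
}
```

For example, `gib_to_bytes 8` prints `8589934592`.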

Start the pf9-hostagent service.

Bash
    
 
sudo systemctl start pf9-hostagent

Verify the size was correctly set by scraping the etcd metrics endpoint.

Bash
    
 
curl -L http://localhost:2379/metrics | grep etcd_server_quota_backend_bytes
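The metrics endpoint may report the value in scientific notation (for example `8.589934592e+09`), which is awkward to compare against the byte count you configured. The following sketch converts it to a plain integer. The function name `quota_from_metrics` is hypothetical, and it assumes the endpoint returns Prometheus-style text with one `etcd_server_quota_backend_bytes <value>` line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract etcd_server_quota_backend_bytes from
# Prometheus-format metrics on stdin and print it as a whole number of bytes.
# Exits non-zero if the metric is not present in the input.
quota_from_metrics() {
    awk '
        $1 == "etcd_server_quota_backend_bytes" {
            # The value may be in scientific notation, e.g. 8.589934592e+09.
            printf "%.0f\n", $2
            found = 1
        }
        END { exit !found }
    '
}
```

It can be combined with the command above as `curl -sL http://localhost:2379/metrics | quota_from_metrics`.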

Last updated by Chris Jones on Oct 21, 2021
