Nodelet Fails to Clean Up Data on the Node After It Is Removed From the Qbert Cluster
Problem
- Adding a reused node to the cluster fails to bring up the PMK stack at the etcd phase, with the following error in the etcd logs:
{"log":"{\"level\":\"warn\",\"ts\":\"2023-01-13T05:22:18.859Z\",\"caller\":\"etcdserver/server.go:1095\",\"msg\":\"server error\",\"error\":\"the member has been permanently removed from the cluster\"}\n","stream":"stderr","time":"2023-01-13T05:22:18.860123163Z"}
{"log":"{\"level\":\"warn\",\"ts\":\"2023-01-13T05:22:18.860Z\",\"caller\":\"etcdserver/server.go:1096\",\"msg\":\"data-dir used by this member must be removed\"}\n","stream":"stderr","time":"2023-01-13T05:22:18.860139609Z"}
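One way to confirm this failure mode is to search the etcd log for the permanent-removal message shown above. This is a sketch; the helper function is illustrative, and the log file path depends on your container runtime.

```shell
#!/usr/bin/env bash
# Sketch: check whether an etcd log file contains the "member removed" error
# from the output above. The function name and example path are illustrative.

etcd_member_removed() {
  # Returns 0 (true) if the permanent-removal message is present in the file.
  grep -q 'the member has been permanently removed from the cluster' "$1"
}

# Example usage (path is an assumption; point it at your etcd container log):
# etcd_member_removed /var/log/containers/etcd.log && echo "member was removed"
```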
Environment
- Platform9 Managed Kubernetes - v5.4 and above
Cause
- When a node is detached from the cluster, qbert will remove the corresponding etcd member of that node from the etcd cluster.
- If the etcd container is running on the node when this happens, it crashes.
- If nodelet is running a status check at this point but has not yet completed the check for the etcd phase, the etcd phase status check fails. This causes nodelet to trigger a partial restart of the stack.
- The start phase of etcd_run.sh keeps failing because the etcd member has already been removed from the etcd cluster. This phase retries 90 times with a 10-second sleep, for a total retry window of 900s, i.e. 15 minutes.
- During this time nodelet does not send any status updates to sunpike and therefore does not receive any config updates. The last status update generally reported the host state as ok, which qbert picks up and uses to mark the cluster state as success.
- Deauthorizing the node while it is in this state causes nodelet to not clean up the /var/opt/pf9/kube/ directory properly.
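The retry behavior of the etcd start phase described above can be sketched as follows. `retry_start` is a hypothetical stand-in for the logic in etcd_run.sh, not the actual nodelet code; the retry count and sleep interval come from this article.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the etcd start-phase retry loop: up to 90 attempts
# with a 10-second sleep between them, i.e. 900s (15 min) in total.
# retry_start is a hypothetical helper, NOT the actual nodelet code.

retry_start() {
  local cmd="$1" retries="${2:-90}" sleep_s="${3:-10}" i
  for ((i = 1; i <= retries; i++)); do
    if "$cmd"; then
      return 0              # phase came up
    fi
    sleep "$sleep_s"
  done
  return 1                  # still failing after retries * sleep_s seconds
}

# In the scenario above, the start command can never succeed (the member was
# removed from the etcd cluster), so the loop always runs the full 15 minutes.
```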
Resolution
- An internal JIRA ticket, PMK-5586, has been raised to fix this issue.
- The workaround is to clean up the entire /var/opt/pf9/kube directory after the node has been deauthorized and before it is onboarded again to the same or a new cluster.
- Also ensure that the value of the config parameter ETCD_ENV in /etc/pf9/kube.env is empty.
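A minimal cleanup sketch based on the two steps above, to be run as root on the deauthorized node before re-onboarding it. The variable names and the sed edit are illustrative; the paths come from this article, but verify the exact ETCD_ENV format for your PMK version before blanking it.

```shell
#!/usr/bin/env bash
# Workaround sketch: remove nodelet's leftover state after deauthorization.
# KUBE_DIR and KUBE_ENV default to the paths named in this article.
set -u

KUBE_DIR="${KUBE_DIR:-/var/opt/pf9/kube}"
KUBE_ENV="${KUBE_ENV:-/etc/pf9/kube.env}"

cleanup_pf9_kube() {
  # Remove the entire kube data directory left behind by nodelet.
  rm -rf "$KUBE_DIR"

  # Blank the ETCD_ENV value in kube.env (one way to make it empty; the
  # exact edit may differ in your environment).
  if [ -f "$KUBE_ENV" ]; then
    sed -i 's/^ETCD_ENV=.*/ETCD_ENV=""/' "$KUBE_ENV"
  fi
}

# Run on the node as root after deauthorization, before re-onboarding:
# cleanup_pf9_kube
```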