Nodelet Fails to Clean Up Data on the Node After It Is Removed From the Qbert Cluster

Problem

  • Adding a reused node to a cluster fails: the PMK stack does not come up and stops at the etcd phase, with the error below in the etcd logs.
Etcd error while bringing up the stack

Environment

  • Platform9 Managed Kubernetes - v5.4 and above

Cause

  • When a node is detached from the cluster, qbert removes that node's etcd member from the etcd cluster.
  • If the etcd container is running on the node when this happens, the container crashes.
  • If nodelet is running a status check at that moment and has not yet completed the check for the etcd phase, the etcd status check fails. This causes nodelet to trigger a partial restart of the stack.
  • The start operation of the etcd_run.sh phase then keeps failing, because the etcd member has already been removed from the etcd cluster. This phase retries for a total of 900s (90 retries with a 10s sleep), i.e. 15 minutes.
  • During this time nodelet does not send any status updates to sunpike and therefore does not receive any config updates. The last status update it sent generally reported the host state as ok; qbert picks this up and therefore sets the cluster state to success.
  • Deauthorizing the node while it is in this state causes nodelet to not clean up the /var/opt/pf9/kube/ directory properly.
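
The 15-minute window described above follows directly from the retry settings (90 retries with a 10s sleep). The Python sketch below only illustrates that arithmetic with a generic retry loop; it is not nodelet's actual code, and the function name run_with_retries and its parameters are hypothetical.

```python
import time


def run_with_retries(start, retries=90, sleep_s=10, sleep=time.sleep):
    """Call start() until it returns True or the retries are exhausted.

    Returns (succeeded, seconds_spent_sleeping).
    The defaults mirror the numbers quoted in the article; the loop itself
    is an illustration, not the real etcd phase implementation.
    """
    waited = 0
    for _ in range(retries):
        if start():
            return True, waited
        sleep(sleep_s)      # back off before the next attempt
        waited += sleep_s
    return False, waited


def always_fails():
    # Stand-in for an etcd start that can never succeed because the
    # member was already removed from the etcd cluster.
    return False
```

Calling run_with_retries(always_fails, sleep=lambda s: None) skips the real sleeps and reports a total wait of 900 seconds, which is the period during which no status update would be sent.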

Resolution

  • An internal JIRA ticket, PMK-5586, has been raised to fix this issue.
  • As a workaround, clean up the entire /var/opt/pf9/kube directory after the node has been deauthorized and before it is onboarded again to the same cluster or a new one.
  • Also ensure that the value of the config parameter ETCD_ENV is empty in /etc/pf9/kube.env
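
The workaround above can be sketched in Python as two small helpers: one that reads the ETCD_ENV value from the env file, and one that empties the kube directory. This is a minimal sketch, not a Platform9-provided tool. It assumes kube.env consists of KEY=VALUE lines (optionally prefixed with "export"), and it assumes that "cleaned up" means removing everything inside the directory while keeping the directory itself. The function names are hypothetical.

```python
import shutil
from pathlib import Path


def etcd_env_value(env_file):
    """Return the value assigned to ETCD_ENV in env_file, or None if absent.

    Assumes simple KEY=VALUE lines, optionally prefixed with 'export'.
    If the key appears more than once, the last assignment wins.
    """
    value = None
    for raw in Path(env_file).read_text().splitlines():
        line = raw.strip()
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "ETCD_ENV":
            # Drop surrounding quotes so ETCD_ENV="" counts as empty.
            value = val.strip().strip('"').strip("'")
    return value


def clean_kube_dir(kube_dir):
    """Delete every entry inside kube_dir and return the removed names."""
    removed = []
    root = Path(kube_dir)
    if not root.is_dir():
        return removed
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # real subdirectory: remove recursively
        else:
            entry.unlink()         # regular file or symlink
        removed.append(entry.name)
    return removed
```

On a deauthorized node this would be run as root, for example clean_kube_dir("/var/opt/pf9/kube"), followed by a check that etcd_env_value("/etc/pf9/kube.env") returns an empty string or None before the node is onboarded again.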