Cluster Nodes in NotReady State After Kube Certificates Expire

Problem

  • The kube certificates in the directory /etc/pf9/kube.d/certs/ are not automatically renewed.

  • pf9-nodelet.service logs the following error:

{"L":"ERROR","T":"2025-05-01T13:23:12.660Z","C":"nodelet/nodelet.go:79","M":"Failed to reconcile host: error sending status update to sunpike: rpc error: code = Unknown desc = apiserver storage error: an error on the server (\"Internal Server Error: \\\"/apis/sunpike.platform9.com/v1alpha1/hosts/ed083915-a286-483a-8913-05c579338439\\\": Unauthorized\") has prevented the request from succeeding (get hosts.sunpike.platform9.com ed083915-a286-483a-8913-05c579338439)"}

Environment

  • Platform9 Edge Cloud - v-5.3.0-2075501 and higher

Cause

  • This issue occurs only when the following conditions are met:

    • The parameters default_lease_ttl and max_lease_ttl in the file etc/pf9-vault.d/server-config.hcl on the duVM are lowered from the existing TTL of 26280h.

    • The nodelet phases are restarted on one or more nodes.

  • As a result, the certificates on the node are renewed with an expiry date based on the new default_lease_ttl. The certificates are not auto-renewed, so they become invalid once they expire.

  • This is a known bug, tracked internally as AIR-1459.
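
The effect of a lowered TTL can be made concrete with a little arithmetic. The sketch below assumes the TTL is written as a whole number of hours with an "h" suffix, as in 26280h, and that a renewed certificate expires exactly one default_lease_ttl after it is issued. The helper names, the issue time, and the 720h value are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ttl_to_timedelta(ttl: str) -> timedelta:
    """Convert an hour-based TTL string such as '26280h' to a timedelta."""
    if not ttl.endswith("h") or not ttl[:-1].isdigit():
        raise ValueError(f"unsupported TTL format: {ttl!r}")
    return timedelta(hours=int(ttl[:-1]))


def expiry_after(issued_at: datetime, ttl: str) -> datetime:
    """Expiry time of a certificate issued at issued_at with the given TTL."""
    return issued_at + ttl_to_timedelta(ttl)


issued = datetime(2025, 4, 1, tzinfo=timezone.utc)  # hypothetical issue time

# The existing TTL of 26280h is 1095 days, roughly three years.
print(ttl_to_timedelta("26280h").days)

# A hypothetical lowered TTL of 720h expires the certificate 30 days after issue.
print(expiry_after(issued, "720h"))
```

With the original TTL the certificate stays valid for about three years, while a lowered TTL brings the expiry forward to days or weeks after the nodelet restart.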

Workaround

  • The workaround is to restart the nodelet phases with the --regen-certs parameter.

  • Run the commands below in the order given.

  • Wait 5-7 minutes for the node to reconcile, then check that the node is Healthy.
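
One way to confirm that the certificates were regenerated is to read their expiry dates from the directory /etc/pf9/kube.d/certs/. The following is a rough sketch that shells out to the openssl CLI. It assumes openssl is installed on the node and that the certificates are PEM files matching "*.crt"; the file pattern and the function names are assumptions, not documented Platform9 behavior.

```python
import ssl
import subprocess
from datetime import datetime, timezone
from pathlib import Path

CERT_DIR = Path("/etc/pf9/kube.d/certs")  # directory named in this article


def parse_enddate(openssl_output: str) -> datetime:
    """Parse a line like 'notAfter=May  1 13:23:12 2025 GMT' from `openssl x509 -enddate`."""
    _, _, value = openssl_output.strip().partition("=")
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def cert_not_after(path: Path) -> datetime:
    """Ask the openssl CLI for the expiry date of a PEM certificate."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", str(path)],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_enddate(result.stdout)


def report(cert_dir: Path = CERT_DIR, pattern: str = "*.crt") -> None:
    """Print the expiry of each certificate found; the '*.crt' pattern is an assumption."""
    now = datetime.now(timezone.utc)
    for path in sorted(cert_dir.rglob(pattern)):
        expires = cert_not_after(path)
        state = "EXPIRED" if expires <= now else "valid"
        print(f"{path}: {expires:%Y-%m-%d %H:%M} UTC ({state})")
```

Calling report() on the node, with privileges sufficient to read the directory, lists each certificate with its expiry. Any entry marked EXPIRED after the workaround suggests the regeneration did not take effect.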

Additional Information

  • Contact the Platform9 Support Team with any additional questions or concerns about this bug.
