Vault Token Expired Prematurely Before Validity Period Ends

Problem

  • Nodes in a cluster were getting stuck after a reboot at the Generate certs nodeletd phase.
  • The Certificate Signing Request for the needed certificates was failing with a permission denied error.

Environment

  • Platform9 Managed Kubernetes - v5.7 and Higher
  • Platform9 Self Managed Cloud Platform - v5.9 and Higher
  • Vault

Cause

  • The cluster's Vault token had expired.
  • This is a known issue; bug PMK-6602 has been filed to track and resolve it.

Vault Token: This token is issued to each workload cluster by the pf9-vault service running on the management plane. The pf9-nodeletd service on each node uses it to request certificates from the management plane.

Validation

Steps to validate the token expiry:

  1. Exec into the pf9-vault pod in the Management Plane namespace.
  2. Export the required details.
  3. Run the token lookup command to see the token's expiry details.

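The expiry check in step 3 is the one place where the outcome of this whole article can be caught early: a token that is about to expire can be renewed before nodes get stuck at the cert-generation phase. The following script is an added example, not Platform9's own code or part of the original article. It reads the JSON that `vault token lookup -format=json` prints (the field names `expire_time`, `ttl`, `renewable` and `display_name` are assumed from Vault's token-lookup output; the sample below is constructed) and classifies the token as non-expiring, expired, expiring soon, or OK, so it can be dropped into a periodic health check.

```python
import json
import re
import sys
from datetime import datetime, timezone

# Constructed sample in the shape of `vault token lookup -format=json` output.
# Field names are assumed from Vault's token-lookup API; values are placeholders.
SAMPLE_LOOKUP = json.dumps({
    "data": {
        "accessor": "ACCESSOR-PLACEHOLDER",
        "creation_ttl": 2592000,
        "display_name": "token-cluster-example",
        "expire_time": "2024-06-01T10:00:00.123456789Z",
        "renewable": True,
        "ttl": 86400,
        "policies": ["default", "cluster-example"],
    }
})

# Constructed sample of a token with no expiry (Vault reports expire_time as null).
NON_EXPIRING_LOOKUP = json.dumps({
    "data": {
        "accessor": "ACCESSOR-PLACEHOLDER-2",
        "creation_ttl": 0,
        "display_name": "token-never-expires",
        "expire_time": None,
        "renewable": False,
        "ttl": 0,
        "policies": ["root"],
    }
})

# Vault prints RFC3339 timestamps with nanosecond precision; Python parses at most microseconds.
_FRACTION = re.compile(r"\.(\d{6})\d+")


def parse_expire_time(value):
    """Parse an RFC3339 timestamp into an aware UTC datetime; None means 'never expires'."""
    if not value:
        return None
    value = _FRACTION.sub(r".\1", value)       # trim nanoseconds to microseconds
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"           # fromisoformat needs an explicit offset
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def token_status(lookup_json, now=None, warn_days=7):
    """Classify a token lookup result relative to `now` (datetime or RFC3339 string)."""
    if now is None:
        now = datetime.now(timezone.utc)
    elif isinstance(now, str):
        now = parse_expire_time(now)
    data = json.loads(lookup_json)["data"]
    expires_at = parse_expire_time(data.get("expire_time"))
    result = {
        "display_name": data.get("display_name"),
        "renewable": bool(data.get("renewable", False)),
        "expires_at": expires_at.isoformat() if expires_at else None,
        "days_left": None,
        "state": "non-expiring",
    }
    if expires_at is None:
        return result
    days_left = (expires_at - now).total_seconds() / 86400.0
    result["days_left"] = round(days_left, 2)
    if days_left <= 0:
        result["state"] = "expired"
    elif days_left <= warn_days:
        result["state"] = "expiring-soon"
    else:
        result["state"] = "ok"
    return result


if __name__ == "__main__":
    # Pipe `vault token lookup -format=json` into this script, or run it on the sample.
    raw = sys.stdin.read() if not sys.stdin.isatty() else SAMPLE_LOOKUP
    status = token_status(raw)
    print(json.dumps(status, indent=2))
    # Non-zero exit makes the check usable from cron or a monitoring job.
    sys.exit(0 if status["state"] in ("ok", "non-expiring") else 1)
```

The warning window (`warn_days`) should be longer than the time it takes to open a support ticket and run the workaround, since the whole point is to renew the token while pf9-nodeletd can still use it.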

Workaround

To fix this issue, renew the Vault Token for the problematic cluster and update all hosts with the new Token.

Document each step as it is executed, so that there is a clear and complete record of the process.

  • For PMK (SaaS), the Platform9 support team will apply the steps below. Please open a support ticket.
  • For SMCP (air-gapped), perform the steps below from the management plane cluster.

Step 1: Exec into the pf9-vault pod in the Management Plane namespace.

Step 2: Export the required details.

Step 3: Generate a new token.

Step 4: Update the new token in the qbert database and exit the pf9-vault pod.

Step 5: Verify that the new token is updated at both the cluster and node levels.

Step 6: If the token in Sunpike does not match the token in Qbert, patch the Sunpike host object.

Step 7: Perform a full-stack restart on any nodes that got stuck at the cert generation phase.

Step 8 (optional): Revoke the old token, but only once all nodes are confirmed to be working.

Additional Information

  • An internal bug, PMK-6602, has been filed to track this issue. For more details, contact the Platform9 Support Team and quote the bug ID.