Vouch-Renew-Token job Fails with Vault Permission Denied – Renewing Expired Vault admin_token

Problem

The Vouch authentication services in a PCD environment are stuck in a failing state because the admin_token used by the automatic renewal process has expired. Without a valid admin_token, the renewer cannot refresh Vouch's other credentials, and the services cannot recover on their own.

This article describes how to manually replace the expired admin_token so that automatic renewal can resume.

Environment

  • Self-Hosted Private Cloud Director Virtualization - v2025.7-47 and later

  • Private Cloud Director Virtualization - v2025.7-47 and later

  • Component: Vault

Cause

Vouch authenticates to Vault using a chain of two tokens:

| Token | Purpose | Stored in Consul at |
| --- | --- | --- |
| host_signing_token (leaf) | Used by the running vouch service to sign certificates and validate authentication requests at runtime. | customers/<CUSTOMER_ID>/regions/<REGION_UUID>/services/vouch/vault/host_signing_token |
| admin_token (parent) | Used by the vouch-renew-token CronJob to authenticate to Vault and mint a fresh host_signing_token when the leaf one nears expiry. | customers/<CUSTOMER_ID>/vault_servers/dev/admin_token (and per-region copies; see Workaround) |

When the parent admin_token expires, the CronJob's first authenticated call to Vault - the request to create the host signing policy - is rejected with a permission-denied error. The renewer can no longer mint a new host_signing_token. The leaf token then ages out, and the vouch pods enter CrashLoopBackOff because every authenticated call upstream is rejected with 403.

The auto-renewal protection delivered by the KB article "Vouch-Noauth And Vouch-Keystone Pods Are Not Ready Due To Token Expiry" covers only the leaf host_signing_token. Rotation/generation of the parent admin_token is not currently covered, which is what causes this chained-expiry condition.

Diagnostics

Pre-requisites

For SaaS customers, reach out to Platform9 Support.

For Self Hosted PCD customers, follow the steps below:

  • Shell access to a PCD control plane node, with kubectl and airctl already configured.

  • Get CUSTOMER_ID of the deployment by running the command below from the control plane node:

  • Get REGION_UUID for each affected region using:

  • To list the REGION_UUID of every affected region in one go:

The first column is the REGION_NS (region namespace) and the second is the REGION_UUID. Keep these values handy and substitute them into the commands below wherever <CUSTOMER_ID>, <REGION_UUID>, or <REGION_NS> appears.

Run the steps below in order. Each one is a separate check; together they confirm the issue is the chained admin_token expiry.

1

Check the overall region health

If this issue applies, you will see region health: ⚠️ Not Ready and the ready-services count below the desired count (for example 28/30, 83/85):

2

Check the state of the vouch pods

List the vouch pods across all namespaces:

The vouch-keystone and vouch-noauth pods will be stuck with a partial Ready ratio (1/2 and 2/3), in CrashLoopBackOff, with a high RESTARTS count:
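As a minimal sketch of this check, the listing can be narrowed to the vouch pods whose READY ratio is incomplete. The function name is hypothetical, and the filter assumes the pod names begin with vouch-keystone or vouch-noauth as described above.

```shell
# Hypothetical helper: list vouch-keystone / vouch-noauth pods whose
# READY column (for example 1/2) shows fewer ready containers than desired.
# With -A the columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
list_unready_vouch_pods() {
  kubectl get pods -A --no-headers |
    awk '$2 ~ /^vouch-(keystone|noauth)/ {
           split($3, ready, "/")
           if (ready[1] != ready[2]) print $1, $2, $3, $4, $5
         }'
}

# Usage: list_unready_vouch_pods
```

Each printed row is namespace, pod name, READY ratio, status, and restart count. An empty result suggests the vouch pods are healthy and this article may not apply.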

3

Check the status of the vouch-renew-token job

List the renewal jobs in the affected region's namespace:

If the auto-renewer ran recently (or was manually triggered as part of the standard host_signing_token recovery procedure), you will see one or more rows with status Failed 0/1:

4

Review the renewal job logs for the Vault 403 error

Find the failed renewal job's pod:

View the pod's logs (replace <RENEWER_POD_NAME> with the pod name from the previous command):

The log will end with Vault rejecting a policy-creation request:

If all four checks above match, the vouch services cannot recover on their own because the credential the renewer needs to perform the recovery has itself expired. Continue with the additional diagnostic steps below to confirm the admin_token has expired before applying the workaround.

5

Confirm the renewer is failing with the policy-creation 403

Find the renewer pod:

View the pod logs:

If the log shows 403 Client Error: Forbidden against the Vault sys/policy endpoint, continue with Step 6 and Step 7 to read the admin_token and validate it against Vault.
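The log check can be scripted as a rough sketch. The helper name is hypothetical, and it only looks for the two strings named in this step (403 Client Error: Forbidden and sys/policy); the exact log layout is not assumed.

```shell
# Hypothetical helper: reads renewer pod logs on stdin and succeeds only
# if they contain the 403 error and also mention the sys/policy endpoint.
log_shows_policy_403() {
  logs="$(cat)"
  printf '%s\n' "$logs" | grep -q '403 Client Error: Forbidden' &&
    printf '%s\n' "$logs" | grep -q 'sys/policy'
}

# Usage (pod name taken from the previous command):
#   kubectl logs -n <REGION_NS> <RENEWER_POD_NAME> | log_shows_policy_403 \
#     && echo "policy-creation 403 confirmed"
```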

6

Read the admin_token from Consul

The renewer reads its admin_token from the Consul key-value store. Reading that value requires a Consul ACL token, which is stored in the airctl state file on the control plane node:

Expected: a non-zero length is printed (typical Consul tokens are 36 characters). If length is 0, the file path is wrong or the key isn't present - check ${HOME}/.airctl/state.yaml manually before continuing.

Now read the admin_token from inside the Consul server pod using the consul kv get command:

The output is a single hvs.[..] string - that's the admin_token. Type exit to leave the Consul pod when done.
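For reference, the customer-level key can be assembled from the path given in the Cause section. This is a sketch only: the function name is hypothetical, and it assumes the Consul ACL token has already been exported as CONSUL_HTTP_TOKEN inside the pod.

```shell
# Build the customer-level Consul KV path of the admin_token
# (path layout taken from the Cause section of this article).
admin_token_path() {
  printf 'customers/%s/vault_servers/dev/admin_token\n' "$1"
}

# Usage, inside the Consul server pod with CONSUL_HTTP_TOKEN exported:
#   consul kv get "$(admin_token_path <CUSTOMER_ID>)"
```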

7

Validate the admin_token against Vault

Log in to a Vault pod that is in the Running state:

Set the Vault address and token, then look up the token:

If Error looking up token: ... Code: 403. Errors: * permission denied is printed, the admin_token is expired or revoked. Proceed to the Workaround.

A table with a positive ttl and a future expire_time means the admin_token is valid, and the renewer's 403 is a different problem (most likely the policy attached to the token has been narrowed). In that case, open a Platform9 support ticket and do NOT run the workaround.
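The two outcomes above can be told apart with a small helper, shown here as a sketch. The function name and its output labels are hypothetical; the Vault address is a placeholder because it depends on the deployment.

```shell
# Hypothetical helper: classify the text printed by `vault token lookup`.
classify_token_lookup() {
  out="$(cat)"
  case "$out" in
    *"permission denied"*) echo "expired-or-revoked" ;;
    *expire_time*)         echo "lookup-succeeded" ;;
    *)                     echo "unknown" ;;
  esac
}

# Usage, inside the Vault pod:
#   export VAULT_ADDR=<VAULT_ADDRESS>
#   export VAULT_TOKEN=<ADMIN_TOKEN>
#   vault token lookup 2>&1 | classify_token_lookup
```

A result of lookup-succeeded only means Vault accepted the token; the ttl and expire_time values in the table still need to be read manually.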

  • The goal is to confirm the admin_token has actually expired before doing anything destructive.

Workaround

1

Perform the Pre-requisites

2

Mint a new admin_token in Vault

Generate a fresh admin_token in Vault with a long lifetime using vault token create from inside the Vault pod. The flags request a 768-hour (32-day) TTL and a renewable period.

The output is a JSON block. The value of the auth.client_token field, which starts with [hvs...], is the new admin_token. It is referred to as <NEW_ADMIN_TOKEN> in the steps below.
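If copying the value out of the JSON by hand is error-prone, a sketch like the following can pull it out. The helper name is hypothetical, and the flags for vault token create are deliberately left as a placeholder; use the ones from this step.

```shell
# Hypothetical helper: print auth.client_token from the JSON on stdin.
# Uses sed so that no extra tooling (such as jq) is needed in the pod.
extract_client_token() {
  sed -n 's/.*"client_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage, inside the Vault pod:
#   vault token create -format=json <FLAGS_FROM_THIS_STEP> | extract_client_token
```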

3

Write the new token to all required Consul KV paths

The new admin_token has to be written to multiple Consul paths. All paths must be updated. NOTE: Partial writes will leave the environment in a worse state than before.

Exec into the Consul server pod and export the Consul ACL token

Inside the Consul server pod

4

Write the new token. The first path is customer-level (one entry per customer ID). The next two are per-region - repeat them for every affected region under this customer:

Each successful put prints Success! Data written to: <path>.

Type exit to leave the Consul pod.

5

Update the deccaxon Kubernetes secret in every relevant namespace

The new token also has to be written into the VaultToken field of a Kubernetes secret called deccaxon. This secret exists in several namespaces and each copy must be updated.

The value stored in the secret must be base64-encoded (Kubernetes Secrets store values base64-encoded). Generate the encoded string:

The output is a single line of base64 text. Copy it.
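One way to make sure the encoded value really is a single line, sketched under the assumption that a POSIX shell with base64 and tr is available (the function name is hypothetical):

```shell
# Encode a token as one line of base64. printf avoids the trailing
# newline that echo would add to the input, and tr removes the line
# wrapping that some base64 implementations apply to long input.
encode_token() {
  printf '%s' "$1" | base64 | tr -d '\n'
  echo
}

# Usage: encode_token '<NEW_ADMIN_TOKEN>'
```

A stray newline inside the encoded token would be stored as part of the secret value, so it is worth avoiding.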

List the namespaces where the deccaxon secret exists. It typically exists in the kplane namespace, the customer's infra namespace, and every region namespace:
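A minimal sketch of such a listing, assuming kubectl is configured on the control plane node (the function name is hypothetical):

```shell
# Hypothetical helper: print every namespace that holds a secret named
# deccaxon. With -A the columns are NAMESPACE NAME TYPE DATA AGE.
list_deccaxon_namespaces() {
  kubectl get secrets -A --no-headers | awk '$2 == "deccaxon" { print $1 }'
}

# Usage: list_deccaxon_namespaces
```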

For each namespace listed above, edit the secret:

Find the line that reads VaultToken: <SOME_BASE64_ENCODED_STRING>. Replace its value with the base64-encoded string from earlier.

Repeat for every namespace where the secret exists.
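Where interactive editing is impractical, a non-interactive alternative is a merge patch. This is an assumption about what suits the environment, not part of the documented procedure, and the helper name is hypothetical. It relies only on the VaultToken field name given above.

```shell
# Hypothetical helper: build a JSON merge patch that sets data.VaultToken
# to an already base64-encoded value.
vault_token_patch() {
  printf '{"data":{"VaultToken":"%s"}}\n' "$1"
}

# Usage, once per namespace that holds the secret:
#   kubectl patch secret deccaxon -n <NAMESPACE> --type merge \
#     -p "$(vault_token_patch <BASE64_ENCODED_TOKEN>)"
```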

6

Re-run the vouch-renew-token job in each affected region

Now that the admin_token is valid, trigger the renewal job again. Run this for every affected region:

The $(date +%s) suffix ensures each manual job has a unique name (Kubernetes will not let you create two jobs with the same name). Watch the job until it reaches Complete 1/1. Press Ctrl+C once you see it complete:

If the job stays in Failed 0/1 or never reaches Complete, capture the renewer pod logs and contact Platform9 support before proceeding to the next step.
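As an illustration of the unique-name pattern described in this step, the sketch below assumes the CronJob is named vouch-renew-token, as in the Cause section. The vouch-renew-token-manual prefix and the function name are hypothetical.

```shell
# Hypothetical helper: build a job name that is unique per invocation
# by appending the current Unix timestamp.
manual_renew_job_name() {
  printf 'vouch-renew-token-manual-%s\n' "$(date +%s)"
}

# Usage, once per affected region:
#   kubectl create job --from=cronjob/vouch-renew-token \
#     "$(manual_renew_job_name)" -n <REGION_NS>
#   kubectl get jobs -n <REGION_NS> -w
```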

7

Validate the renewed host_signing_token

Confirm the renewer wrote a valid host_signing_token by reading it from Consul and looking it up against Vault:

Look for positive ttl and a future expire_time. Repeat this for every affected region. If any region's lookup still returns permission denied, stop and contact Platform9 support.
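Because this check is repeated per region, the per-region key can be generated from the path given in the Cause section. This is a sketch with a hypothetical function name; the Vault address is a placeholder.

```shell
# Build the per-region Consul KV path of the host_signing_token
# (path layout taken from the Cause section of this article).
host_signing_token_path() {
  printf 'customers/%s/regions/%s/services/vouch/vault/host_signing_token\n' "$1" "$2"
}

# Usage:
#   consul kv get "$(host_signing_token_path <CUSTOMER_ID> <REGION_UUID>)"   # in the Consul pod
#   VAULT_ADDR=<VAULT_ADDRESS> VAULT_TOKEN=<HOST_SIGNING_TOKEN> vault token lookup   # in the Vault pod
```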

8

Restart the vouch deployments to pick up the new tokens

The running vouch pods read their tokens at startup, so they need to be restarted to pick up the renewed values:

Run this for every affected region.
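A sketch of the per-region restart follows. It assumes the deployments are named vouch-keystone and vouch-noauth, matching the pod names seen in Diagnostics; confirm the names with kubectl get deployments before relying on it. The function name is hypothetical.

```shell
# Hypothetical helper: restart both vouch deployments in one region
# namespace. Deployment names are assumed from the pod names.
restart_vouch() {
  for deploy in vouch-keystone vouch-noauth; do
    kubectl rollout restart "deployment/$deploy" -n "$1"
  done
}

# Usage: restart_vouch <REGION_NS>
```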

9

Verify recovery

Watch for the vouch pods to come back to a healthy state:

Then confirm the overall region health:

Ensure that every region shows full service counts (e.g. 30/30 ready, 85/85 ready) and region health: Ready.
