Vouch-Renew-Token Job Fails with Vault Permission Denied – Renewing Expired Vault admin_token
Problem
The Vouch authentication services in a PCD environment are stuck in a failing state because the admin_token used by the automatic renewal process has expired. Without a valid admin_token, the renewer cannot refresh Vouch's other credentials, and the services cannot recover on their own.
This article describes how to manually replace the expired admin_token so that automatic renewal can resume.
Environment
Self-Hosted Private Cloud Director Virtualization - v2025.7-47 and later
Private Cloud Director Virtualization - v2025.7-47 and later
Component: Vault
Cause
Vouch authenticates to Vault using a chain of two tokens:
host_signing_token (leaf)
Used by the running vouch service to sign certificates and validate authentication requests at runtime.
customers/<CUSTOMER_ID>/regions/<REGION_UUID>/services/vouch/vault/host_signing_token
admin_token (parent)
Used by the vouch-renew-token CronJob to authenticate to Vault and mint a fresh host_signing_token when the leaf one nears expiry.
customers/<CUSTOMER_ID>/vault_servers/dev/admin_token (and per-region copies — see Workaround)
When the parent admin_token expires, the CronJob's first authenticated call to Vault - the request to create the host signing policy - is rejected with a permission-denied error. The renewer can no longer mint a new host_signing_token. The leaf token then ages out, and the vouch pods enter CrashLoopBackOff because every authenticated call upstream is rejected with 403.
The auto-renewal protection delivered by the KB "Vouch-Noauth And Vouch-Keystone Pods Are Not Ready Due To Token Expiry" covers only the leaf host_signing_token. Rotation or regeneration of the parent admin_token is not currently covered, which is what causes this chained-expiry condition.
Diagnostics
Pre-requisites
For SaaS customers, reach out to Platform9 Support.
For Self Hosted PCD customers, follow the steps below:
Shell access to a PCD control plane node, with kubectl and airctl already configured.
Get the CUSTOMER_ID of the deployment by running the command below from the control plane node:
Get REGION_UUID for each affected region using:
To list the REGION_UUID of every affected region in one go:
The first column is the REGION_NS (region namespace), the second is the REGION_UUID. Keep these values handy and substitute them into the commands below wherever <CUSTOMER_ID>, <REGION_UUID>, or <REGION_NS> appears.
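As a small illustration of that substitution, the sketch below turns two-column "REGION_NS REGION_UUID" lines into the per-region host_signing_token path given in the Cause section. The helper name host_signing_paths is hypothetical and not part of PCD; the path layout is the one quoted in this article.

```shell
# Sketch: read "REGION_NS REGION_UUID" pairs on stdin and print, for each region,
# the namespace followed by the Consul KV path of its host_signing_token.
# host_signing_paths is a hypothetical helper name used only for illustration.
host_signing_paths() {
  customer_id="$1"
  while read -r region_ns region_uuid; do
    # Skip blank lines.
    [ -n "$region_ns" ] || continue
    printf '%s customers/%s/regions/%s/services/vouch/vault/host_signing_token\n' \
      "$region_ns" "$customer_id" "$region_uuid"
  done
}
```

Piping the saved two-column region list into host_signing_paths <CUSTOMER_ID> prints one line per region, which can be reused when validating the renewed tokens later.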
Run the steps below in order. Each one is a separate check; together they confirm the issue is the chained admin_token expiry.
Review the renewal job logs for the Vault 403 error
Find the failed renewal job's pod:
View the pod's logs (replace <RENEWER_POD_NAME> with the pod name from the previous command):
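One possible way to carry out the two steps above is sketched here. It assumes the renewer pod names contain the string vouch-renew-token (the CronJob name used in this article); find_renewer_pod is a hypothetical helper name, and the exact commands used in your environment may differ.

```shell
# Sketch: print the name of the most recently created renewer pod in a region namespace.
# Assumption: renewer pod names contain "vouch-renew-token".
find_renewer_pod() {
  region_ns="$1"
  kubectl get pods -n "$region_ns" \
    --sort-by=.metadata.creationTimestamp \
    --no-headers -o custom-columns=NAME:.metadata.name \
    | grep 'vouch-renew-token' | tail -n 1
}
```

The returned name can then be passed to kubectl logs -n <REGION_NS> <RENEWER_POD_NAME>.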
The log will end with Vault rejecting a policy-creation request:
If the checks above match, the vouch services cannot recover on their own because the credential the renewer needs to perform the recovery has itself expired. Continue with the additional diagnostic steps below to confirm the admin_token has expired before applying the workaround.
Read the admin_token from Consul
The renewer reads its admin_token from a Consul key-value store. Reading that value requires a Consul ACL token, which is stored in the airctl state file on the control plane node:
Expected: a non-zero length is printed (typical Consul tokens are 36 characters). If length is 0, the file path is wrong or the key isn't present - check ${HOME}/.airctl/state.yaml manually before continuing.
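Beyond the length check, a quick shape check can catch a value copied from the wrong key. The sketch below assumes the ACL token is in the usual UUID form (36 characters including hyphens); looks_like_consul_token is a hypothetical helper, and a deployment using custom token strings would not match it.

```shell
# Sketch: succeed only if the argument has the 8-4-4-4-12 hexadecimal UUID shape.
# Assumption: the Consul ACL token is a UUID, which matches the 36-character length noted above.
looks_like_consul_token() {
  printf '%s\n' "$1" \
    | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}
```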
Now read the admin_token from inside the Consul server pod using the consul kv get command:
The output is a single hvs.[..] string - that's the admin_token. Type exit to leave the Consul pod when done.
Validate the admin_token against Vault
Log in to the Vault pod that is in the Running state:
Set the Vault address and token, then look up the token:
If the command prints Error looking up token: ... Code: 403. Errors: * permission denied, the admin_token is expired or revoked. Proceed to the Workaround.
If it prints a table with a positive ttl and a future expire_time, the admin_token is valid and the renewer's 403 has a different cause (most likely the policy attached to the token has been narrowed). Open a Platform9 support ticket; do NOT run the workaround.
The goal is to confirm the admin_token has actually expired before doing anything destructive.
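The decision above can be sketched as a small shell function. It assumes VAULT_ADDR is already exported inside the Vault pod and relies only on vault token lookup and the permission denied text quoted above; classify_admin_token is a hypothetical helper name.

```shell
# Sketch: classify a token as valid, expired-or-revoked, or unknown.
# Assumption: VAULT_ADDR is already set in the current shell.
classify_admin_token() {
  token="$1"
  if output=$(VAULT_TOKEN="$token" vault token lookup 2>&1); then
    # Lookup succeeded, so Vault still recognises the token.
    echo "valid"
  elif printf '%s\n' "$output" | grep -q 'permission denied'; then
    # Matches the 403 permission-denied case described above.
    echo "expired-or-revoked"
  else
    # Any other failure (network, wrong address) needs manual investigation.
    echo "unknown"
  fi
}
```

Only the expired-or-revoked result justifies continuing to the Workaround; a valid or unknown result should be investigated first.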
Workaround
Complete the Pre-requisites described under Diagnostics.
Mint a new admin_token in Vault
Generate a fresh admin_token in Vault with a long lifetime using vault token create from inside the Vault pod. The flags request a 768-hour (32-day) TTL and a renewable period.
The output is a JSON block. The value of the auth.client_token field, which starts with hvs., is the new admin_token. It is referred to as <NEW_ADMIN_TOKEN> in the steps below.
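To avoid copying the token by hand, the client_token value can be pulled out of the JSON with standard tools. The sketch below uses sed rather than jq, on the assumption that jq may not be installed in the pod; extract_client_token is a hypothetical helper name.

```shell
# Sketch: read Vault's JSON output on stdin and print the client_token value.
# Uses sed so that no extra tooling is required; only the first match is printed.
extract_client_token() {
  sed -n 's/.*"client_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

The JSON produced by vault token create with -format=json can be piped into extract_client_token, and the result stored as <NEW_ADMIN_TOKEN>.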
Write the new token to all required Consul KV paths
The new admin_token has to be written to multiple Consul paths. All paths must be updated. NOTE: Partial writes will leave the environment in a worse state than before.
Exec into the Consul server pod and export the Consul ACL token
Inside the Consul server pod
Write the new token. The first path is customer-level (one entry per customer ID). The next two are per-region - repeat them for every affected region under this customer:
Each successful put prints Success! Data written to: <path>.
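Because a partial write is worse than none, it can help to write and read back every path in one pass and stop at the first problem. The following is a minimal sketch, assuming CONSUL_HTTP_TOKEN is already exported in the Consul pod; put_token_everywhere is a hypothetical helper, and the list of paths must be supplied by the operator from the steps above.

```shell
# Sketch: write the token to every given KV path, read each one back,
# and stop at the first write failure or mismatch.
# Assumption: CONSUL_HTTP_TOKEN is already exported in this shell.
put_token_everywhere() {
  new_token="$1"
  shift
  for path in "$@"; do
    consul kv put "$path" "$new_token" >/dev/null \
      || { echo "FAILED write: $path" >&2; return 1; }
    # Read the value back to confirm the stored token matches.
    [ "$(consul kv get "$path")" = "$new_token" ] \
      || { echo "MISMATCH: $path" >&2; return 1; }
    echo "ok: $path"
  done
}
```

A non-zero exit status means at least one path was not updated and the remaining paths were not attempted, so the run should be repeated after the cause is fixed.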
Type exit to leave the Consul pod.
Update the deccaxon Kubernetes secret in every relevant namespace
The new token also has to be written into the VaultToken field of a Kubernetes secret called deccaxon. This secret exists in several namespaces and each copy must be updated.
The value stored in the secret must be base64-encoded (Kubernetes Secrets store values base64-encoded). Generate the encoded string:
The output is a single line of base64 text. Copy it.
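A minimal sketch of the encoding step follows, using a placeholder value in place of the real token. It uses printf rather than echo so that no trailing newline is encoded into the secret, and strips line wrapping from the base64 output.

```shell
# Placeholder only: substitute the real <NEW_ADMIN_TOKEN> here.
NEW_ADMIN_TOKEN='hvs.EXAMPLE_PLACEHOLDER'

# printf '%s' avoids the trailing newline that echo would add;
# tr removes any line wrapping so the result is a single line.
ENCODED=$(printf '%s' "$NEW_ADMIN_TOKEN" | base64 | tr -d '\n')
echo "$ENCODED"
```

Decoding the result with base64 --decode should return exactly the original token, with no extra characters.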
List the namespaces where the deccaxon secret exists. It typically exists in the kplane namespace, the customer's infra namespace, and every region namespace.
For each namespace listed above, edit the secret:
Find the line that reads VaultToken: <SOME_BASE64_ENCODED_STRING>. Replace its value with the base64-encoded string from earlier.
Repeat for every namespace where the secret exists.
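As a non-interactive alternative to editing each copy by hand, the listing and update could be scripted along these lines. This is a sketch under assumptions, not the documented procedure: it uses kubectl patch instead of kubectl edit, and list_deccaxon_namespaces and patch_deccaxon are hypothetical helper names.

```shell
# Sketch: print every namespace that contains a secret named deccaxon.
list_deccaxon_namespaces() {
  kubectl get secrets --all-namespaces \
    --field-selector metadata.name=deccaxon \
    --no-headers -o custom-columns=NAMESPACE:.metadata.namespace
}

# Sketch: set the VaultToken field to the given base64 string in every copy.
# Assumption: the argument is already base64-encoded.
patch_deccaxon() {
  encoded_token="$1"
  list_deccaxon_namespaces | while read -r ns; do
    [ -n "$ns" ] || continue
    # </dev/null keeps kubectl from consuming the namespace list on stdin.
    kubectl patch secret deccaxon -n "$ns" --type merge \
      -p "{\"data\":{\"VaultToken\":\"$encoded_token\"}}" </dev/null \
      || echo "FAILED: $ns" >&2
  done
}
```

Any FAILED line names a namespace whose copy still holds the old token and must be fixed before continuing.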
Re-run the vouch-renew-token job in each affected region
Now that the admin_token is valid, trigger the renewal job again. Run this for every affected region:
The $(date +%s) suffix ensures each manual job has a unique name (Kubernetes will not let you create two jobs with the same name). Watch the job until it reaches Complete 1/1. Press Ctrl+C once you see it complete:
If the job stays in Failed 0/1 or never reaches Complete, capture the renewer pod logs and contact Platform9 support before proceeding to the next step.
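For reference, a manual run can be created from the CronJob and waited on as sketched below. The job name prefix vouch-renew-token-manual- and the 300-second timeout are assumptions chosen for illustration, rerun_renewer is a hypothetical helper, and kubectl wait is used here in place of watching the job interactively.

```shell
# Sketch: create a uniquely named job from the vouch-renew-token CronJob,
# wait for it to complete, and print the job name on success.
# Assumptions: the name prefix and the timeout are illustrative values.
rerun_renewer() {
  region_ns="$1"
  job_name="vouch-renew-token-manual-$(date +%s)"
  kubectl create job "$job_name" --from=cronjob/vouch-renew-token \
    -n "$region_ns" >&2 || return 1
  # Returns non-zero if the job has not completed before the timeout.
  kubectl wait --for=condition=complete "job/$job_name" \
    -n "$region_ns" --timeout=300s >&2 || return 1
  echo "$job_name"
}
```

Note that kubectl wait on the complete condition does not return early when a job fails; it simply times out, so a non-zero result should be followed by a check of the renewer pod logs.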
Validate the renewed host_signing_token
Confirm the renewer wrote a valid host_signing_token by reading it from Consul and looking it up against Vault:
Look for a positive ttl and a future expire_time. Repeat this for every affected region. If any region's lookup still returns permission denied, stop and contact Platform9 support.
