Vouch-Noauth and Vouch-Keystone Pods Are Not Ready Due to Token Expiry
Problem
The vouch-noauth and vouch-keystone pods are not in a ready state in both the Infra and Workload regions. This prevents the environments from being fully operational and has stalled the upgrade.
Environment
- Self-Hosted Private Cloud Director Virtualization - v2025.2 to v2025.6
Cause
- The Vouch token stored in Consul has expired and was not renewed automatically by the vouch-renew-token cronjob.
- The issue has been reported as a bug, tracked by the Platform Engineering team under ID PCD-1468; the fix has been released in the July release.
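To confirm that the renewal did not run, you can inspect the cronjob and any Jobs it has spawned; this is a hedged check using standard kubectl commands, with <AFFECTED_NS> as a placeholder for the affected region's namespace:
# List the renewal cronjob and any Jobs it has created
$ kubectl get cronjob vouch-renew-token -n <AFFECTED_NS>
$ kubectl get jobs -n <AFFECTED_NS> | grep vouch-renew-token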
Diagnostics
- The vouch-keystone and vouch-noauth pods are not ready:
$ kubectl get pods --all-namespaces | grep vouch
[INFRA_NS] vouch-keystone-POD 1/2 Running 0 3h
[INFRA_NS] vouch-noauth-POD 2/3 Running 0 3h
[WORKLOAD_NS] vouch-keystone-POD 1/2 Running 0 3h
[WORKLOAD_NS] vouch-noauth-POD 2/3 Running 0 3h
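To see which container inside an affected pod is failing its readiness probe, describe the pod; kubectl describe is a standard command, and the pod and namespace names below are placeholders taken from the listing above:
# Check the Events section and container statuses for the failing readiness probe
$ kubectl describe pod <vouch-keystone-POD> -n <AFFECTED_NS>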
- Perform the cURL Test
Steps:
- Exec into the vouch-keystone pod and get the Vault token from vouch-keystone.conf
$ kubectl exec -it -n <AFFECTED_NS> <vouch-keystone-POD> -- bash
[vouch-keystone-POD>]$ grep vault_token /etc/vouch/vouch-keystone.conf | awk '{ print $2 }'
- Run the cURL command after replacing <TOKEN> with the token from the above output
$ curl --header "X-Vault-Token: <TOKEN>" "http://decco-vault-active.default.svc.cluster.local:8200/v1/auth/token/lookup-self" -v
Host decco-vault-active.default.svc.cluster.local:8200 was resolved.
...
...
{"errors":["permission denied"]}
* Connection #0 to host decco-vault-active.default.svc.cluster.local left intact
If the token has expired, the output will indicate "permission denied", as shown above.
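For comparison, a lookup against a still-valid token returns HTTP 200 with token metadata; the JSON below is an abbreviated, illustrative example of Vault's lookup-self response, not output captured from this environment:
$ curl --header "X-Vault-Token: <TOKEN>" "http://decco-vault-active.default.svc.cluster.local:8200/v1/auth/token/lookup-self"
{"request_id":"...","data":{"accessor":"...","display_name":"token","expire_time":"...","renewable":true,"ttl":2764800}}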
Resolution
- Upgrade to the Self-Hosted Private Cloud Director July release or a later version.
Workaround
- Manually renew the expired Vault token stored in Consul so that the Vouch pods can authenticate again.
Steps:
- Get the CONSUL_HTTP_TOKEN from the airctl host (the host where the airctl state file is present)
$ grep consulToken ${HOME}/.airctl/state.yaml | cut -d' ' -f2
- Exec into the decco-consul-consul-server pod in the default namespace
$ kubectl exec -it -n default decco-consul-consul-server-0 -- sh
- Export the CONSUL_HTTP_TOKEN from step 1 inside the decco-consul-consul-server pod
[decco-consul-consul-server-0]$ export CONSUL_HTTP_TOKEN="<TOKEN>"
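Before changing any keys, you can sanity-check that the exported token is accepted; consul members is a standard Consul CLI command, used here only as a hedged read-only test:
# A read-only command; with ACLs enabled it fails if CONSUL_HTTP_TOKEN is invalid
[decco-consul-consul-server-0]$ consul members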
Each of the following commands produces one line of output per region present in the environment.
- Retrieve region UUIDs.
[decco-consul-consul-server-0]$ consul kv get -recurse | grep region_uuid
- The <REGION_UUID> uniquely identifies each region; use it to target the affected region(s) in the commands below.
region_fqdns/example-infra.platform9.localnet/region_uuid:<REGION_UUID>
region_fqdns/example-workload.platform9.localnet/region_uuid:<REGION_UUID>
- Retrieve existing tokens
[decco-consul-consul-server-0]$ consul kv get -recurse | grep host_signing_token
customers/<CUSTOMER_ID>/regions/<REGION_UUID>/services/vouch/vault/host_signing_token:hvs.<TOKEN>
- Delete the existing token for the specified affected region(s).
[decco-consul-consul-server-0]$ consul kv delete customers/<CUSTOMER_ID>/regions/<REGION_UUID>/services/vouch/vault/host_signing_token
Success! Deleted key: customers/<CUSTOMER_ID>/regions/<REGION_UUID>/services/vouch/vault/host_signing_token
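If more than one region is affected, the same deletion can be looped over each region UUID; the loop below is a hypothetical convenience sketch, with <CUSTOMER_ID> and the <REGION_UUID_*> placeholders to be replaced with the values retrieved above:
# Hypothetical loop; substitute the real customer ID and affected region UUIDs
[decco-consul-consul-server-0]$ for region in <REGION_UUID_1> <REGION_UUID_2>; do
>   consul kv delete "customers/<CUSTOMER_ID>/regions/${region}/services/vouch/vault/host_signing_token"
> done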
- Exit from the decco-consul-consul-server pod.
- Manually run the vouch-renew-token Job. Repeat this step for every affected region by changing <AFFECTED_NS>.
$ kubectl create job --from=cronjob/vouch-renew-token vouch-renew-token-manual -n <AFFECTED_NS>
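To confirm the manual Job completed and to review what it did, you can wait on it and read its logs; both are standard kubectl commands, with vouch-renew-token-manual being the Job name created above:
# Wait for the Job to finish, then inspect its output
$ kubectl wait --for=condition=complete job/vouch-renew-token-manual -n <AFFECTED_NS> --timeout=120s
$ kubectl logs job/vouch-renew-token-manual -n <AFFECTED_NS>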
- Check whether the vouch-keystone and vouch-noauth pods are back in a healthy state.
$ kubectl get pods --all-namespaces | grep vouch
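Once the token is renewed, every container in each pod should report ready; the counts below are illustrative, derived from the listings in the Diagnostics section:
[INFRA_NS] vouch-keystone-POD 2/2 Running 0 3h
[INFRA_NS] vouch-noauth-POD 3/3 Running 0 3h
[WORKLOAD_NS] vouch-keystone-POD 2/2 Running 0 3h
[WORKLOAD_NS] vouch-noauth-POD 3/3 Running 0 3h
Re-running the cURL test from the Diagnostics section should now succeed instead of returning "permission denied".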
- If these steps prove insufficient to resolve the issue, reach out to the Platform9 Support Team for additional assistance.