Multiple PCD Services Down and Hosts Unresponsive Due to Sealed Decco Vault

Problem

  • Hosts become unresponsive and report as offline.
  • Critical Platform9 services on the host, such as pf9-ostackhost, pf9-cindervolume-base, pf9-neutron-ovn-metadata-agent, and pf9-novncproxy, enter a failed state.

Environment

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher

Cause

The following pods on the control plane were stuck in the Initializing phase:

  • Decco Vault in the default namespace
  • Vouch Keystone in the corresponding region namespace
  • Vouch NoAuth in the corresponding region namespace
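
One way to spot the stuck pods and the nodes they run on is sketched below. It is a minimal example, not the article's original command: the function name `list_auth_pods` is hypothetical, and the default pattern `vault|vouch` assumes the pod names contain those strings.

```shell
# List pods across all namespaces whose lines match a pattern, keeping the header row.
# Assumption: the Decco Vault and Vouch pod names contain "vault" or "vouch".
list_auth_pods() {
  pattern="${1:-vault|vouch}"
  kubectl get pods --all-namespaces -o wide | grep -Ei "^NAMESPACE|$pattern"
}
```

Calling `list_auth_pods` with no arguments prints the matching pods with their STATUS and NODE columns. The NODE column is what the cordon step in the Resolution section needs.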

The Decco Vault was found to be in a sealed state. A sealed state is a security posture where the vault's data is encrypted and inaccessible.
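
The seal state can be checked roughly as follows. This sketch assumes that Decco Vault is a HashiCorp Vault instance and that its container ships the `vault` CLI, neither of which is confirmed by this article. The function names are hypothetical, and the pod name is supplied by the operator. It relies on the documented exit codes of `vault status`: 0 for unsealed, 2 for sealed, 1 for an error.

```shell
# Map the exit code of `vault status` to a seal state.
# Documented exit codes: 0 = unsealed, 2 = sealed, anything else = error.
seal_state_from_exit() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Run `vault status` inside a pod and print the seal state.
# Assumption: the container image includes the `vault` binary.
check_vault_seal() {
  pod="$1"
  ns="${2:-default}"
  rc=0
  kubectl exec -n "$ns" "$pod" -- vault status >/dev/null 2>&1 || rc=$?
  seal_state_from_exit "$rc"
}
```

If the pod is still initializing, `kubectl exec` itself fails and the function reports `error` rather than `sealed`, so the result is only meaningful once the container is running.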

Diagnostics

Most of the service logs report connectivity errors.

ostackhost logs
cindervolume-base logs
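
A quick way to gauge how many connectivity errors a service log contains is sketched here. The function name and the search patterns are illustrative assumptions. They are generic error strings, not messages quoted from the affected hosts.

```shell
# Count lines in a log file that look like connectivity errors (case-insensitive).
# Assumption: the patterns below are generic examples; adjust them to the actual messages.
count_connectivity_errors() {
  log_file="$1"
  grep -Eic 'connection refused|timed out|unreachable|connection reset' "$log_file" || true
}
```

For example, `count_connectivity_errors /var/log/pf9/ostackhost.log` prints the number of matching lines. The log path is an assumption and may differ on your hosts.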

Resolution

  1. Cordon the node where the Decco Vault, Vouch Keystone, and Vouch NoAuth pods are scheduled.
Command

If all three pods are located on the same node, you only need to cordon that single node. However, if the pods are distributed across different nodes, cordon each node and restart the pods one by one. The primary goal is to reschedule the pods away from their current node.
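
The cordon step can be sketched as below. This is not the article's original command, which was lost. The function names are hypothetical, and the `namespace/pod` arguments are placeholders that the operator supplies. Only standard `kubectl get pod -o jsonpath` and `kubectl cordon` calls are used.

```shell
# Print the name of the node that currently hosts a pod.
node_of_pod() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# Cordon each distinct node hosting the given pods.
# Arguments are "namespace/pod" pairs; a node shared by several pods is cordoned once.
cordon_nodes_of_pods() {
  seen=""
  for ref in "$@"; do
    ns="${ref%%/*}"
    pod="${ref#*/}"
    node="$(node_of_pod "$pod" "$ns")"
    [ -n "$node" ] || continue          # pod not scheduled yet, nothing to cordon
    case " $seen " in
      *" $node "*) ;;                   # this node was already cordoned
      *) kubectl cordon "$node"; seen="$seen $node" ;;
    esac
  done
}
```

A call would look like `cordon_nodes_of_pods default/<vault-pod> <region-namespace>/<vouch-keystone-pod> <region-namespace>/<vouch-noauth-pod>`. Once the pods are running on other nodes, a cordoned node can be returned to service with `kubectl uncordon <node>`.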

  2. Perform a rollout restart of the following deployments, in the sequence listed:
  • Decco Vault in the default namespace
  • Vouch Keystone in the corresponding region namespace
  • Vouch NoAuth in the corresponding region namespace
Command
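
A minimal sketch of the ordered restart follows. The function name is hypothetical, and the deployment names and region namespace are placeholders because this article does not list them. Each restart waits for `kubectl rollout status` to succeed before the next one starts, which preserves the sequence.

```shell
# Restart deployments one at a time, in the order given.
# Arguments are "namespace/deployment" pairs; stops at the first failure.
restart_deployments_in_order() {
  for ref in "$@"; do
    ns="${ref%%/*}"
    deploy="${ref#*/}"
    kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
    # Wait until the new pods are ready before moving on (timeout is an arbitrary choice).
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=300s || return 1
  done
}
```

A call would look like `restart_deployments_in_order default/<vault-deployment> <region-namespace>/<vouch-keystone-deployment> <region-namespace>/<vouch-noauth-deployment>`.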
  3. Restart the pf9-hostagent service on the affected host.
Command
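
The host agent restart can be sketched as follows, assuming pf9-hostagent is managed by systemd on the host. The function name is hypothetical.

```shell
# Restart the host agent and print its resulting state.
# Assumption: pf9-hostagent is a systemd unit and the user has sudo rights.
restart_hostagent() {
  sudo systemctl restart pf9-hostagent || return 1
  systemctl is-active pf9-hostagent
}
```

Run `restart_hostagent` on the affected host. It prints `active` once the service is up.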

Validation

  1. The hypervisor returns to a healthy state.
  2. All Platform9 services run normally on the affected node.
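
The service check can be scripted roughly as shown here. The function name is hypothetical, and the sketch assumes the Platform9 services are systemd units. The service names in the usage note are the ones listed in the Problem section.

```shell
# Print the state of each service and return non-zero if any is not active.
report_service_states() {
  failed=0
  for svc in "$@"; do
    # `systemctl is-active` prints the state and exits non-zero unless it is "active".
    state="$(systemctl is-active "$svc" 2>/dev/null)" || failed=1
    echo "$svc: ${state:-unknown}"
  done
  return "$failed"
}
```

For example, `report_service_states pf9-hostagent pf9-ostackhost pf9-cindervolume-base pf9-neutron-ovn-metadata-agent pf9-novncproxy` prints one line per service and exits with status 0 only when all of them are active.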