Multiple PCD Services Down and Hosts Unresponsive Due to Sealed Decco Vault
Problem
Hosts become unresponsive and report as offline.
Critical Platform9 services on the host, such as pf9-ostackhost, pf9-cindervolume-base, pf9-neutron-ovn-metadata-agent, and pf9-novncproxy, enter a failed state.
Environment
Self-Hosted Private Cloud Director Virtualization
Cause
The following pods on the control plane were stuck in the Initializing phase:
Decco Vault in the default namespace
Vouch Keystone in the corresponding region namespace
Vouch NoAuth in the corresponding region namespace
The Decco Vault was found to be in a sealed state. A sealed state is a security posture where the vault's data is encrypted and inaccessible.
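The seal state can be confirmed from the control plane along the lines of the sketch below. It rests on several assumptions that are not stated above: that Decco Vault is backed by HashiCorp Vault, so the `vault status -format=json` command is available inside the pod and reports a boolean `sealed` field; that the vault container is running so `kubectl exec` can reach it; and that the pod name (`decco-vault-0` here) is a hypothetical placeholder to be replaced with the real one.

```python
import json
import shlex
import subprocess


def vault_status_cmd(pod, namespace="default"):
    """Build the kubectl argv that asks the vault pod for its status as JSON."""
    return ["kubectl", "exec", "-n", namespace, pod, "--",
            "vault", "status", "-format=json"]


def is_sealed(status_json):
    """Return True when the status document reports "sealed": true."""
    return bool(json.loads(status_json)["sealed"])


def check_vault(pod, namespace="default"):
    """Run the status command and report the seal state.

    `vault status` exits with code 2 when the vault is sealed, so a
    non-zero exit code is expected and must not be treated as a failure.
    """
    result = subprocess.run(vault_status_cmd(pod, namespace),
                            capture_output=True, text=True, check=False)
    return is_sealed(result.stdout)


if __name__ == "__main__":
    # Placeholder pod name (assumption); only prints the command to run.
    print(shlex.join(vault_status_cmd("decco-vault-0")))
```

Calling `check_vault("<pod-name>")` on a machine with `kubectl` access returns `True` while the vault is sealed. If the command produces no JSON, for example because the container is not running, `is_sealed` raises an error rather than guessing.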
Diagnostics
Most of the service logs report connectivity errors.
Resolution
Cordoned the node where the Decco Vault, Vouch Keystone, and Vouch NoAuth pods are scheduled.
If all three pods are on the same node, only that node needs to be cordoned. If the pods are distributed across different nodes, cordon each node and restart the pods one by one. The goal is to have the pods rescheduled away from their current nodes.
Performed a rollout restart of the following deployments, in this order:
Decco Vault in the default namespace
Vouch Keystone in the corresponding region namespace
Vouch NoAuth in the corresponding region namespace
Restarted the pf9-hostagent service on the affected host.
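The resolution steps above can be sketched as a small command planner. This is an illustration under stated assumptions, not the exact procedure that was run: the deployment names (`decco-vault`, `vouch-keystone`, `vouch-noauth`), the node names, and the region namespace are hypothetical placeholders. The node each pod runs on can be read from the NODE column of `kubectl get pods -n <namespace> -o wide`. The script only prints the commands; it does not execute them.

```python
import shlex

# Hypothetical deployment names (assumption); substitute the real ones.
VAULT = "decco-vault"
KEYSTONE = "vouch-keystone"
NOAUTH = "vouch-noauth"


def cordon_cmds(pod_nodes):
    """One `kubectl cordon` per distinct node, in first-seen order."""
    nodes = list(dict.fromkeys(pod_nodes.values()))
    return [["kubectl", "cordon", node] for node in nodes]


def rollout_cmds(region_namespace):
    """Restart each deployment in sequence, waiting for each to finish."""
    order = [("default", VAULT),
             (region_namespace, KEYSTONE),
             (region_namespace, NOAUTH)]
    cmds = []
    for namespace, name in order:
        target = f"deployment/{name}"
        cmds.append(["kubectl", "rollout", "restart", target, "-n", namespace])
        cmds.append(["kubectl", "rollout", "status", target, "-n", namespace])
    return cmds


def hostagent_cmd():
    """To be run on the affected host, not on the control plane."""
    return ["sudo", "systemctl", "restart", "pf9-hostagent"]


def plan(pod_nodes, region_namespace):
    """Full command sequence: cordon, rolling restarts, host agent restart."""
    return (cordon_cmds(pod_nodes)
            + rollout_cmds(region_namespace)
            + [hostagent_cmd()])


if __name__ == "__main__":
    # Example layout (assumption): two pods share node-a, one is on node-b.
    example = {VAULT: "node-a", KEYSTONE: "node-a", NOAUTH: "node-b"}
    for cmd in plan(example, "my-region"):
        print(shlex.join(cmd))
```

Because the nodes are cordoned first, the pods created by each rollout restart cannot land on their previous node. A cordoned node stays unschedulable until `kubectl uncordon <node>` is run.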
Validation
The hypervisor returned to a healthy state.
All Platform9 services were observed to be running normally on the affected node.
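A quick way to check the services on the affected host is sketched below, using `systemctl is-active`. The unit list is taken from the services named in the Problem section plus pf9-hostagent; which of them are actually installed depends on the roles assigned to the host, so treat the list as an assumption and trim it as needed.

```python
import shlex
import subprocess

# Units named in this article; not every host runs all of them.
SERVICES = [
    "pf9-hostagent",
    "pf9-ostackhost",
    "pf9-cindervolume-base",
    "pf9-neutron-ovn-metadata-agent",
    "pf9-novncproxy",
]


def is_active_cmd(unit):
    """Build the systemctl argv that prints the unit's active state."""
    return ["systemctl", "is-active", unit]


def not_active(states):
    """Return the units whose reported state is anything but 'active'."""
    return [unit for unit, state in states.items() if state != "active"]


def collect_states(units=SERVICES):
    """Query each unit on the local host and map unit name to state."""
    states = {}
    for unit in units:
        # is-active exits non-zero for inactive or failed units, which is
        # expected here, so the exit code is not checked.
        result = subprocess.run(is_active_cmd(unit),
                                capture_output=True, text=True, check=False)
        states[unit] = result.stdout.strip()
    return states


if __name__ == "__main__":
    # Only prints the commands; call collect_states() on the host itself.
    for unit in SERVICES:
        print(shlex.join(is_active_cmd(unit)))
```

On the host, `not_active(collect_states())` returning an empty list corresponds to all listed services running normally.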