Multiple PCD Services Down and Hosts Unresponsive Due to Sealed Decco Vault
Problem
- Hosts become unresponsive and report as offline.
- Critical Platform9 services on the host, such as pf9-ostackhost, pf9-cindervolume-base, pf9-neutron-ovn-metadata-agent, and pf9-novncproxy, enter a failed state.
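On an affected host, the failed units can be confirmed with systemctl. A minimal check, using the unit names listed above:
# systemctl list-units 'pf9-*' --all
# systemctl status pf9-ostackhost pf9-cindervolume-base pf9-neutron-ovn-metadata-agent pf9-novncproxy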
Environment
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
Cause
The following pods on the control plane were stuck in the Initializing phase:
- Decco Vault in the default namespace
- Vouch Keystone in the corresponding region namespace
- Vouch NoAuth in the corresponding region namespace
The Decco Vault was found to be in a sealed state. A sealed state is a security posture where the vault's data is encrypted and inaccessible.
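The stuck pods and the seal state can be confirmed from the control plane. The pod and namespace names below are placeholders that vary by deployment; vault status reports Sealed: true while the vault is sealed:
$ kubectl get pods -n default | grep -i vault
$ kubectl get pods -n <region-namespace> | grep -i vouch
$ kubectl exec -n default <vault-pod-name> -- vault status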
Diagnostics
Most of the service logs report connectivity errors:
ERROR oslo.messaging._drivers.impl_rabbit [REQ_ID] AMQP server on 127.0.0.1:5673 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
ERROR oslo.messaging._drivers.impl_rabbit [REQ_ID] AMQP server on 127.0.0.1:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
ERROR oslo.messaging._drivers.impl_rabbit [-] [REQ_ID] AMQP server on 127.0.0.1:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
EADDRINUSE encountered during SNI client creation ... retrying in 20000 milliseconds
ERROR model server went away: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
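These errors surface in the OpenStack service logs on the affected host. One way to review them, assuming the services run under systemd and the default /var/log/pf9 log location:
# journalctl -u pf9-ostackhost --no-pager --since "1 hour ago"
# grep -riE 'ECONNREFUSED|unreachable|Lost connection' /var/log/pf9/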
Resolution
- Cordoned the node where the Decco Vault, Vouch Keystone, and Vouch NoAuth pods are scheduled.
$ kubectl cordon <node_name>
If all three pods are located on the same node, only that single node needs to be cordoned. However, if the pods are distributed across different nodes, cordon each node and restart the pods one by one. The primary goal is to reschedule the pods off their current node; see the check below for locating them.
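To identify which nodes the three pods are scheduled on before cordoning, a wide pod listing works; the grep pattern assumes the pod names contain vault or vouch:
$ kubectl get pods -A -o wide | grep -iE 'vault|vouch'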
- Performed a rollout restart of the following deployments, in this sequence:
- Decco Vault in the default namespace
- Vouch Keystone in the corresponding region namespace
- Vouch NoAuth in the corresponding region namespace
$ kubectl rollout restart deployment <deployment-name> -n <namespace>
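A sketch of the full sequence with placeholder deployment names, since the exact names vary by deployment; kubectl rollout status waits for each restart to complete before moving on to the next deployment:
$ kubectl rollout restart deployment <decco-vault-deployment> -n default
$ kubectl rollout status deployment <decco-vault-deployment> -n default
$ kubectl rollout restart deployment <vouch-keystone-deployment> -n <region-namespace>
$ kubectl rollout status deployment <vouch-keystone-deployment> -n <region-namespace>
$ kubectl rollout restart deployment <vouch-noauth-deployment> -n <region-namespace>
$ kubectl rollout status deployment <vouch-noauth-deployment> -n <region-namespace>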
- Restarted the pf9-hostagent service on the affected host.
# systemctl restart pf9-hostagent
Validation
- The hypervisor returned to a healthy state.
- All Platform9 services were observed to be running normally on the affected node.
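A quick post-fix check, using the same placeholder names as above: the control-plane pods should all be Running, and no pf9-* unit on the host should be in a failed state.
$ kubectl get pods -A | grep -iE 'vault|vouch'
# systemctl --failed 'pf9-*'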