Multiple PCD Services Down and Hosts Unresponsive Due to Sealed Decco Vault
Problem
- Hosts become unresponsive and report as offline.
- Critical Platform9 services on the host, such as pf9-ostackhost, pf9-cindervolume-base, pf9-neutron-ovn-metadata-agent, and pf9-novncproxy, enter a failed state.
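On an affected host, the failed units can be confirmed with systemctl. A minimal check, using the unit names listed above:
# systemctl list-units 'pf9-*' --all
# systemctl status pf9-ostackhost pf9-cindervolume-base pf9-neutron-ovn-metadata-agent pf9-novncproxy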
Environment
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
Cause
The following pods on the control plane were stuck in the Initializing phase:
- Decco Vault in the default namespace
- Vouch Keystone in the corresponding region namespace
- Vouch NoAuth in the corresponding region namespace
The Decco Vault was found to be in a sealed state. A sealed state is a security posture where the vault's data is encrypted and inaccessible.
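The stuck pods and the seal state can be confirmed from the control plane. The pod and namespace names below are placeholders that vary by deployment; vault status reports Sealed: true while the vault is sealed:
$ kubectl get pods -n default | grep -i vault
$ kubectl get pods -n <region-namespace> | grep -i vouch
$ kubectl exec -n default <vault-pod-name> -- vault status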
Diagnostics
Most of the service logs report connectivity errors:
ERROR oslo.messaging._drivers.impl_rabbit [REQ_ID] AMQP server on 127.0.0.1:5673 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
ERROR oslo.messaging._drivers.impl_rabbit [REQ_ID] AMQP server on 127.0.0.1:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
ERROR oslo.messaging._drivers.impl_rabbit [-] [REQ_ID] AMQP server on 127.0.0.1:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
EADDRINUSE encountered during SNI client creation ... retrying in 20000 milliseconds
ERROR model server went away: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
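These errors surface in the OpenStack service logs on the affected host. One way to review them, assuming the services run under systemd and the default /var/log/pf9 log location:
# journalctl -u pf9-ostackhost --no-pager --since "1 hour ago"
# grep -riE 'ECONNREFUSED|unreachable|Lost connection' /var/log/pf9/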
Resolution
- Cordoned the node where the Decco Vault, Vouch Keystone, and Vouch NoAuth pods are scheduled.
$ kubectl cordon <node_name>
If all three pods are located on the same node, only that single node needs to be cordoned. However, if the pods are distributed across different nodes, cordon each node and restart the pods one by one. The primary goal is to reschedule the pods off their current node; see the check below for locating them.
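To identify which nodes the three pods are scheduled on before cordoning, a wide pod listing works; the grep pattern assumes the pod names contain vault or vouch:
$ kubectl get pods -A -o wide | grep -iE 'vault|vouch'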
- Performed a rollout restart of the following deployments, in this sequence:
- Decco Vault in the default namespace
- Vouch Keystone in the corresponding region namespace
- Vouch NoAuth in the corresponding region namespace
$ kubectl rollout restart deployment <deployment-name> -n <namespace>
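A sketch of the full sequence with placeholder deployment names, since the exact names vary by deployment; kubectl rollout status waits for each restart to complete before moving on to the next deployment:
$ kubectl rollout restart deployment <decco-vault-deployment> -n default
$ kubectl rollout status deployment <decco-vault-deployment> -n default
$ kubectl rollout restart deployment <vouch-keystone-deployment> -n <region-namespace>
$ kubectl rollout status deployment <vouch-keystone-deployment> -n <region-namespace>
$ kubectl rollout restart deployment <vouch-noauth-deployment> -n <region-namespace>
$ kubectl rollout status deployment <vouch-noauth-deployment> -n <region-namespace>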
- Restarted the pf9-hostagent service on the affected host.
# systemctl restart pf9-hostagent
Validation
- The hypervisor returned to a healthy state.
- All Platform9 services were observed to be running normally on the affected node.
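A quick post-fix check, using the same placeholder names as above: the control-plane pods should all be Running, and no pf9-* unit on the host should be in a failed state.
$ kubectl get pods -A | grep -iE 'vault|vouch'
# systemctl --failed 'pf9-*'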