Percona PXC Cluster Crash Due to Expired Galera SSL Certificates

Problem

  • The PCD management plane becomes partially or fully non-functional, with the GUI displaying the following alert:

PCD UI Alert
One or more management plane services are degraded. Some actions might take longer or fail unexpectedly.
  • Hosts stop appearing in the PCD UI - the Hosts page is empty or shows no results.

  • Running airctl status on the control plane node returns a fatal error:

Control Plane Node
fatal error: failed to getting status: failed to get the consul token: failed to unmarshal kplaneToken: unexpected end of JSON input
  • The percona-db-pxc-db-haproxy-0/1/2 pods report only 1/2 containers running with a rapidly increasing restart count.

Control Plane Node
kubectl get pods -n <REGION_NAMESPACE> -owide | grep percona
percona-db-pxc-db-haproxy-0   1/2   Running     [HIGH_RESTART_COUNT]   [AGE]   [POD_IP]    [NODE_IP]   <none>   <none>
percona-db-pxc-db-haproxy-1   1/2   Running     [HIGH_RESTART_COUNT]   [AGE]   [POD_IP]    [NODE_IP]   <none>   <none>
percona-db-pxc-db-haproxy-2   1/2   Running     [HIGH_RESTART_COUNT]   [AGE]   [POD_IP]    [NODE_IP]   <none>   <none>
percona-db-pxc-db-pxc-0       3/3   Running     [RESTART_COUNT]        [AGE]   [POD_IP]    [NODE_IP]   <none>   <none>
percona-db-pxc-db-pxc-1       3/3   Running     [RESTART_COUNT]        [AGE]   [POD_IP]    [NODE_IP]   <none>   <none>
percona-db-pxc-db-pxc-2       0/3   Terminating [RESTART_COUNT]        [AGE]   [POD_IP]    [NODE_IP]   <none>   <none>
  • The resmgr pod loses MySQL connectivity, logging the following error:

Sample Output
MySQLdb.OperationalError: (2002, "Can't connect to server on 'percona-db-pxc-db-haproxy.[REGION_NS].svc.cluster.local' (115)")
  • The pxc pod logs show repeated sslv3 alert certificate expired errors on Galera port 4567.

Sample Output — Expired SSL Certificates
sslv3 alert certificate expired
error:14094415:SSL routines:ssl3_read_bytes:sslv3 alert certificate expired

Environment

  • Self-Hosted Private Cloud Director Virtualization - v2025.10 and Higher

  • Component: Percona XtraDB Cluster (PXC)

Cause

The Percona XtraDB Cluster operator manages internal TLS certificates for Galera inter-node communication on port 4567. These certificates are self-signed and operator-generated, with no built-in expiry alerting or automated rotation. Once the certificates expire, every attempt to re-form the Galera cluster fails with an SSL handshake error, and the operator's autoRecovery mechanism, which depends on inter-node SSL communication to coordinate bootstrap, can never complete. This is why a PXC cluster can remain in FULL_PXC_CLUSTER_CRASH state for an extended period despite autoRecovery being enabled.

Additional conditions that can compound the outage and complicate recovery:

  • An OOMKill of the primary pxc pod triggering the cluster crash.

  • Accumulated xb-cron backup jobs simultaneously requesting full SSTs on recovery, spiking memory on the bootstrap pod.

  • A PXC pod stuck in Terminating state due to a stale VolumeAttachment record, blocking rescheduling.

Diagnostics

1

Check the state of Percona PXC and HAProxy pods

Verify the pod status in the management namespace. Unhealthy HAProxy pods running 1/2 and PXC pods with a high restart count confirm the cluster is in a degraded or crashed state.

The 1/2 count on HAProxy pods indicates the haproxy container is failing to connect to the Galera backend. A PXC pod in Terminating state for more than a few minutes signals a potential Multi-Attach PVC issue (see Step 4).
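As a minimal sketch of this check, the helper below filters the pod listing down to pods whose READY column is not N/N. The function name unready_pods is hypothetical and is not part of any PCD tooling; it only post-processes standard kubectl output.

```shell
# Print the names of pods whose READY column is not N/N (for example 1/2 or 0/3).
# Reads `kubectl get pods --no-headers` output on stdin.
# unready_pods is a hypothetical helper name used only for illustration.
unready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage on the control plane node (the namespace is a placeholder):
#   kubectl get pods -n <REGION_NAMESPACE> --no-headers | grep percona | unready_pods
```

Any haproxy pod at 1/2 or PXC pod at 0/3 appears in the output; an empty result means every listed pod reports all containers ready.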

2

Confirm the FULL_PXC_CLUSTER_CRASH banner and identify the highest seqno

Check the logs of each PXC pod for the FULL_PXC_CLUSTER_CRASH banner and note the seqno reported by each. The pod with the highest seqno is the bootstrap source.

Record the seqno value for each pod for use in the bootstrap step (Workaround Step 5), then continue to Step 3.
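The log check can be sketched as follows, assuming the banner and the seqno appear in the pxc container log of each pod. Both function names are hypothetical, and the "pod seqno" pair format consumed by highest_seqno_pod is an illustrative convention for values you record by hand, not operator output.

```shell
# Print log lines that mention the crash banner or a seqno for one PXC pod.
# $1 = namespace, $2 = pod name. Hypothetical helper for illustration.
crash_banner_lines() {
  kubectl logs -n "$1" "$2" -c pxc | grep -E 'FULL_PXC_CLUSTER_CRASH|seqno'
}

# Given "pod seqno" pairs on stdin (one per line, recorded by hand),
# print the pod with the highest seqno.
highest_seqno_pod() {
  sort -k2,2n | tail -n 1 | awk '{ print $1 }'
}

# Usage (namespace and seqno values are placeholders):
#   crash_banner_lines <REGION_NAMESPACE> percona-db-pxc-db-pxc-0
#   printf '%s\n' 'percona-db-pxc-db-pxc-0 <SEQNO>' 'percona-db-pxc-db-pxc-1 <SEQNO>' | highest_seqno_pod
```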

3

Check for expired Galera SSL certificates

This is the most common reason autoRecovery fails silently for an extended period. Check the pxc logs for SSL handshake errors on port 4567.

Note: In practice, SSL expiry errors tend to surface most prominently on replica pods (e.g. percona-db-pxc-db-pxc-1), because they are the first to attempt reconnection after the primary is lost. This article uses percona-db-pxc-db-pxc-1 as the example, but check the logs of all pods regardless.

If this error appears repeatedly, the Galera internal TLS certificates have expired. The cluster cannot auto-recover until the certificates are renewed (see Workaround Step 1). If this error is absent, proceed to Step 4.
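A minimal sketch of the log check, assuming the errors are written to the pxc container log in the form shown in the sample output above. The function name count_cert_expired is hypothetical.

```shell
# Count log lines that report an expired certificate for one PXC pod.
# $1 = namespace, $2 = pod name. Hypothetical helper for illustration.
count_cert_expired() {
  # `|| true` keeps the exit status at 0 when grep finds no matches (count 0).
  kubectl logs -n "$1" "$2" -c pxc | grep -c 'certificate expired' || true
}

# Usage (the namespace is a placeholder):
#   count_cert_expired <REGION_NAMESPACE> percona-db-pxc-db-pxc-1
```

A count that keeps growing between runs indicates the handshake failures are ongoing rather than historical.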

4

Check for a PVC Multi-Attach error on a stuck Terminating pod

If any PXC pod has been in Terminating state for more than a few minutes, check for a Multi-Attach PVC error that is blocking rescheduling.

If the Multi-Attach warning is present, the PVC is still attached to the old node and must be cleared (see Workaround Step 2) before the bootstrap step.
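One way to look for the warning is to search the pod description, which includes recent events. This is a sketch only; the function name multi_attach_events is hypothetical.

```shell
# Print any Multi-Attach lines from the events section of a pod description.
# $1 = namespace, $2 = pod name. Hypothetical helper for illustration.
# Returns a non-zero status when no Multi-Attach line is present.
multi_attach_events() {
  kubectl describe pod -n "$1" "$2" | grep -i 'Multi-Attach'
}

# Usage (the namespace is a placeholder):
#   multi_attach_events <REGION_NAMESPACE> percona-db-pxc-db-pxc-2
```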

5

Confirm resmgr MySQL connectivity failure

Verify that the loss of MySQL connectivity is the direct cause of hosts not appearing in the PCD UI.

If all checks above match, proceed to the Workaround section.

Workaround

The recovery procedure must be performed in the order below. Skipping the SSL certificate renewal step (Step 1) before attempting bootstrap will result in the cluster immediately re-entering the crash-wait loop.

1

Renew the expired Galera internal SSL/TLS certificates

If Step 3 of Diagnostics confirmed SSL certificate expiry errors, the certificates must be renewed before any bootstrap attempt. Attempting bootstrap with expired certificates will cause the cluster to re-enter the crash-wait loop.

The Galera internal TLS certificates are managed by the Percona XtraDB Cluster operator.

After the certificates are rotated, confirm that the SSL errors no longer appear in pxc-1 logs before proceeding to Step 2:
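The confirmation can be sketched as below: one helper counts expiry errors in a recent log window, and a second prints the expiry date of the certificate stored in a TLS secret. Both function names are hypothetical, and the secret name is deliberately left as a placeholder because this article does not state it; list the secrets in the namespace to find the one holding the Galera internal certificate. The sketch assumes the secret stores the certificate under the tls.crt key and that openssl is installed on the node.

```shell
# Count expired-certificate errors logged by one PXC pod within a recent window.
# $1 = namespace, $2 = pod name, $3 = window such as 5m.
recent_cert_errors() {
  kubectl logs -n "$1" "$2" -c pxc --since="$3" | grep -c 'certificate expired' || true
}

# Print the notAfter (expiry) date of the certificate held in a TLS secret.
# $1 = namespace, $2 = secret name (placeholder, not taken from this article).
# Assumes the certificate is stored under the tls.crt key.
secret_cert_enddate() {
  kubectl get secret -n "$1" "$2" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -enddate
}

# Usage (namespace and secret name are placeholders):
#   recent_cert_errors <REGION_NAMESPACE> percona-db-pxc-db-pxc-1 5m
#   secret_cert_enddate <REGION_NAMESPACE> <SSL_INTERNAL_SECRET>
```

A result of 0 from recent_cert_errors, together with a notAfter date in the future, suggests the renewed certificates are in use.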

2

Resolve any stuck Terminating PXC pod

If a PXC pod is stuck in Terminating state due to a Multi-Attach PVC error (confirmed in Diagnostics Step 4), force-delete the pod and clear the stale VolumeAttachment record before attempting the bootstrap.

Identify the stale VolumeAttachment record for the affected PVC:

Force-delete the stuck pod:

Delete the stale VolumeAttachment record so the CSI driver can re-attach the volume to the correct node:

Wait for the pod to reschedule and reach Running state before continuing.
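The three actions in this step can be sketched with standard kubectl commands. The wrapper function names are hypothetical, and the PVC and VolumeAttachment names are placeholders to be read from your own cluster.

```shell
# Print the PV name bound to a PVC. $1 = namespace, $2 = PVC name.
pv_for_pvc() {
  kubectl get pvc -n "$1" "$2" -o jsonpath='{.spec.volumeName}'
}

# List VolumeAttachment records with the PV and node each one references.
list_volume_attachments() {
  kubectl get volumeattachment \
    -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName
}

# Force-delete a pod stuck in Terminating. $1 = namespace, $2 = pod name.
force_delete_pod() {
  kubectl delete pod -n "$1" "$2" --force --grace-period=0
}

# Delete one stale VolumeAttachment (cluster-scoped, so no namespace). $1 = name.
delete_volume_attachment() {
  kubectl delete volumeattachment "$1"
}

# Usage (all angle-bracket values are placeholders):
#   pv_for_pvc <REGION_NAMESPACE> <PVC_NAME>
#   list_volume_attachments | grep <PV_NAME>
#   force_delete_pod <REGION_NAMESPACE> percona-db-pxc-db-pxc-2
#   delete_volume_attachment <VOLUME_ATTACHMENT_NAME>
```

Confirm that the VolumeAttachment you delete references the old node for the affected PV; deleting an attachment that is in active use can disrupt a healthy pod.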

3

Delete accumulated xb-cron backup PVCs before bootstrapping

During an extended outage, the xb-cron backup operator accumulates PVCs for each failed backup run. When the cluster recovers, these jobs connect as Galera Arbitrators and request full SSTs simultaneously, overwhelming the bootstrap pod with concurrent xtrabackup donor streams and causing repeated OOMKills.

Check how many xb-cron PVCs have accumulated:

If more than two or three xb-cron pods in Error state are present, delete them before bootstrapping. This prevents the SST donor storm:
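A minimal sketch of the listing and cleanup, assuming the backup pods and PVCs carry xb-cron in their names as described above. The function name xb_cron_error_pods is hypothetical, and the delete command is shown only as a commented usage line so that the list can be reviewed first.

```shell
# From `kubectl get pods --no-headers` output on stdin,
# print the names of xb-cron pods whose STATUS column is Error.
# Assumes the pod names contain "xb-cron"; hypothetical helper for illustration.
xb_cron_error_pods() {
  awk '$1 ~ /xb-cron/ && $3 == "Error" { print $1 }'
}

# Usage (the namespace is a placeholder):
#   kubectl get pvc -n <REGION_NAMESPACE> | grep xb-cron
#   kubectl get pods -n <REGION_NAMESPACE> --no-headers | xb_cron_error_pods
# After reviewing the list, delete the listed pods:
#   kubectl get pods -n <REGION_NAMESPACE> --no-headers | xb_cron_error_pods \
#     | xargs kubectl delete pod -n <REGION_NAMESPACE>
```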

4

Identify the grastate.dat sequence number on each PXC pod

Read the Galera state file from each PXC pod to confirm the current seqno values. Select the pod with the highest seqno as the bootstrap source.

Note the seqno value from each pod. If safe_to_bootstrap is 0 on all pods (expected after a full crash), the pod with the highest seqno must be used in the next step.
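Reading the state file can be sketched as below. The path /var/lib/mysql/grastate.dat is the usual Galera data directory location and is an assumption here, as are the hypothetical function names; the exec call only works while the pxc container is running.

```shell
# Print the Galera state file from one PXC pod. $1 = namespace, $2 = pod name.
# The file path is an assumption (default MySQL data directory).
read_grastate() {
  kubectl exec -n "$1" "$2" -c pxc -- cat /var/lib/mysql/grastate.dat
}

# Extract one field (seqno, safe_to_bootstrap, uuid) from grastate.dat on stdin.
grastate_field() {
  awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

# Usage (the namespace is a placeholder):
#   read_grastate <REGION_NAMESPACE> percona-db-pxc-db-pxc-0 | grastate_field seqno
#   read_grastate <REGION_NAMESPACE> percona-db-pxc-db-pxc-0 | grastate_field safe_to_bootstrap
```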

5

Bootstrap the cluster from the pod with the highest seqno

Send the USR1 signal to the pxc container on the pod identified in Step 4. Replace <BOOTSTRAP_POD_NAME> with the pod name that reported the highest seqno (e.g., percona-db-pxc-db-pxc-0).

Monitor the bootstrap pod logs immediately after running the command:

The bootstrap pod starts mysqld as the Galera primary component. The remaining PXC pods join the cluster and resync. If OOMKills recur on the bootstrap pod immediately after this step, additional accumulated xb-cron SST requests are likely the cause — return to Step 3 and delete any remaining Error-state xb-cron pods, then rebootstrap.
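The signal and the log follow-up described in this step can be sketched as follows. The wrapper names are hypothetical; the sketch assumes the process to signal runs as PID 1 inside the pxc container, which is the form of the manual recovery command printed in the operator's FULL_PXC_CLUSTER_CRASH banner.

```shell
# Send USR1 to PID 1 in the pxc container of the chosen bootstrap pod.
# $1 = namespace, $2 = bootstrap pod name. Hypothetical wrapper for illustration.
bootstrap_pxc() {
  kubectl exec -n "$1" "$2" -c pxc -- sh -c 'kill -s USR1 1'
}

# Follow the pxc container log of a pod. $1 = namespace, $2 = pod name.
follow_pxc_logs() {
  kubectl logs -n "$1" "$2" -c pxc -f
}

# Usage (namespace and pod name are placeholders):
#   bootstrap_pxc <REGION_NAMESPACE> <BOOTSTRAP_POD_NAME>
#   follow_pxc_logs <REGION_NAMESPACE> <BOOTSTRAP_POD_NAME>
```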

6

Wait for all PXC pods to reach 3/3 Running

After the bootstrap, replica pods rejoin the cluster. During this period, pods may temporarily show 2/3 Running - this is expected while IST (Incremental State Transfer) completes and the request storm from dependent services resolves.

Monitor until all three pods reach 3/3 Running:
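A simple polling loop is one way to monitor this. The helper below is a hypothetical sketch that succeeds only when every pod line it receives is Running with all containers ready; it treats empty input as not ready so that a filter matching nothing does not end the loop early.

```shell
# Succeed only if every pod line on stdin has READY equal to N/N and STATUS Running.
# Reads `kubectl get pods --no-headers` output; fails on empty input.
all_running_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") bad = 1 }
       END { exit ((bad || NR == 0) ? 1 : 0) }'
}

# Usage (the namespace is a placeholder; the same helper works for haproxy pods):
#   until kubectl get pods -n <REGION_NAMESPACE> --no-headers \
#       | grep percona-db-pxc-db-pxc | all_running_ready; do
#     sleep 10
#   done
```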

7

Verify HAProxy pods recover and hosts reappear in the PCD UI

Once all PXC pods are 3/3 Running, confirm HAProxy pods return to 2/2 Running:

Log in to the PCD UI and confirm hosts are visible and the management plane health indicator shows no degraded services.

8

Monitor PXC pod memory usage for 24 hours post-recovery

After recovery, monitor memory consumption on the PXC pods. The 6 Gi memory limit could be insufficient under SST donor load.

If memory usage on any pod exceeds 5 Gi, contact Platform9 Support to review the PXC memory limits.
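The threshold check can be sketched against kubectl top output, assuming a metrics server is available in the cluster and that memory is reported in Mi. The function name pods_over_mem is hypothetical; 5 Gi corresponds to 5120 Mi.

```shell
# From `kubectl top pod --no-headers` output on stdin (NAME CPU MEMORY),
# print pods whose memory exceeds a threshold given in Mi.
# $1 = threshold in Mi (5 Gi = 5120 Mi). Lines not reported in Mi are skipped.
pods_over_mem() {
  awk -v limit="$1" '$3 ~ /Mi$/ {
    mem = $3; sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1, $3
  }'
}

# Usage (the namespace is a placeholder):
#   kubectl top pod -n <REGION_NAMESPACE> --no-headers | grep percona-db-pxc-db-pxc | pods_over_mem 5120
```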

Additional Information

The Engineering team is tracking two improvements, PCD-7798 and PCD-7799, which are planned for inclusion in a future release. They address PXC certificate monitoring and overall resiliency.
