Decco-consul-server pod in Error Status

Problem

The decco-consul-server pod is stuck in Error status, which brings down the LTS3 [SMCP] management plane.

Logs:

Decco-consul-server pod logs

Environment

  • Platform9 Self Managed Cloud Platform (SMCP) - v-5.9.1-3097398.

Cause

This happens when the LTS3 [SMCP] setup has been down for a long time (several days), for example because of disk or CPU issues on a master node in the management cluster. The default value of "server_rejoin_age_max" in Consul is 7 days (168h). This parameter sets the maximum allowed age of a stale server that attempts to rejoin the cluster: a Consul server that has not run for longer than this period refuses to start again until an operator intervenes.
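
To make the cause concrete, the following minimal sketch compares a downtime against the rejoin window. The function name `can_rejoin` is hypothetical, and the comparison is a simplification of Consul's behaviour (the exact boundary handling is an assumption); only the 7-day default comes from the text above.

```python
from datetime import timedelta

# Default of server_rejoin_age_max as stated above: 7 days (168h).
DEFAULT_REJOIN_AGE_MAX = timedelta(days=7)


def can_rejoin(downtime: timedelta,
               rejoin_age_max: timedelta = DEFAULT_REJOIN_AGE_MAX) -> bool:
    """Return True if a server down for `downtime` is still inside the window.

    Simplified model: treating the boundary as inclusive is an assumption.
    """
    return downtime <= rejoin_age_max


# A 3-day outage fits the default window; a 10-day outage does not,
# unless the window is raised (here to 30 days).
print(can_rejoin(timedelta(days=3)))
print(can_rejoin(timedelta(days=10)))
print(can_rejoin(timedelta(days=10), timedelta(days=30)))
```

Under this model, any outage longer than the configured window leaves the server unable to start, which matches the Error status described in the Problem section.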

Resolution

Steps:

1. Edit the decco-consul-consul-server-config ConfigMap and add the server_rejoin_age_max: 2592000s parameter (30 days, up from the default of 7 days) under the extra-from-values.json section.

ConfigMap

Set the server_rejoin_age_max value according to the duration of the downtime: choose a value longer than the time the servers were down.
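
As an illustration of the value calculation and of the resulting JSON, here is a minimal sketch. It assumes that extra-from-values.json holds a single JSON object and that server_rejoin_age_max belongs at its top level. The helper names and the `example_existing_key` entry are hypothetical, and the script only builds the text: it does not touch the cluster.

```python
import json


def days_to_consul_seconds(days: int) -> str:
    """Format a number of days as a seconds duration string, e.g. 30 -> '2592000s'."""
    return f"{days * 24 * 60 * 60}s"


def add_rejoin_age_max(extra_from_values: str, days: int) -> str:
    """Return the JSON text with server_rejoin_age_max set at the top level.

    Existing keys are preserved; an empty input is treated as an empty object.
    """
    config = json.loads(extra_from_values) if extra_from_values.strip() else {}
    config["server_rejoin_age_max"] = days_to_consul_seconds(days)
    return json.dumps(config, indent=2)


# Hypothetical existing content; the real ConfigMap may hold other keys.
existing = '{"example_existing_key": true}'
print(add_rejoin_age_max(existing, 30))
```

The printed object can be compared with the extra-from-values.json section after the ConfigMap has been edited, to confirm that the value matches the intended number of days.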

2. Restart the following pods in the order listed:

Restart pods

3. Once done, verify that all the Consul-related services are active:

Get resource info
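
One possible way to automate this check is to save the output of `kubectl get pods -n <namespace> -o json` and scan it for Consul pods that are not ready. The sketch below does that. The namespace is not stated in this article, and the function name, the "consul" name filter, and the sample pod names are all hypothetical.

```python
import json


def pods_not_ready(kubectl_json: str, name_filter: str = "consul") -> list[str]:
    """Return names of matching pods that are not Running with all containers ready."""
    problems = []
    for pod in json.loads(kubectl_json).get("items", []):
        name = pod["metadata"]["name"]
        if name_filter not in name:
            continue  # skip pods unrelated to Consul
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            problems.append(name)
    return problems


# Illustrative sample shaped like `kubectl get pods -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "decco-consul-consul-server-0"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "example-consul-client"},
         "status": {"phase": "Pending", "containerStatuses": []}},
    ]
})
print(pods_not_ready(sample))
```

An empty list means every matching pod is Running with all of its containers ready.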