Decco-consul-server pod in Error Status

Problem

The decco-consul-server pod is stuck in Error status, which brings down the LTS3 [SMCP] management plane.

#  kubectl get po -A | grep decco-consul
default      decco-consul-consul-server-0         0/1     Error         1 (71s ago)       2m16s

Logs:

2024-03-01T16:14:01.402Z [ERROR] agent: startup error: error="refusing to rejoin cluster because server has been offline for more than the configured server_rejoin_age_max (168h0m0s) - consider wiping your data dir"
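
A minimal sketch of how the startup error above might be pulled out of the pod logs. The pod name and namespace are taken from the kubectl output in this article; the helper name find_rejoin_error is hypothetical, and the use of --previous assumes the container has already restarted at least once.

```shell
# Pod name and namespace as shown in the kubectl output above.
POD="decco-consul-consul-server-0"
NAMESPACE="default"

# Filter a log stream (read from stdin) down to the rejoin-age startup error.
# Hypothetical helper, not part of any Platform9 or Consul tooling.
find_rejoin_error() {
  grep -F "server_rejoin_age_max"
}

# Usage against a live cluster (--previous reads the last terminated container):
# kubectl logs "$POD" -n "$NAMESPACE" --previous | find_rejoin_error
```

If the filter prints nothing, the pod is probably failing for a different reason than the one described in this article.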

Environment

  • Platform9 Self Managed Cloud Platform (SMCP) - v-5.9.1-3097398.

Cause

This issue occurs when the LTS3 [SMCP] setup has been down for a long time (several days), for example because of disk or CPU issues on a master node in the management cluster. Consul's "server_rejoin_age_max" parameter defaults to 7 days (168h). It sets the maximum time a server may have been offline and still be allowed to rejoin the cluster. Once that limit is exceeded, the server refuses to rejoin at startup and the pod fails with the error shown in the logs above.
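
As a rough illustration of the comparison described above, the sketch below checks an offline duration against the limit. The function name exceeds_rejoin_age is hypothetical, and the 240-hour downtime is an example value, not data from a real incident; 168 hours is the default quoted in the log message.

```shell
# Succeeds (exit 0) when the offline time exceeds the limit,
# i.e. the case in which Consul refuses to rejoin the cluster.
# Hypothetical helper for illustration only.
exceeds_rejoin_age() {
  offline_hours=$1
  max_hours=${2:-168}   # default limit: 168h (7 days)
  [ "$offline_hours" -gt "$max_hours" ]
}

# Example: a node that was down for 10 days (240 hours).
if exceeds_rejoin_age 240; then
  echo "offline 240h > 168h: server refuses to rejoin"
fi
```

With a limit of 720 hours (30 days), the same 240-hour outage would stay within the allowed window.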

Resolution

Steps:

1. Edit the decco-consul-consul-server-config ConfigMap and add the server_rejoin_age_max parameter under the extra-from-values.json section, as shown below. The example sets it to 2592000s (30 days, up from the default of 7 days).

[root@test-pf9-du-host-airgap]# kubectl edit cm decco-consul-consul-server-config
kind: ConfigMap
metadata:
  ...
  name: decco-consul-consul-server-config
  namespace: default
...
apiVersion: v1
data:
  central-config.json: |-
    {
      "enable_central_service_config": true
    }
  # Newly added: the extra-from-values.json key and its JSON body
  extra-from-values.json: |-
    {
      "server_rejoin_age_max": "2592000s"
    }
 ...

Info

Choose the server_rejoin_age_max value based on the duration of the downtime: it needs to be longer than the time the server was offline.
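
To pick a value for a different downtime window, the conversion from days to the seconds-based string used in the ConfigMap can be sketched as follows. The function name days_to_rejoin_age is hypothetical; only the "<seconds>s" format and the 30-day figure come from this article.

```shell
# Convert a number of days into the "<seconds>s" form used for
# server_rejoin_age_max in the ConfigMap above.
# Hypothetical helper for illustration only.
days_to_rejoin_age() {
  days=$1
  echo "$(( days * 24 * 60 * 60 ))s"
}

days_to_rejoin_age 30   # prints 2592000s
```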

2. Restart the pods below in the order listed:

3. Once done, verify that all the consul-related services are active:
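
One possible way to run this check, sketched under the assumption that the same kubectl get po -A listing used in the Problem section is an acceptable health signal. The helper name unhealthy_consul_pods is hypothetical, and this is not necessarily the exact verification the procedure intends.

```shell
# Read `kubectl get po -A` output on stdin and print the name and status of
# every consul pod whose STATUS column (4th field) is not "Running".
# Hypothetical helper for illustration only.
unhealthy_consul_pods() {
  awk '/consul/ && $4 != "Running" { print $2, $4 }'
}

# Usage against a live cluster (empty output means no consul pod is unhealthy):
# kubectl get po -A | unhealthy_consul_pods
```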
