Decco-consul-server pod in Error Status
Problem
The decco-consul-server pod is stuck in Error status, which brings down the LTS3 [SMCP] management plane.
# kubectl get po -A | grep decco-consul
default decco-consul-consul-server-0 0/1 Error 1 (71s ago) 2m16s
Logs:
2024-03-01T16:14:01.402Z [ERROR] agent: startup error: error="refusing to rejoin cluster because server has been offline for more than the configured server_rejoin_age_max (168h0m0s) - consider wiping your data dir"
Environment
- Platform9 Self Managed Cloud Platform (SMCP) - v-5.9.1-3097398.
Cause
If the LTS3 [SMCP] setup is down for a long time (several days), for example because of disk or CPU issues on a master node in the management cluster, the Consul server can stay offline for longer than Consul allows. The default value of "server_rejoin_age_max" in Consul is 7 days (168h). This parameter sets the maximum time a server may be offline and still rejoin the cluster. Once that limit is exceeded, the server refuses to start, as shown in the log above.
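To confirm that this is the cause, the limit reported in the startup error can be pulled out of the pod logs. The following is a minimal sketch; the helper name rejoin_limit_from_log is hypothetical, and it assumes the error line has the same wording as the log shown in the Problem section.

```shell
# Extract the configured rejoin limit from a Consul startup error line.
# Reads log lines on stdin and prints the value in parentheses after
# "server_rejoin_age_max" for every line that contains it.
rejoin_limit_from_log() {
  sed -n 's/.*server_rejoin_age_max (\([^)]*\)).*/\1/p'
}

# Usage against the failing pod (pod name and namespace as in this article):
# kubectl logs decco-consul-consul-server-0 -n default | rejoin_limit_from_log
```

If nothing is printed, the pod is probably failing for a different reason and this article may not apply.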
Resolution
Steps:
1. Edit the decco-consul-consul-server-config configmap and add the parameter server_rejoin_age_max: 2592000s (30 days, up from the default 7 days) under the extra-from-values.json section, as shown below:
[root@test-pf9-du-host-airgap]# kubectl edit cm decco-consul-consul-server-config
kind: ConfigMap
metadata:
  name: decco-consul-consul-server-config
  namespace: default
apiVersion: v1
data:
  central-config.json: |-
    {
      "enable_central_service_config": true
    }
  # Newly added lines: the extra-from-values.json block below
  extra-from-values.json: |-
    {
      "server_rejoin_age_max": "2592000s"
    }
Set the server_rejoin_age_max value according to the duration of the downtime; it must be longer than the time the server was offline.
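The value is written as a number of seconds followed by "s". As a small worked example, the conversion from days can be sketched in shell; the function name days_to_rejoin_age is hypothetical and only illustrates the arithmetic.

```shell
# Convert a number of days into the "<seconds>s" form used in the configmap.
days_to_rejoin_age() {
  days=$1
  echo "$((days * 24 * 60 * 60))s"
}

days_to_rejoin_age 30   # prints 2592000s, the value used in this article
days_to_rejoin_age 7    # prints 604800s, equal to the 168h default
```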
2. Restart the following pods in the order listed (the generated suffixes in the pod names will differ in your environment):
# kubectl delete pod/decco-consul-consul-connect-injector-f9c54d6cc-xmg4n
# kubectl delete pod/decco-consul-consul-webhook-cert-manager-6866774b8b-l2mn8
# kubectl delete pod/decco-consul-consul-server-0
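Because the injector and webhook pod names carry generated suffixes, it can help to look the current names up rather than type them. The sketch below is one possible way to do that: consul_restart_order is a hypothetical helper, and it assumes the pods keep the decco-consul-consul- name prefix seen in this article.

```shell
# Print the current Consul pod names in the restart order used above:
# connect-injector, then webhook-cert-manager, then server.
# Reads "pod/<name>" lines on stdin (the format of `kubectl get pods -o name`).
consul_restart_order() {
  pods=$(cat)
  for component in connect-injector webhook-cert-manager server; do
    # "|| true" keeps the function from failing when a component has no pod.
    printf '%s\n' "$pods" | grep "^pod/decco-consul-consul-$component" || true
  done
}

# Usage (assumes the default namespace, as in this article):
# kubectl get pods -n default -o name | consul_restart_order |
#   while read -r pod; do kubectl delete -n default "$pod"; done
```

Running only the first two stages of the pipeline prints the ordered list without deleting anything, which is a reasonable check before the actual restart.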
3. Once done, verify that all the Consul-related pods and services are active:
[root@test-pf9-du-host-airgap]# kubectl get all -A | grep consul
default pod/decco-consul-consul-connect-injector-f9c54d6cc-fqfqt 1/1 Running 0 71m
default pod/decco-consul-consul-server-0 1/1 Running 0 71m
default pod/decco-consul-consul-webhook-cert-manager-6866774b8b-8j67w 1/1 Running 0 71m
default service/decco-consul-consul-connect-injector ClusterIP 10.21.3.14 <none> 443/TCP 2d19h
default service/decco-consul-consul-dns ClusterIP 10.21.0.128 <none> 53/TCP,53/UDP 2d19h
default service/decco-consul-consul-server ClusterIP None <none> 8500/TCP,8502/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP 25d
default service/decco-consul-consul-ui ClusterIP 10.21.2.154 <none> 80/TCP 25d
default deployment.apps/decco-consul-consul-connect-injector 1/1 1 1 2d19h
default deployment.apps/decco-consul-consul-webhook-cert-manager 1/1 1 1 2d19h
default replicaset.apps/decco-consul-consul-connect-injector-f9c54d6cc 1 1 1 2d19h
default replicaset.apps/decco-consul-consul-webhook-cert-manager-6866774b8b 1 1 1 2d19h
default statefulset.apps/decco-consul-consul-server 1/1 25d
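The check on the pod status can also be scripted. This is a minimal sketch, assuming the usual column layout of `kubectl get pods -A` (NAMESPACE NAME READY STATUS ...); the function name check_consul_pods_running is hypothetical, and it looks only at the STATUS column, not at READY.

```shell
# Report Consul pods that are not in the Running state.
# Reads `kubectl get pods -A` output on stdin.
# Prints "<pod> <status>" for each offending pod and returns non-zero
# if at least one is found.
check_consul_pods_running() {
  awk '$2 ~ /consul/ && $4 != "Running" { print $2, $4; bad = 1 } END { exit bad }'
}

# Usage:
# kubectl get pods -A --no-headers | check_consul_pods_running && echo "all Consul pods are Running"
```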