Exec Probe Timeout Enforced from K8s v1.20, Causing Calico Pods to Fail Liveness/Readiness Probes Because the Default Timeout Is 1 Second
Problem
Starting with K8s v1.20, exec probe timeouts are actually enforced, causing Calico pods to fail liveness/readiness probes because the default timeout is 1 second.
Environment
- Platform9 Managed Kubernetes - K8s v1.20 and above
- Calico CNI v3.18
Answer
- The probe failures affect both the calico-kube-controllers pod and all the calico-node pods. Starting with K8s v1.20, exec probes were fixed to actually respect the configured timeout values, and the default 1-second timeout can prove too low for loaded clusters.
- From Calico v3.20, the probe timeouts were increased from 1s to 10s, in part because of this K8s v1.20 fix.
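For reference, Calico v3.20+ ships its node probes with a 10-second timeout; the relevant fragment of the calico-node DaemonSet spec looks roughly like this (values per the description above, not copied from a specific upstream manifest):

```yaml
livenessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-live
      - -bird-live
  periodSeconds: 10
  timeoutSeconds: 10   # raised from the 1s default as of Calico v3.20
```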
# kubectl describe pod calico-node-25d6p -n kube-system
Name: calico-node-25d6p
Containers:
calico-node:
Liveness: exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 6m5s (x124 over 9d) kubelet Readiness probe failed:
E0113 05:24:17.195056 23705 remote_runtime.go:392] ExecSync 4b779e0770fe8633778d71732591687263be8502b7444c95560d44bc61505d72 '/bin/calico-node -felix-live -bird-live' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0113 05:24:17.195205 23705 prober.go:117] Liveness probe for "calico-node-25d6p_kube-system(3f482be0-6e69-4dfd-9553-d6166f23ab75):calico-node" failed (failure):
Workaround Option 1
- On ALL master nodes in the cluster, add the parameter timeoutSeconds: 10 to both /opt/pf9/pf9-kube/conf/networkapps/calico-v1.20.11.yaml and /opt/pf9/pf9-kube/conf/networkapps/calico-v1.20.11-configured.yaml, in two sections each: the probes of the calico-kube-controllers Deployment and of the calico-node DaemonSet spec.
- Perform a complete PMK stack restart on each master node, ONE NODE AT A TIME:
sudo systemctl stop pf9-hostagent pf9-nodeletd
sudo /opt/pf9/nodelet/nodeletd phases stop
sudo systemctl start pf9-hostagent
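The file edit in the first step can be sketched with sed. This is a demonstration on a hypothetical sample probe block, not the real manifests; after taking backups, the same expression can be applied to the two calico YAML files named above, and the result should be reviewed by hand before restarting the stack:

```shell
# Sample probe block (hypothetical content for illustration only).
cat > /tmp/probe-sample.yaml <<'EOF'
          livenessProbe:
            exec:
              command:
                - /bin/calico-node
                - -felix-live
                - -bird-live
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - /bin/calico-node
                - -felix-ready
                - -bird-ready
            periodSeconds: 10
EOF
# Append "timeoutSeconds: 10" after each periodSeconds line, preserving
# the existing indentation (GNU sed).
sed -i 's/^\( *\)periodSeconds: 10$/&\n\1timeoutSeconds: 10/' /tmp/probe-sample.yaml
grep -c 'timeoutSeconds: 10' /tmp/probe-sample.yaml   # → 2
```

Note that this inserts the parameter after every matching periodSeconds line in the file, which is the desired outcome here since all of the Calico probes should receive the longer timeout.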
- After the stack restart, the calico-kube-controllers pod and all the calico-node pods will be recreated with a spec containing timeoutSeconds: 10.
# kubectl get deployment calico-kube-controllers -n kube-system -o yaml
...
readinessProbe:
exec:
command:
- /usr/bin/check-status
- -r
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
# kubectl get daemonset calico-node -n kube-system -o yaml
...
spec:
containers:
...
livenessProbe:
exec:
command:
- /bin/calico-node
- -felix-live
- -bird-live
failureThreshold: 6
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
name: calico-node
readinessProbe:
exec:
command:
- /bin/calico-node
- -felix-ready
- -bird-ready
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
The above change will not persist across cluster upgrades.
Workaround Option 2
Set the feature gate ExecProbeTimeout: false in the Dynamic Kubelet Config of the nodes. This reverts to the pre-v1.20 behaviour, where exec probe timeouts are ignored (as in K8s v1.19 and below).
When the ConfigMap is changed, every node running the pf9-kubelet service that consumes that configuration will detect the change, apply the updated settings, and restart the pf9-kubelet service. Once restarted, pf9-kubelet uses the new configuration from the ConfigMap.
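With this feature gate applied, the kubelet configuration carried in that ConfigMap would include a fragment along these lines (field names per the standard KubeletConfiguration API; verify against your PMK version):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ExecProbeTimeout: false   # revert to pre-v1.20 behaviour: exec probe timeouts ignored
```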
Reference: