Calico Pods Fail Liveness/Readiness Probes After the Exec Probe Timeout Fix in K8s v1.20

Problem

Starting with K8s v1.20, the exec probe timeout is actually enforced. As a result, Calico pods fail their liveness/readiness probes, because the default probe timeout is 1 second.

Environment

  • Platform9 Managed Kubernetes - K8s v1.20 and above

  • Calico CNI v3.18

Answer

  • Probe failures occur on both the calico-kube-controllers pod and all the calico-node pods because, starting with K8s v1.20, exec probe timeouts were fixed to actually respect the configured timeout values. The default timeout of 1 second can be too low on loaded clusters.

Info

From Calico v3.20 onward, the probe timeouts were increased from 1s to 10s. Part of the reason is that in K8s v1.20, exec probe timeouts were fixed to actually respect the timeout values.

# kubectl describe pod calico-node-25d6p -n kube-system
Name:                 calico-node-25d6p
Containers:
  calico-node:
    Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  6m5s (x124 over 9d)  kubelet  Readiness probe failed:

Workaround Option 1

  • On ALL master nodes in the cluster, add the parameter timeoutSeconds: 10 to the files /opt/pf9/pf9-kube/conf/networkapps/calico-v1.20.11.yaml and /opt/pf9/pf9-kube/conf/networkapps/calico-v1.20.11-configured.yaml. In each file, add it in two places: the probes of the calico-kube-controllers Deployment and the probes of the calico-node DaemonSet spec.

  • Perform a complete PMK stack restart on each master node, ONE NODE AT A TIME.

  • After the stack restart, the calico-kube-controllers pod and all the calico-node pods are recreated with a spec that includes timeoutSeconds=10.
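
As an illustration of the edit described above, the following is a minimal sketch of how the calico-node probes might look once timeoutSeconds: 10 has been added. The commands, delay, period and failure threshold are taken from the kubectl describe output shown earlier. The exact layout and surrounding fields in the PMK-shipped files are an assumption and may differ. The same timeoutSeconds field would be added to the probe definition(s) of the calico-kube-controllers Deployment.

```yaml
# Fragment of the calico-node container spec (not a complete manifest).
# Only the timeoutSeconds lines are new; the other values mirror the
# kubectl describe output above.
livenessProbe:
  exec:
    command:
    - /bin/calico-node
    - -felix-live
    - -bird-live
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 6
  timeoutSeconds: 10      # added: default is 1 second
readinessProbe:
  exec:
    command:
    - /bin/calico-node
    - -felix-ready
    - -bird-ready
  periodSeconds: 10
  timeoutSeconds: 10      # added: default is 1 second
```

Once the pods have been recreated, kubectl describe should report timeout=10s instead of timeout=1s on the Liveness and Readiness lines.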

Note

The changes made by the above procedure do not persist across a cluster upgrade.

Workaround Option 2

Set ExecProbeTimeout: false as a feature gate in the Dynamic Kubelet Config of the nodes. This reverts to the behaviour of K8s v1.19 and below, where exec probe timeouts were ignored.
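
A minimal sketch of the relevant part of a kubelet configuration with this feature gate disabled is shown below. It assumes the standard KubeletConfiguration format. The name of the ConfigMap that holds the configuration and the other fields it already contains are not shown here and should be left unchanged.

```yaml
# Fragment of a KubeletConfiguration (existing fields omitted).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # false restores the pre-v1.20 behaviour: the kubelet does not
  # enforce timeoutSeconds on exec probes.
  ExecProbeTimeout: false
```

Unlike Option 1, this setting affects every exec probe on the node, not only the Calico probes.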

Info

If changes are made to the ConfigMap, all nodes running the pf9-kubelet service that use that configuration detect the change in the dynamic kubelet configuration, apply it, and then restart the pf9-kubelet service. Once restarted, the pf9-kubelet service uses the new configuration from the ConfigMap.
