The old CronJob controller had a hardcoded, arbitrary limit: once a CronJob missed more than 100 start times, the controller stopped scheduling it entirely. Any kind of downtime counts toward this limit, including upgrades, control plane downtime, master node reboot tests, maintenance windows, and network outages.
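When a CronJob hits this limit, the old controller records a warning event on the CronJob object. A hedged sketch of how to look for it (the exact event wording varies slightly across Kubernetes versions, and `<cronjob-name>` is a placeholder):

```shell
# Inspect the CronJob's events for the missed-start error.
kubectl describe cronjob <cronjob-name>

# The old controller typically emits an event along these lines once
# the limit is exceeded (wording may differ by version):
#   Warning  FailedNeedsStart  cronjob-controller
#   Cannot determine if job needs to be started: too many missed start
#   time (> 100). Set or decrease .spec.startingDeadlineSeconds or
#   check clock skew
```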
A new CronJob V2 controller was added to Kubernetes upstream, which fixes this behavior along with other performance improvements. Link: 1.21-cronjob-ga.
The new controller became the default as a GA feature starting with K8s v1.21.
Fix
Starting with v5.3 LTS Patch #8: v-5.3.0-1762883, we have enabled the CronJobControllerV2 feature gate by default on the kube-controller-manager container for K8s v1.20 clusters.
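On a patched cluster, one way to verify the gate is active is to inspect the running kube-controller-manager process on a master node. This is a sketch, assuming shell access to the master node; the process layout is illustrative:

```shell
# On a master node, print the flags of the running kube-controller-manager
# process and filter for the feature gate.
ps aux | grep '[k]ube-controller-manager' | tr ' ' '\n' | grep feature-gates
# If the patch enabled the gate, the output should include:
#   --feature-gates=CronJobControllerV2=true
```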
To enable this feature manually on K8s v1.20 clusters running v5.3 LTS Patch #7: v-5.3.0-1739149 or below, follow the procedure below.
Warning
The change must be performed on all master nodes.
This is not supported on K8s v1.19.
Append "--feature-gates=CronJobControllerV2=true" to the command list in the kube-controller-manager container section. Here is how it should look:
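A sketch of the resulting container section (the manifest layout and surrounding flags are illustrative; keep your cluster's existing image and flags unchanged and only append the new line):

```yaml
containers:
  - name: kube-controller-manager
    image: ...                # keep your existing image
    command:
      - kube-controller-manager
      # ... your existing flags remain unchanged ...
      - "--feature-gates=CronJobControllerV2=true"   # appended line
```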
Warning
Running the above command will drain all pods/containers running on the node.
Wait a few minutes, then verify that the k8s-master Pod is running. If you see any errors, check the logs of the kube-controller-manager container in the k8s-master pod.
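The verification step above can be sketched with kubectl; the namespace and pod name below are assumptions, not from the original article:

```shell
# Check that the k8s-master pod is back up
# (namespace assumed to be kube-system).
kubectl get pods -n kube-system

# If the pod reports errors, inspect the controller-manager container's
# logs (pod name below is a placeholder).
kubectl logs <k8s-master-pod-name> -n kube-system -c kube-controller-manager
```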
Warning
On a multi-master cluster, apply the above steps to one master node at a time; otherwise etcd will lose quorum and the cluster will become unreachable.
After another short while, you should see that the CronJob has resumed scheduling.
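One way to confirm, assuming kubectl access (this command is a suggestion, not from the original article):

```shell
# The LAST SCHEDULE column should start advancing again once the
# feature gate is active.
kubectl get cronjobs --all-namespaces
```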