IP Reconciler CronJob Fails to Run Due to 100 Missed Start Times
Problem
The ip-reconciler CronJob fails to start, and the following events are recorded for the CronJob:
Events:
  Type     Reason            Age                     From                Message
  ----     ------            ----                    ----                -------
  Warning  FailedNeedsStart  68m (x19668 over 2d7h)  cronjob-controller  Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
  Warning  FailedNeedsStart  3m21s (x121 over 23m)   cronjob-controller  Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
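The events above can be retrieved by describing the CronJob, for example:
kubectl describe cronjob ip-reconciler -n kube-system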
Environment
- Platform9 Edge Cloud - 5.3 LTS Patch #7: v-5.3.0-1739149 and below
- Whereabouts
Answer
The old CronJob controller has a hardcoded, arbitrary limit of 100 missed start windows: once a CronJob has missed 100 scheduled start times in total, the controller stops scheduling it. Any kind of downtime contributes to reaching this limit, including upgrades, control plane downtime, master node reboot tests, maintenance, and network outages.
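If upgrading or enabling the new controller is not immediately possible, the event message itself points to an interim mitigation: setting .spec.startingDeadlineSeconds on the CronJob so that only misses within that window are counted. A minimal sketch, with an illustrative value chosen for the */5 schedule (not a tuned recommendation):
kubectl patch cronjob ip-reconciler -n kube-system --type merge -p '{"spec":{"startingDeadlineSeconds":200}}'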
A new CronJob V2 controller has been added to upstream Kubernetes; it fixes this behavior and brings other performance improvements. Link: 1.21-cronjob-ga.
The new controller is available as a GA feature and is the default starting with Kubernetes v1.21.
Starting with 5.3 LTS Patch #8 (v-5.3.0-1762883), we enable the CronJobControllerV2 feature gate on the kube-controller-manager container by default for Kubernetes v1.20 clusters.
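To confirm whether a given master node already has the gate in place, you can check the running kube-controller-manager flags and the manifest used in the procedure below (path shown for CentOS-based nodes), for example:
ps -ef | grep kube-controller-manager | grep CronJobControllerV2
grep feature-gates /opt/pf9/pf9-kube/conf/masterconfig/base/centos/master.yaml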
To manually enable this feature on Kubernetes v1.20 clusters running 5.3 LTS Patch #7 (v-5.3.0-1739149) or below, follow the procedure below.
The change must be performed on all master nodes.
This is not supported on Kubernetes v1.19.
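Before proceeding, you can confirm the cluster's Kubernetes version, for example:
kubectl version
The Server Version should report v1.20.x.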
systemctl stop pf9-hostagent
systemctl stop pf9-nodeletd
vim /opt/pf9/pf9-kube/conf/masterconfig/base/centos/master.yaml
Append the line below to the command list in the kube-controller-manager container section:
- "--feature-gates=CronJobControllerV2=true"
Here is how the section should look:
---
apiVersion: "v1"
kind: "Pod"
metadata:
  name: "k8s-master"
  namespace: "kube-system"
spec:
  hostNetwork: true
  containers:
    - name: "kube-controller-manager"
      image: "k8s.gcr.io/kube-controller-manager:__KUBERNETES_VERSION__"
      command:
        - "kube-controller-manager"
        - "--cloud-provider=__CLOUD_PROVIDER__"
        - "--kubeconfig=/srv/kubernetes/kubeconfigs/kube-controller-manager.yaml"
        - "--leader-elect=true"
        - "--profiling=false"
        - "--root-ca-file=/srv/kubernetes/certs/apiserver/ca.crt"
        - "--service-account-private-key-file=/srv/kubernetes/certs/apiserver/svcacct.key"
        - "--v=__DEBUG_LEVEL__"
        - "--horizontal-pod-autoscaler-use-rest-clients=true"
        - "--use-service-account-credentials=true"
        - "--feature-gates=CronJobControllerV2=true"
/opt/pf9/nodelet/nodeletd phases restart
Running the above command will drain all pods/containers running on the node.
Wait a few minutes, then verify that the k8s-master Pod is Running. If you see any errors, check the logs of the kube-controller-manager container in the k8s-master Pod.
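A minimal set of checks, assuming the default k8s-master static Pod naming (on your cluster the actual Pod name carries a node-specific suffix, shown here as a placeholder):
kubectl get pods -n kube-system -o wide | grep k8s-master
kubectl logs -n kube-system <k8s-master-pod-name> -c kube-controller-manager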
systemctl start pf9-hostagent
On a multi-master cluster, perform the above steps on one master node at a time; otherwise etcd can lose quorum and the cluster will become unreachable.
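Before moving on to the next master, you may want to confirm the node has rejoined the cluster and reports Ready, for example:
kubectl get nodes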
After a short while, you should see that the CronJob has resumed scheduling, for example via:
kubectl get cronjob ip-reconciler -n kube-system
$ kubectl describe cronjob ip-reconciler -n kube-system
Name:          ip-reconciler
Namespace:     kube-system
Labels:        app=whereabouts
               tier=node
Annotations:   <none>
Schedule:      */5 * * * *
Events:
  Type    Reason            Age                    From                Message
  ----    ------            ----                   ----                -------
  Normal  SuccessfulCreate  4m8s (x917 over 3d7h)  cronjob-controller  Created job ip-reconciler-27678015