IP Reconciler CronJob Fails to Run Due to 100 Failed Attempts

Problem

The IP Reconciler CronJob fails to start, and the following events are recorded for it.

Events:
  Type     Reason            Age                     From                Message
  ----     ------            ----                    ----                -------
  Warning  FailedNeedsStart  68m (x19668 over 2d7h)  cronjob-controller  Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
  Warning  FailedNeedsStart  3m21s (x121 over 23m)   cronjob-controller  Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew

Environment

  • Platform9 Edge Cloud - 5.3 LTS Patch #7: v-5.3.0-1739149 and below

  • Whereabouts

Answer

The old CronJob controller had a hardcoded, arbitrary limit of 100: it stops scheduling a CronJob once more than 100 start times have been missed since the CronJob was last scheduled. Any kind of downtime contributes toward this limit, including upgrades, control plane downtime, master node reboot tests, maintenance, and network outages.
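
To give a sense of how quickly this limit is reached, the arithmetic can be sketched as follows. The 5-minute interval is only an assumption for illustration; substitute the interval from the CronJob's own .spec.schedule.

```shell
#!/bin/sh
# Assumption: the CronJob runs every 5 minutes. Replace this with the
# interval taken from the CronJob's .spec.schedule.
interval_minutes=5

# Limit quoted in the FailedNeedsStart event ("> 100").
missed_limit=100

# Length of time without scheduling after which the limit is exceeded.
downtime_minutes=$((interval_minutes * missed_limit))

echo "Limit exceeded after about $((downtime_minutes / 60))h $((downtime_minutes % 60))m without scheduling"
```

With a 5-minute interval the limit is passed after roughly 8 hours 20 minutes, so a single extended maintenance window can be enough to trigger the events shown above.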

Kubernetes upstream added a new CronJob V2 controller, which fixes this limitation and also brings performance improvements. Link: 1.21-cronjob-ga.

The V2 controller is enabled by default only from K8s v1.21 onward, the release in which CronJob became GA.

To enable this feature manually on K8s v1.20 clusters running 5.3 LTS Patch #7: v-5.3.0-1739149 and below, follow the procedure below.

Append "--feature-gates=CronJobControllerV2=true" to the command list in the kube-controller-manager container section. Here is how it should look:
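
A minimal sketch of an edited command list, together with a quick check that the flag is present, is given here. The excerpt is hypothetical: the first command entry and the elided flags are assumptions, and the real container section on a cluster will contain more entries.

```shell
#!/bin/sh
# Hypothetical excerpt of the kube-controller-manager container section.
# The existing flags are elided and will differ on a real cluster.
manifest_excerpt=$(cat <<'EOF'
- name: kube-controller-manager
  command:
  - kube-controller-manager
  # ...existing flags stay unchanged...
  - "--feature-gates=CronJobControllerV2=true"
EOF
)

# Confirm that the feature gate appears in the command list.
if printf '%s\n' "$manifest_excerpt" | grep -q -- '--feature-gates=CronJobControllerV2=true'; then
  gate_status="present"
else
  gate_status="missing"
fi

echo "CronJobControllerV2 feature gate: $gate_status"
```

The same grep can be pointed at the actual manifest file after editing it, to catch a typo in the flag before waiting for the Pod to restart.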

Wait a short while, then verify that the k8s-master Pod is running. If you see any errors, check the logs of the kube-controller-manager container in the k8s-master Pod.
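
One way to run these checks with kubectl is sketched below. The kube-system namespace and the helper function names are assumptions made for illustration; adjust them to match the cluster.

```shell
#!/bin/sh
# Assumption: the k8s-master Pod runs in the kube-system namespace.
NAMESPACE="${NAMESPACE:-kube-system}"

# List Pods whose name contains "k8s-master", with their status.
list_master_pods() {
  kubectl get pods -n "$NAMESPACE" | grep k8s-master
}

# Print recent logs from the kube-controller-manager container of one Pod.
# $1 is a Pod name taken from the output of list_master_pods.
controller_manager_logs() {
  kubectl logs -n "$NAMESPACE" "$1" -c kube-controller-manager --tail=50
}
```

Run list_master_pods first, then pass one of the listed Pod names to controller_manager_logs if the Pod is not in the Running state.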

After another short while, you should see that the CronJob has resumed scheduling, via
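
One possible way to check this with kubectl is sketched below. The CronJob name "ip-reconciler", the kube-system namespace, and the helper function names are assumptions; substitute the values used on the cluster.

```shell
#!/bin/sh
# Assumptions: the CronJob is named "ip-reconciler" and lives in kube-system.
CRONJOB_NAME="${CRONJOB_NAME:-ip-reconciler}"
CRONJOB_NAMESPACE="${CRONJOB_NAMESPACE:-kube-system}"

# Show the CronJob summary, including the LAST SCHEDULE column.
show_cronjob() {
  kubectl get cronjob "$CRONJOB_NAME" -n "$CRONJOB_NAMESPACE"
}

# Show the CronJob's details and recent events.
show_cronjob_events() {
  kubectl describe cronjob "$CRONJOB_NAME" -n "$CRONJOB_NAMESPACE"
}
```

A recent value in the LAST SCHEDULE column, and no new FailedNeedsStart warnings in the events, would indicate that scheduling has resumed.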
