The old CronJob controller had a hardcoded, arbitrary limit: once a CronJob missed more than 100 start times, the controller stopped scheduling it entirely. Any kind of downtime counts toward this limit, including upgrades, control plane downtime, master node reboot tests, maintenance windows, and network outages.
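When a CronJob hits this limit, the old controller records a warning event on the CronJob object. A hedged sketch of how to look for it (the exact event wording varies slightly across Kubernetes versions, and `<cronjob-name>` is a placeholder):

```shell
# Inspect the CronJob's events for the missed-start error.
kubectl describe cronjob <cronjob-name>

# The old controller typically emits an event along these lines once
# the limit is exceeded (wording may differ by version):
#   Warning  FailedNeedsStart  cronjob-controller
#   Cannot determine if job needs to be started: too many missed start
#   time (> 100). Set or decrease .spec.startingDeadlineSeconds or
#   check clock skew
```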
A new CronJob V2 controller was added to Kubernetes upstream, which fixes this behavior along with other performance improvements. Link: 1.21-cronjob-ga.
The new controller became the default as a GA feature starting with K8s v1.21.
Fix
Starting with v5.3 LTS Patch #8: v-5.3.0-1762883, we have enabled the CronJobControllerV2 feature gate by default on the kube-controller-manager container for K8s v1.20 clusters.
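On a patched cluster, one way to verify the gate is active is to inspect the running kube-controller-manager process on a master node. This is a sketch, assuming shell access to the master node; the process layout is illustrative:

```shell
# On a master node, print the flags of the running kube-controller-manager
# process and filter for the feature gate.
ps aux | grep '[k]ube-controller-manager' | tr ' ' '\n' | grep feature-gates
# If the patch enabled the gate, the output should include:
#   --feature-gates=CronJobControllerV2=true
```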
To enable this feature manually on K8s v1.20 clusters running v5.3 LTS Patch #7: v-5.3.0-1739149 or below, follow the procedure below.
Warning
The change must be performed on all master nodes.
This is not supported on K8s v1.19.
Append "--feature-gates=CronJobControllerV2=true" to the command list in the kube-controller-manager container section. Here is how it should look:
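A sketch of the resulting container section (the manifest layout and surrounding flags are illustrative; keep your cluster's existing image and flags unchanged and only append the new line):

```yaml
containers:
  - name: kube-controller-manager
    image: ...                # keep your existing image
    command:
      - kube-controller-manager
      # ... your existing flags remain unchanged ...
      - "--feature-gates=CronJobControllerV2=true"   # appended line
```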
Warning
Running the above command will drain all pods/containers running on the node.
Wait a few minutes, then verify that the k8s-master Pod is running. If you see any errors, check the logs of the kube-controller-manager container in the k8s-master pod.
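The verification step above can be sketched with kubectl; the namespace and pod name below are assumptions, not from the original article:

```shell
# Check that the k8s-master pod is back up
# (namespace assumed to be kube-system).
kubectl get pods -n kube-system

# If the pod reports errors, inspect the controller-manager container's
# logs (pod name below is a placeholder).
kubectl logs <k8s-master-pod-name> -n kube-system -c kube-controller-manager
```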
Warning
On a multi-master cluster, apply the above steps to one master node at a time; otherwise etcd will lose quorum and the cluster will become unreachable.
After another short while, you should see that the CronJob has resumed scheduling.
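One way to confirm, assuming kubectl access (this command is a suggestion, not from the original article):

```shell
# The LAST SCHEDULE column should start advancing again once the
# feature gate is active.
kubectl get cronjobs --all-namespaces
```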