IP Reconciler CronJob Fails to Run Due to 100 Missed Start Times
Problem
The ip-reconciler CronJob fails to start, and the following events are recorded for the CronJob:
Events:
  Type     Reason            Age                     From                Message
  ----     ------            ----                    ----                -------
  Warning  FailedNeedsStart  68m (x19668 over 2d7h)  cronjob-controller  Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
  Warning  FailedNeedsStart  3m21s (x121 over 23m)   cronjob-controller  Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
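The events above can be retrieved by describing the CronJob, for example:
kubectl describe cronjob ip-reconciler -n kube-system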
Environment
- Platform9 Edge Cloud - 5.3 LTS Patch #7: v-5.3.0-1739149 and below
- Whereabouts
Answer
The old CronJob controller has a hardcoded, arbitrary limit of 100 missed start windows: once a CronJob has missed 100 scheduled start times in total, the controller stops scheduling it. Any kind of downtime contributes to reaching this limit, including upgrades, control plane downtime, master node reboot tests, maintenance, and network outages.
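If upgrading or enabling the new controller is not immediately possible, the event message itself points to an interim mitigation: setting .spec.startingDeadlineSeconds on the CronJob so that only misses within that window are counted. A minimal sketch, with an illustrative value chosen for the */5 schedule (not a tuned recommendation):
kubectl patch cronjob ip-reconciler -n kube-system --type merge -p '{"spec":{"startingDeadlineSeconds":200}}'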
A new CronJob V2 controller has been added to upstream Kubernetes; it fixes this behavior and brings other performance improvements. Link: 1.21-cronjob-ga.
The new controller is available as a GA feature and is the default starting with Kubernetes v1.21.
Starting with 5.3 LTS Patch #8 (v-5.3.0-1762883), we enable the CronJobControllerV2 feature gate on the kube-controller-manager container by default for Kubernetes v1.20 clusters.
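To confirm whether a given master node already has the gate in place, you can check the running kube-controller-manager flags and the manifest used in the procedure below (path shown for CentOS-based nodes), for example:
ps -ef | grep kube-controller-manager | grep CronJobControllerV2
grep feature-gates /opt/pf9/pf9-kube/conf/masterconfig/base/centos/master.yaml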
To manually enable this feature on Kubernetes v1.20 clusters running 5.3 LTS Patch #7 (v-5.3.0-1739149) or below, follow the procedure below.
The change must be performed on all master nodes.
This is not supported on Kubernetes v1.19.
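Before proceeding, you can confirm the cluster's Kubernetes version, for example:
kubectl version
The Server Version should report v1.20.x.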
systemctl stop pf9-hostagent
systemctl stop pf9-nodeletd
vim /opt/pf9/pf9-kube/conf/masterconfig/base/centos/master.yaml
Append the line below to the command list in the kube-controller-manager container section:
- "--feature-gates=CronJobControllerV2=true"
Here is how the section should look:
---
apiVersion: "v1"
kind: "Pod"
metadata:
  name: "k8s-master"
  namespace: "kube-system"
spec:
  hostNetwork: true
  containers:
    - name: "kube-controller-manager"
      image: "k8s.gcr.io/kube-controller-manager:__KUBERNETES_VERSION__"
      command:
        - "kube-controller-manager"
        - "--cloud-provider=__CLOUD_PROVIDER__"
        - "--kubeconfig=/srv/kubernetes/kubeconfigs/kube-controller-manager.yaml"
        - "--leader-elect=true"
        - "--profiling=false"
        - "--root-ca-file=/srv/kubernetes/certs/apiserver/ca.crt"
        - "--service-account-private-key-file=/srv/kubernetes/certs/apiserver/svcacct.key"
        - "--v=__DEBUG_LEVEL__"
        - "--horizontal-pod-autoscaler-use-rest-clients=true"
        - "--use-service-account-credentials=true"
        - "--feature-gates=CronJobControllerV2=true"
/opt/pf9/nodelet/nodeletd phases restart
Running the above command will drain all pods/containers running on the node.
Wait a few minutes, then verify that the k8s-master Pod is Running. If you see any errors, check the logs of the kube-controller-manager container in the k8s-master Pod.
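A minimal set of checks, assuming the default k8s-master static Pod naming (on your cluster the actual Pod name carries a node-specific suffix, shown here as a placeholder):
kubectl get pods -n kube-system -o wide | grep k8s-master
kubectl logs -n kube-system <k8s-master-pod-name> -c kube-controller-manager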
systemctl start pf9-hostagent
On a multi-master cluster, perform the above steps on one master node at a time; otherwise etcd can lose quorum and the cluster will become unreachable.
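Before moving on to the next master, you may want to confirm the node has rejoined the cluster and reports Ready, for example:
kubectl get nodes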
After a short while, you should see that the CronJob has resumed scheduling, for example via:
kubectl get cronjob ip-reconciler -n kube-system
$ kubectl describe cronjob ip-reconciler -n kube-system
Name:          ip-reconciler
Namespace:     kube-system
Labels:        app=whereabouts
               tier=node
Annotations:   <none>
Schedule:      */5 * * * *
Events:
  Type    Reason            Age                    From                Message
  ----    ------            ----                   ----                -------
  Normal  SuccessfulCreate  4m8s (x917 over 3d7h)  cronjob-controller  Created job ip-reconciler-27678015