ETCD Backup Cronjob Fails and Job Pods Report the Status as 'NotReady'
Problem
- The etcd-backup-with-interval cronjob in the kube-system namespace fails, and the job pod created during the cron execution reports the status as 'NotReady'.
- ETCD debug logs show the following message:
transport: loopyWriter.run returning. connection error: desc = "transport is closing"
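To confirm the symptom, you can list the pods created by the cronjob and inspect the affected pod; the grep pattern and pod name placeholder below are illustrative:
$ kubectl get pods -n kube-system | grep etcd-backup
$ kubectl describe pod <etcd-backup-pod-name> -n kube-system
$ kubectl logs <etcd-backup-pod-name> -n kube-system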
Environment
- Platform9 Managed Kubernetes v5.6.8 and higher.
Cause
- ETCD uses gRPC calls, and this error message means that the connection the RPC was using was closed.
- This can happen for any of the following reasons:
- Misconfigured transport credentials, causing the connection to fail during handshaking.
- Bytes disrupted, possibly by a proxy in between.
- Server shutdown.
- Keepalive parameters caused connection shutdown, for example if you have configured your server to terminate connections regularly to trigger DNS lookups. If this is the case, you may want to increase your MaxConnectionAgeGrace, to allow longer RPC calls to finish.
- ETCD leader elections can also cause transient failures (see the health check after this list).
- ETCD took too long to process the request and eventually hit a timeout.
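If etcdctl is available on a master node, a quick health and leader check can help rule out an unhealthy cluster or an ongoing leader election. The endpoint and certificate paths below are placeholders and must be adjusted to your cluster:
$ ETCDCTL_API=3 etcdctl --endpoints=<etcd-endpoint> --cacert=<ca-cert> --cert=<client-cert> --key=<client-key> endpoint health
$ ETCDCTL_API=3 etcdctl --endpoints=<etcd-endpoint> --cacert=<ca-cert> --cert=<client-cert> --key=<client-key> endpoint status --write-out=table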
Resolution
- List the jobs (not cronjobs) in the kube-system namespace:
$ kubectl get jobs -n kube-system
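For reference, a stuck job looks similar to the illustrative output below (the job name, duration, and age are examples only); its COMPLETIONS column stays at 0/1 and it never reaches Completed:
NAME                                 COMPLETIONS   DURATION   AGE
etcd-backup-with-interval-27912345   0/1           6h2m       6h2m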
- Delete all the jobs that are reporting the status as "0/1" but are not Completed:
$ kubectl delete job <job-name> -n kube-system
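Optionally, trigger a one-off run from the cronjob to confirm that new backup jobs now complete; the job name etcd-backup-manual-test is an arbitrary example:
$ kubectl create job --from=cronjob/etcd-backup-with-interval etcd-backup-manual-test -n kube-system
$ kubectl get jobs -n kube-system -w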
Additional Information
- Currently, Catapult monitoring triggers an alert when a job fails outright, but it does not trigger one in this case because the job never reaches a Failed state; it keeps running while its pod repeatedly fails.
- An internal bug, PMK-6340, has been filed for Catapult monitoring to send an alert in this scenario as well.