ETCD Backup Cronjob Fails and Job Pods Report the Status as 'NotReady'

Problem

  • The etcd-backup-with-interval cronjob in the kube-system namespace fails, and the job pod created during each cron execution reports the status 'NotReady'.
  • The ETCD debug logs show the following message:

transport: loopyWriter.run returning. connection error: desc = "transport is closing"
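
To confirm the symptom, list the job pods on the cluster. This is a sketch, assuming kubectl access to the cluster and that the job pod names share the cronjob's etcd-backup prefix:

kubectl get pods -n kube-system | grep etcd-backup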

Environment

  • Platform9 Managed Kubernetes v5.6.8 and higher.

Cause

  • ETCD uses gRPC calls, and this error message means that the connection the RPC was using was closed.
  • This can happen for any of the following reasons:
    1. Misconfigured transport credentials caused the connection to fail during the handshake.
    2. The byte stream was disrupted, possibly by a proxy in between.
    3. The server shut down.
    4. Keepalive parameters caused the connection to be shut down, for example if the server is configured to terminate connections regularly to trigger DNS lookups. In that case, consider increasing MaxConnectionAgeGrace to allow longer RPC calls to finish.
    5. ETCD leader elections can also cause transient failures (a quick cluster health check is shown after this list).
    6. ETCD took too long to process the request and eventually hit a timeout.
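
To rule out causes 5 and 6, you can check cluster health and leader status with the etcdctl v3 API from a master node. This is only a sketch: the endpoint (the etcd default port 2379 is shown) and the certificate paths are assumptions and will differ per deployment.

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=<ca.crt> --cert=<etcd.crt> --key=<etcd.key> endpoint health

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=<ca.crt> --cert=<etcd.crt> --key=<etcd.key> endpoint status --write-out=table

The status output includes the current leader and backend latency, which helps distinguish a transient leader election from a consistently slow member.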

Resolution

  • List the jobs (not the cronjobs) in the kube-system namespace:
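For example, assuming kubectl access to the cluster:

kubectl get jobs -n kube-system

Jobs created by the cronjob whose COMPLETIONS column stays at 0/1 long after creation are the ones stuck in this state.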
  • Delete all the jobs that report the status "0/1" but are not Completed:
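One way to do this is sketched below, assuming the stuck jobs are the ones listed with 0/1 COMPLETIONS in the previous step; review that list first so that a job which has only just started is not deleted:

kubectl get jobs -n kube-system --no-headers | awk '$2 == "0/1" {print $1}' | xargs -r kubectl delete job -n kube-system

Deleting the jobs also removes their NotReady pods; the cronjob will create fresh jobs on its next scheduled run.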

Additional Information

  • Currently, Catapult monitoring triggers an alert when a job fails, but it does not alert in this case because the job never reaches the Failed state; it stays running while failing repeatedly.
  • An internal bug, PMK-6340, has been filed for Catapult monitoring to raise an alert in this scenario as well.