Pod "etcd-backup-with-interval-" in "NotReady" State

Problem

One or more etcd-backup-with-interval- pods in the kube-system namespace are in a NotReady state, e.g.

$ kubectl get pods -n kube-system | grep etcd
etcd-backup-with-interval-28050660--1-kd7pg 1/2 NotReady 0 17m

The pod is associated with a job which is showing 0/1 completions, e.g.

$ kubectl get jobs -n kube-system | grep 28050660
etcd-backup-with-interval-28050660   0/1           245d       245d

The pod log shows only that a temporary DB file was created, with no further output, e.g.

{"level":"info","ts":1704221964.2963,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/backup/etcd-snapshot-2024-01-02_18:59:24_UTC.db.part"}

A kubectl describe of the pod shows that the etcd-backup container is still in the Running state, e.g.

etcd-backup:
...
    Image:         gcr.io/etcd-development/etcd:v3.4.14
...
    Command:
      /bin/sh
    Args:
      -c
      etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db
    State:          Running
      Started:      Tue, 02 Jan 2024 12:59:24 -0600
    Ready:          True
    Restart Count:  0

Environment

  • Platform9 Managed Kubernetes – v5.7 and higher

Cause

The etcdctl snapshot save command hangs or fails to complete because the job's container spec is missing its Environment section. The variables in that section supply the flags that the etcdctl command-line utility needs for TLS authentication.

Thus, the job never transitions to Succeeded or Failed and the pod will continue to be re-created.
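
To confirm this cause on an affected cluster, one option is to list the environment variable names defined on the job's etcd-backup container. The sketch below assumes the TLS settings are passed as ETCDCTL_-prefixed variables (etcdctl reads its flags from variables with that prefix, for example ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY). The exact set used by the updated spec is not shown in this article, and backup_job_env_names and has_etcdctl_env are hypothetical helper names.

```shell
# Print the env variable names defined on the job's etcd-backup container.
# $1 = job name. The namespace and container name come from the output above.
backup_job_env_names() {
  kubectl get job "$1" -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="etcd-backup")].env[*].name}'
}

# Read a list of variable names on stdin and report whether any
# ETCDCTL_* variable is present (assumption: TLS flags are passed that way).
has_etcdctl_env() {
  if grep -q 'ETCDCTL_'; then
    echo "present"
  else
    echo "missing"
  fi
}

# Usage (needs cluster access):
#   backup_job_env_names etcd-backup-with-interval-28050660 | has_etcdctl_env
```

A result of "missing" for a job that shows 0/1 completions is consistent with the cause described above.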

Resolution

  1. Delete the job that is associated with the pod.

  2. The pod is then terminated, and a new pod is created under a new job resource that uses an updated spec template.
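
The steps above can be sketched as follows. This is a minimal example rather than an official procedure: incomplete_backup_jobs is a hypothetical helper that filters the output of kubectl get jobs for etcd-backup-with-interval- jobs with zero completions. A backup that has only just started also shows 0/1, so check the age of each job before deleting it.

```shell
# Read `kubectl get jobs --no-headers` output on stdin and print the names of
# etcd-backup-with-interval-* jobs whose COMPLETIONS column is 0/N.
incomplete_backup_jobs() {
  awk '$1 ~ /^etcd-backup-with-interval-/ && $2 ~ /^0\// { print $1 }'
}

# Usage (needs cluster access); review the list before deleting anything:
#   kubectl get jobs -n kube-system --no-headers | incomplete_backup_jobs
#   kubectl delete job etcd-backup-with-interval-28050660 -n kube-system
```

Deleting the job also removes the pod it owns, which matches step 2.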
