Cluster Upgrade From 1.20 to 1.21 Fails Due to ETCD Corruption

Problem

During a cluster upgrade from 1.20 to 1.21 on the PMK-5.5 Management Plane, the ETCD environment variable entries are missing from the Sunpike host object.

The nodelet logs show the following errors:

Nodelet log
    
--- /opt/pf9/pf9-kube/phases/etcd_configure.sh start at 2023-04-24 17:42:40 ---
[2023-04-24 17:42:40] Ensuring etcd data is stored on host
[2023-04-24 17:42:40] Error: No such object: etcd
[2023-04-24 17:42:40] Skipping; etcd container does not exist
--- /opt/pf9/pf9-kube/phases/etcd_run.sh start at 2023-04-24 17:42:40 ---
[2023-04-24 17:42:41] Node endpoint is 172.20.58.9
[2023-04-24 17:42:41] Deriving local etcd environment
[2023-04-24 17:42:41] Ensuring container 'etcd' is destroyed
[2023-04-24 17:42:41]
[2023-04-24 17:42:56] {"level":"warn","ts":"2023-04-24T15:42:56.941Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-57c1/localhost:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Errorwhile dialing dial tcp 127.0.0.1:4001: connect: connection refused\""}
[2023-04-24 17:42:56] https://localhost:4001 is unhealthy: failed to commit proposal: context deadline exceeded
[2023-04-24 17:42:56] Error: unhealthy cluster
[2023-04-24 17:42:57] Waiting for healthy etcd cluster.
[2023-04-24 17:43:12] {"level":"warn","ts":"2023-04-24T15:43:12.410Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-444/localhost:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Errorwhile dialing dial tcp 127.0.0.1:4001: connect: connection refused\""}
[2023-04-24 17:43:12] https://localhost:4001 is unhealthy: failed to commit proposal: context deadline exceeded
[2023-04-24 17:43:12] Error: unhealthy cluster
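To check whether a nodelet log shows the same pattern, the messages in the excerpt above can be counted with a short script. This is a minimal sketch, not a Platform9 tool: the function name `summarize_nodelet_log` and the symptom labels are hypothetical, and treating these three messages as indicators of this issue is an assumption based only on the excerpt.

```python
import re

# Messages copied from the nodelet log excerpt in this article. Treating them
# as symptoms of this issue is an assumption, not an official signature.
SYMPTOMS = {
    "etcd_container_missing": "Skipping; etcd container does not exist",
    "connection_refused": "connect: connection refused",
    "unhealthy_cluster": "Error: unhealthy cluster",
}

# Matches the bracketed timestamps used by the nodelet phase scripts.
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def summarize_nodelet_log(text):
    """Count each symptom and report the first and last timestamp in the text."""
    counts = {name: text.count(message) for name, message in SYMPTOMS.items()}
    stamps = TIMESTAMP.findall(text)
    return {
        "counts": counts,
        "first_timestamp": stamps[0] if stamps else None,
        "last_timestamp": stamps[-1] if stamps else None,
    }


# A shortened sample built from lines in the excerpt.
sample = (
    "[2023-04-24 17:42:40] Skipping; etcd container does not exist\n"
    "[2023-04-24 17:42:56] https://localhost:4001 is unhealthy: "
    "failed to commit proposal: context deadline exceeded\n"
    "[2023-04-24 17:42:56] Error: unhealthy cluster\n"
    "[2023-04-24 17:43:12] Error: unhealthy cluster\n"
)
summary = summarize_nodelet_log(sample)
print(summary)
```

Because the function counts substrings rather than reading line by line, it also works when the log has been pasted as a single unbroken line.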

Environment

Platform9 Managed Kubernetes - v5.5.
Kubernetes version 1.20.

Answer

This is a known issue. The Platform9 Engineering team is working on it, and the fix is expected in the PMK-5.10 release.

Additional Information

To track the progress of the fix, open a support ticket that mentions Jira ID PMK-5803.
