Cluster Upgrade From 1.20 to 1.21 Fails Due to etcd Corruption
Problem
The etcd environment variable entries are missing from the Sunpike host object during the cluster upgrade from 1.20 to 1.21 on the PMK-5.5 Management Plane.
The nodelet logs show the following errors:
--- /opt/pf9/pf9-kube/phases/etcd_configure.sh start at 2023-04-24 17:42:40 ---
[2023-04-24 17:42:40] Ensuring etcd data is stored on host
[2023-04-24 17:42:40] Error: No such object: etcd
[2023-04-24 17:42:40] Skipping; etcd container does not exist
--- /opt/pf9/pf9-kube/phases/etcd_run.sh start at 2023-04-24 17:42:40 ---
[2023-04-24 17:42:41] Node endpoint is 172.20.58.9
[2023-04-24 17:42:41] Deriving local etcd environment
[2023-04-24 17:42:41] Ensuring container 'etcd' is destroyed
[2023-04-24 17:42:41]
[2023-04-24 17:42:56] {"level":"warn","ts":"2023-04-24T15:42:56.941Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-57c1/localhost:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused\""}
[2023-04-24 17:42:56] https://localhost:4001 is unhealthy: failed to commit proposal: context deadline exceeded
[2023-04-24 17:42:56] Error: unhealthy cluster
[2023-04-24 17:42:57] Waiting for healthy etcd cluster.
[2023-04-24 17:43:12] {"level":"warn","ts":"2023-04-24T15:43:12.410Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-444/localhost:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused\""}
[2023-04-24 17:43:12] https://localhost:4001 is unhealthy: failed to commit proposal: context deadline exceeded
[2023-04-24 17:43:12] Error: unhealthy cluster
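To check whether a node's nodelet log resembles the excerpt above, a minimal Python sketch such as the following can count the distinctive error strings. The function names and the choice of strings are illustrative assumptions, not part of any Platform9 tooling. A match only indicates that the log looks like this excerpt; it does not confirm the root cause.

```python
# Hypothetical helper: the signature strings are copied from the log excerpt
# in this article; the names below are illustrative, not a Platform9 API.
SIGNATURES = {
    "etcd_container_missing": "Error: No such object: etcd",
    "etcd_unhealthy": "Error: unhealthy cluster",
    "etcd_connection_refused": "connect: connection refused",
}


def find_signatures(log_text):
    """Return a dict mapping each signature name to its occurrence count."""
    return {name: log_text.count(text) for name, text in SIGNATURES.items()}


def matches_known_issue(log_text):
    """True when the log shows both a missing etcd container and an unhealthy cluster."""
    counts = find_signatures(log_text)
    return counts["etcd_container_missing"] > 0 and counts["etcd_unhealthy"] > 0
```

Read the nodelet log from wherever it is stored on the affected node and pass its contents to `matches_known_issue`. The log location is not stated in this article, so it is left to the reader.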
Environment
- Platform9 Managed Kubernetes - v5.5.
- Kubernetes version 1.20.
Answer
This is a known issue. The Platform9 Engineering team is working on it, and the fix is expected in the PMK-5.10 release.
Additional Information
To track the progress of the fix, open a support ticket that references Jira ID PMK-5803.