Upgrading k8s cluster from v1.19 through v1.20 to v1.21 fails to enable monitoring (OLM pods)
Problem
After upgrading a Platform9 Managed Kubernetes cluster from v1.19 to v1.20 and then to v1.21, the old OLM pods are not cleaned up, which prevents monitoring from being enabled. After the upgrade to v1.21, the Grafana link is not visible.
Environment
- Platform9 Managed Kubernetes - v1.19.0 and lower.
Cause
If the customer deployed monitoring with Appbert on a cluster (say, 1.19) before OLM was removed, pods were deployed in both the pf9-olm and pf9-monitoring namespaces.
If the customer then upgrades to 1.20, the old pods remain, because the monitoring stack is not upgraded once it has been deployed.
When the same cluster is upgraded to 1.21, the upgrade tries to auto-migrate monitoring from Appbert to addon mgmt. This fails because the OLM pods are still present.
The upgrade path to 1.21 in pf9-kube creates a ClusterAddon object for monitoring, but its deployment errors out because of the OLM pods. The UI checks the health of the ClusterAddon object before showing the Grafana link. Since the object is in an error state, the UI does not show the link, even though the monitoring stack is up and running.
Resolution
Addon operator Fix
The fix is to detect whether OLM is still present before deploying monitoring through addon mgmt, delete the OLM and pf9-monitoring pods, and then continue with the rest of the monitoring stack. Note that this also means customisations are wiped out while migrating to 1.21.
This will be fixed in 5.5, and the fix will be backported to 1.21, so the new pf9-kube:1.21 released for 5.4 should include it. This page will be updated with the build number once it is available.
Workaround
A possible workaround is to:
From the UI, delete the monitoring stack on the 1.20 cluster before upgrading to 1.21.
- This way Appbert removes all pods (in both pf9-olm and pf9-monitoring). Make sure no pods remain in pf9-olm or pf9-monitoring.
Upgrade to 1.21 and install monitoring again through edit cluster.
- This way addon mgmt deploys monitoring and the Grafana link should be visible.
Note that if the user has made customisations, such as Alertmanager targets or Prometheus rules, they need to be backed up first, because they will be wiped out when Appbert removes the old stack.
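The backup and the "no pods left" check above can be scripted. The following is a minimal sketch, not an official Platform9 tool. The function names are hypothetical, and the default resource kinds assume the monitoring stack uses Prometheus Operator custom resources, so adjust the list to whatever the cluster actually contains. Alertmanager configuration may also be stored in a Secret, in which case pass `secrets` explicitly.

```shell
#!/bin/sh
# Sketch only: helper names and default resource kinds are assumptions.

# Save the given resource kinds from pf9-monitoring as YAML files.
# usage: backup_monitoring <out_dir> [kind ...]
backup_monitoring() {
  out_dir=$1
  shift
  # Default kinds assume Prometheus Operator CRDs are installed.
  [ "$#" -gt 0 ] || set -- prometheusrules servicemonitors alertmanagers
  mkdir -p "$out_dir" || return 1
  for kind in "$@"; do
    # Stop at the first kind that cannot be read (for example, a missing CRD).
    kubectl -n pf9-monitoring get "$kind" -o yaml > "$out_dir/$kind.yaml" || return 1
  done
}

# Succeed only when neither namespace has any pods left.
# Returns 1 if pods remain, 2 if kubectl itself failed.
# usage: monitoring_pods_gone
monitoring_pods_gone() {
  for ns in pf9-olm pf9-monitoring; do
    pods=$(kubectl get pods -n "$ns" --no-headers) || return 2
    if [ -n "$pods" ]; then
      echo "pods still present in $ns:" >&2
      echo "$pods" >&2
      return 1
    fi
  done
  return 0
}

# Example (run against the 1.20 cluster):
#   backup_monitoring ./monitoring-backup      # before deleting the stack
#   monitoring_pods_gone && echo "safe to upgrade to 1.21"   # after deleting it
```

Run the backup before deleting the monitoring stack from the UI, and the pod check afterwards, before starting the upgrade to 1.21.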
Additional Information
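If the user has already tried upgrading to pf9-kube 1.21 and the cluster is in an error state:
- Remove the monitoring ClusterAddon object from the DU. It will already be in an error state, so it can be removed immediately without blocking on a finalizer.
- Remove the monitoring Addon object from the cluster. List it first to find its name:
kubectl -n pf9-addons get addons | grep monitoring
Then delete it, substituting the name returned by the previous command:
kubectl -n pf9-addons delete addons <monitoring-addon-name>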
The addon operator will then remove OLM and the pf9-monitoring namespace.
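To confirm that the cleanup has finished before reinstalling monitoring, the namespaces can be polled until they disappear. This is a rough sketch under a few assumptions: the function name is hypothetical, OLM is taken to live in the pf9-olm namespace as described in the Cause section, and any kubectl failure (including a connectivity problem) is treated as "namespace not found".

```shell
#!/bin/sh
# Sketch only: wait_ns_removed is a hypothetical helper, not a Platform9 command.

# Poll until the namespace no longer exists or the timeout runs out.
# Note: any non-zero exit from kubectl is treated as "namespace is gone".
# usage: wait_ns_removed <namespace> [timeout_seconds]
wait_ns_removed() {
  ns=$1
  remaining=${2:-300}
  while kubectl get namespace "$ns" >/dev/null 2>&1; do
    [ "$remaining" -gt 0 ] || { echo "$ns still exists" >&2; return 1; }
    sleep 5
    remaining=$((remaining - 5))
  done
  echo "$ns removed"
}

# Example:
#   wait_ns_removed pf9-olm && wait_ns_removed pf9-monitoring
```

Once both calls succeed, monitoring can be installed again through edit cluster.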