While Restoring LTS2-Patch2 On SMCP, Management Plane Cluster Backup-Restore Process Fails.
Problem
While restoring LTS2-Patch2 [v-5.6.7-2624593] to SMCP, the restore step fails with the following error:
# airctl restore --backupdir /root/ --config /opt/pf9/airctl/conf/airctl-config.yaml --verbose
2023-09-22T14:01:25.353Z info restoring mysql
2023-09-22T14:01:25.353Z info state file does not contain SSH user
▀ Starting vault (6m28s)2023-09-22T14:01:25.456Z debug found pod percona-db-pxc-db-pxc-0
ERROR setting up kplane
2023-09-22T14:01:34.475Z error failed to install kplane components: failed to install helm chart /sbin/helm install kplane-usermgr /opt/pf9/airctl/conf/helm_charts/kplane-components-0.3.4.tgz -f /opt/pf9/airctl/conf/kplane_values.yaml -f /opt/pf9/airctl/conf/secrets.yaml: exit status 1 - Error: INSTALLATION FAILED: execution error at (kplane-components/templates/required.yaml:17:5): consul_fallback_token is required from values.yaml
2023-09-22T14:01:34.475Z fatal error: failed to install helm chart /sbin/helm install kplane-usermgr /opt/pf9/airctl/conf/helm_charts/kplane-components-0.3.4.tgz -f /opt/pf9/airctl/conf/kplane_values.yaml -f /opt/pf9/airctl/conf/secrets.yaml: exit status 1 - Error: INSTALLATION FAILED: execution error at (kplane-components/templates/required.yaml:17:5): consul_fallback_token is required from values.yaml
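To confirm that a failed restore matches this article, the captured output can be searched for the error signature quoted above. The following is a minimal sketch: the helper name matches_known_issue and the file name restore.log are hypothetical, and the sketch assumes the verbose restore output was saved to a file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of airctl): reads text on stdin and
# returns 0 when it contains the error signature quoted in this article,
# 1 otherwise.
matches_known_issue() {
  grep -q 'consul_fallback_token is required from values.yaml'
}

# Example usage, assuming the restore output was saved to restore.log:
#   matches_known_issue < restore.log && echo "restore failure matches this article"
```

The match is on the chart validation message only, so it should not depend on timestamps or on the chart version in the path.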
Environment
- Platform9 Edge Cloud - LTS2-Patch2 [v-5.6.7-2624593]
Cause
This is a known issue. Jira AIR-1218 has been filed to track and resolve it.
Platform9 Engineering team is actively working to fix this issue.
Workaround
As a workaround, please follow the steps mentioned below:
1. Ensure your existing DU has no issues by running the following command and verifying that the task state is ready:
airctl status
2. Download the LTS2-Patch4 [v-5.6.7-2658688] artifacts, following the same steps as for LTS2-Patch2:
bash ./install.sh v-5.6.7-2658688
3. Run the upgrade operation following the upgrade guide (upgrade from LTS2-Patch2 to LTS2-Patch4).
The upgrade operation is expected to fail because of a known issue, and this failure can be ignored. Even though it fails, the upgrade fixes the state files that are essential for restoring LTS2 on SMCP. The upgrade from LTS2-Patch2 to LTS2-Patch4 itself is affected by the removal of an internal component known as decco and by some related codebase changes.
The expected error message is shown below:
4. After this, follow the SMCP restore process with the following change:
In step 7, while updating the nodelet-bootstrap.yaml file, add the kubedu-imgs tar file from LTS2-Patch2 to the userImages section as well. A snippet of the YAML file is shown below for reference:
isAirgapped: true
systemImages:
- /opt/pf9/airctl/imgs/kubedu-imgs-v-5.9.0-2847602.tar.gz
- /opt/pf9/airctl/imgs/nodelet-imgs-v-5.9.0-2847602.tar.gz
userImages:
- /home/centos/patch2/kubedu-imgs-v-5.6.7-2624593.tar.gz
- /home/centos/patch4/kubedu-imgs-v-5.6.7-2658688.tar.gz
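Before running the restore, it may help to verify that every archive listed under userImages actually exists on the node, since a mistyped path would only surface later. The following is a minimal sketch: the helper name missing_image_archives is hypothetical, and the paths in the usage example are the ones from the snippet above and may differ in your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of airctl): prints every argument that is
# not a readable regular file and returns non-zero if any is missing.
missing_image_archives() {
  local rc=0 path
  for path in "$@"; do
    if [ ! -f "$path" ] || [ ! -r "$path" ]; then
      printf 'missing: %s\n' "$path"
      rc=1
    fi
  done
  return "$rc"
}

# Example usage with the userImages paths from this article:
#   missing_image_archives \
#     /home/centos/patch2/kubedu-imgs-v-5.6.7-2624593.tar.gz \
#     /home/centos/patch4/kubedu-imgs-v-5.6.7-2658688.tar.gz
```

The function only checks that the files are present and readable; it does not validate the archive contents.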
Additional Information
In some cases, especially on systems with limited resources, the container runtime can perform a garbage collection of some of the kubedu images which have not been used yet. This can cause some of the operations like airctl upgrade/upgrade-hosts to fail due to ImagePullBackOff errors.
We can determine whether the images need to be reloaded by listing the images present in the container runtime and making sure that the images needed for du-upgrade or host-upgrade have not been cleaned up.
For reference, some of the images we should look for are quay.io/platform9/k8s-helm-runner and quay.io/platform9/kplane-host-upg.
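The check described above could be scripted roughly as follows. This is a sketch under stated assumptions: the helper name missing_required_images is hypothetical, and it reads an image listing from stdin, so it works with whatever listing command your container runtime provides (this article does not specify one).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads an image listing on stdin and prints every
# required image name (passed as arguments) that does not appear in it.
# Returns non-zero if at least one required image is absent.
missing_required_images() {
  local listing name rc=0
  listing=$(cat)
  for name in "$@"; do
    # -F treats the image name as a fixed string, not a regular expression.
    if ! printf '%s\n' "$listing" | grep -qF -- "$name"; then
      printf 'missing: %s\n' "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Example usage, with the listing saved to images.txt (an assumed file name):
#   missing_required_images \
#     quay.io/platform9/k8s-helm-runner \
#     quay.io/platform9/kplane-host-upg < images.txt
```

The two image names in the usage example are the ones mentioned above; they are examples and not a complete list of what the upgrade needs.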
If we find that images are missing we can run the following command before the upgrade/upgrade-hosts operations