Troubleshooting Hostagent Service Failures.
Problem
After a same-version upgrade on an LTS1- patch 12.1 [v-5.3.0-2075501] setup, the hostagent service does not come up on the worker node.
# systemctl status pf9-hostagent.service
* pf9-hostagent.service - Platform9 Host Agent Service
   Loaded: loaded (/usr/lib/systemd/system/pf9-hostagent.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2023-08-28 08:49:56 UTC; 32min ago
  Process: 383 ExecStart=/bin/bash -c /opt/pf9/hostagent/bin/pf9-hostd >> /var/log/pf9/hostagent-daemon.log 2>&1 (code=exited, status=1/FAILURE)
  Process: 364 ExecStartPre=/opt/pf9/hostagent/pf9-hostagent-prestart.sh (code=exited, status=0/SUCCESS)
 Main PID: 383 (code=exited, status=1/FAILURE)
After node reboot:
# systemctl status pf9-hostagent.service
* pf9-hostagent.service - Platform9 Host Agent Service
   Loaded: loaded (/usr/lib/systemd/system/pf9-hostagent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2023-08-29 12:16:11 UTC; 18s ago
  Process: 2097 ExecStart=/bin/bash -c /opt/pf9/hostagent/bin/pf9-hostd >> /var/log/pf9/hostagent-daemon.log 2>&1 (code=exited, status=1/FAILURE)
  Process: 2058 ExecStartPre=/opt/pf9/hostagent/pf9-hostagent-prestart.sh (code=exited, status=0/SUCCESS)
 Main PID: 2097 (code=exited, status=1/FAILURE)
   CGroup: /system.slice/pf9-hostagent.service
No new entries are being written to the hostagent log. However, earlier entries in the hostagent log, such as the one below, indicate that the hostagent service is unable to fetch the IP addresses associated with the node because the fetch_ip_address.py script is not returning the expected output:
2023-06-19 04:39:38,136 - session.py ERROR - timeout 120 /opt/pf9/hostagent/extensions/fetch_ip_address.py command failed: Command '['timeout', '120', '/opt/pf9/hostagent/extensions/fetch_ip_address.py']' returned non-zero exit status 1.
Environment
- Platform9 Edge Cloud - LTS1- patch 12.1 [v-5.3.0-2075501].
Answer
This is a known issue; the Platform9 Engineering team is investigating to identify the root cause and resolve it. In these scenarios, execution of the fetch_ip_address.py script is observed to fail, so it is recommended to share the outputs below from the customer environment:
- Check whether the IPs are populated when the fetch_ip_address.py script is executed manually, as shown below:
[centos@test-pf9-airgap-c7-2842200-110-1 ~]$ /opt/pf9/hostagent/bin/python /opt/pf9/hostagent/extensions/fetch_ip_address.py
["10.149.101.110", "192.168.122.1"]
[centos@test-pf9-airgap-c7-2842200-110-1 ~]$
- Compare whether the Python libraries are the same on the working and non-working nodes:
[centos@test-pf9-airgap-c7-2842200-110-1 ~]$ /opt/pf9/hostagent/bin/python -m pip list
Package            Version
------------------ ---------
bbcommon           0.1
bbslave            0.1
certifi            2021.10.8
cffi               1.15.0
charset-normalizer 2.0.12
configutils        0.1
cryptography       2.8
distro             1.7.0
idna               3.3
netifaces          0.10.6
pf9app             0.1
pika               0.13.1
pip                22.0.4
psutil             5.4.5
py-cpuinfo         7.0.0
pycparser          2.21
PyYAML             5.3.1
requests           2.27.1
setuptools         61.2.0
six                1.12.0
urllib3            1.26.9
wheel              0.37.1
- Identify the extension (fetch_ip_address.py) named in the error in hostagent.log and try to execute it manually, as shown in the snippet below:
2023-06-19 04:39:38,136 - session.py ERROR - timeout 120 /opt/pf9/hostagent/extensions/fetch_ip_address.py command failed: Command '['timeout', '120', '/opt/pf9/hostagent/extensions/fetch_ip_address.py']' returned non-zero exit status 1.
[centos@test-pf9-airgap-c7-2842200-110-3 extensions]$ pwd
/opt/pf9/hostagent/extensions
[centos@test-pf9-airgap-c7-2842200-110-3 extensions]$ ls -l
total 64
-rwxr-xr-x. 1 pf9 pf9group 1967 Jul 1 2022 fetch_cloud_metadata
-rwxr-xr-x. 1 root root 3764 Sep 5 12:16 fetch_cpu_stats.py
-rwxr-xr-x. 1 pf9 pf9group 82 Jul 1 2022 fetch_firewalld_status
-rwxr-xr-x. 1 root root 2096 Sep 5 12:16 fetch_interfaces.py
-rwxr-xr-x. 1 root root 1299 Sep 5 12:16 fetch_ip_address.py
-rwxr-xr-x. 1 pf9 pf9group 310 Jul 1 2022 fetch_kube_api_status
-rwxr-xr-x. 1 pf9 pf9group 171 Jul 1 2022 fetch_kube_node_ready
-rwxr-xr-x. 1 pf9 pf9group 335 Jul 1 2022 fetch_listened_ports
-rwxrwx---. 1 pf9 pf9group 2483 Jul 1 2022 fetch_pf9_kube_status.py
-rwxr-xr-x. 1 pf9 pf9group 143 Jul 1 2022 fetch_physical_nics
-rwxrwx---. 1 pf9 pf9group 4461 Jul 1 2022 fetch_pod_info.py
-rwxr-xr-x. 1 root root 958 Sep 5 12:16 fetch_resource_usage.py
-rwxr-xr-x. 1 pf9 pf9group 164 Jul 1 2022 fetch_selinux_status
-rwxr-xr-x. 1 root root 1501 Sep 5 12:16 fetch_volumes_present.py
-rw-r--r--. 1 pf9 pf9group 2172 Jul 1 2022 service_status.sh
[centos@test-pf9-airgap-c7-2842200-110-3 extensions]$ ./fetch_ip_address.py
["192.168.122.1", "10.149.101.15"]
[centos@test-pf9-airgap-c7-2842200-110-3 extensions]$
With the above three outputs, reach out to the Platform9 Support Team and reference Jira ID AIR-1199, which is in place to track this issue.
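The three checks above could also be gathered with a small helper script. The following is a minimal sketch and not a Platform9 tool: all function names are hypothetical, it assumes the log entries follow the format quoted in this article and that the extension prints a JSON list as in the sample outputs, and it needs Python 3.7 or later to run.

```python
import json
import re
import subprocess

# Interpreter path as shown in this article.
HOSTAGENT_PYTHON = "/opt/pf9/hostagent/bin/python"

# Matches the "command failed" log entries quoted in this article and
# captures the path of the extension script that failed.
FAILED_CMD = re.compile(r"ERROR - timeout \d+ (\S+) command failed")


def failed_extensions(log_text):
    """Return the distinct extension paths reported as failed, in order of appearance."""
    seen = []
    for match in FAILED_CMD.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


def parse_pip_list(text):
    """Turn `pip list` output into a {package: version} dict, skipping the header rows."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines, the "Package Version" header and the dashed rule.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        packages[parts[0]] = parts[1]
    return packages


def diff_packages(working, failing):
    """Return {package: (working_version, failing_version)} for every mismatch."""
    names = sorted(set(working) | set(failing))
    return {
        name: (working.get(name), failing.get(name))
        for name in names
        if working.get(name) != failing.get(name)
    }


def run_extension(script, python=HOSTAGENT_PYTHON, timeout=120):
    """Run an extension with the hostagent interpreter and parse its output.

    Assumption: the extension prints JSON on stdout, as the sample outputs suggest.
    The 120-second timeout mirrors the value seen in the log entry.
    """
    result = subprocess.run(
        [python, script], capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{script} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)
```

For example, saving the `pip list` output from a working node and from a non-working node to two files, then passing their contents through `parse_pip_list` and `diff_packages`, leaves only the packages whose versions differ or that are missing on one side. The full outputs should still be shared with the support team as described above.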
The Python interpreter used to execute the fetch_ip_address.py script is /opt/pf9/hostagent/bin/python. Its location can be seen in the systemctl status output of the pf9-hostagent service on nodes where the service is active:
[centos@test-pf9-airgap-c7-2842200-110-1 ~]$ systemctl status pf9-hostagent.service
● pf9-hostagent.service - Platform9 Host Agent Service
   Loaded: loaded (/usr/lib/systemd/system/pf9-hostagent.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-09-05 12:42:05 UTC; 5 days ago
 Main PID: 14463 (bash)
    Tasks: 2
   Memory: 114.6M
   CGroup: /system.slice/pf9-hostagent.service
           ├─14463 /bin/bash -c /opt/pf9/hostagent/bin/pf9-hostd >> /var/log/pf9/hostagent-daemon.log 2>&1
           └─14466 /opt/pf9/hostagent/bin/python /opt/pf9/hostagent/bin/pf9-hostd <------------
Additional Information
Other hostagent-related issues:
- Hostagent Installation Failing Due To Apt Key Being Not Present
- Platform9 Related Package Installation Failing Due To Apt Cache Corruption
- Pf9-hostagent Failing With Error "No module named 'apt_pkg'"