Host Aggregates Not Showing Availability Status in UI, Instead Displaying "Hosts are Loading"
Problem
The Host Aggregates of the clusters displayed the status "Hosts are loading" instead of the expected availability status in the Platform9 UI. High Availability (HA) functionality appeared to be affected.
Environment
- Platform9 Managed OpenStack - v5.10.1 and Higher
- Component - pf9-hamgr on the control plane
Cause
The pf9-hamgr service hit the file descriptor limit, which caused:
- Failure to read certificate files required for operation
- Broken connectivity to the Nova API
- The HA Manager service becoming unresponsive
- curl requests to the HA Manager endpoint returning 502 errors
As a result, the UI was unable to retrieve and display the Host Aggregate status.
The following diagnosis and resolution steps involve commands that must be run on the Platform9 control plane. Please contact Platform9 Support for assistance.
Diagnosis
- From the Management Plane, the /var/log/pf9/hamgr/hamgr.log logs showed:
OSError: [Errno 24] Too many open files: '/etc/pf9/hamgr/certs/ca/ca.cert.pem'
Failed to establish a new connection: [Errno 24] Too many open files'
- curl requests to the HA Manager endpoint returned a 502 error:
$ curl -kv https://<FQDN>/hamgr/v1/ha
< HTTP/1.1 502 Bad Gateway
- The pf9-hamgr-server process was not listening on the expected port, indicating it had stopped responding:
$ netstat -tulpn | grep <hamgr-process-id>
[no output]
- The hamgr process was holding an unusually high number of open files:
$ lsof -p <hamgr-process-id> | wc -l
648
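The raw count above is easier to interpret next to the process's own limit and the number of errors logged. The following is a minimal sketch, assuming a Linux control plane host with /proc available and sufficient privileges (typically root) to read the process's entries. The function names count_fd_errors and fd_usage are hypothetical helpers for illustration and are not part of Platform9 tooling.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Platform9 tooling.

# count_fd_errors [LOGFILE]: count log lines that report Errno 24.
# Defaults to the hamgr log path mentioned in this article.
count_fd_errors() {
  log="${1:-/var/log/pf9/hamgr/hamgr.log}"
  # -F treats the pattern as a literal string, -c prints the match count.
  grep -cF '[Errno 24] Too many open files' "$log"
}

# fd_usage PID: print the number of open descriptors and the soft limit.
fd_usage() {
  pid="$1"
  # Each entry under /proc/PID/fd is one open file descriptor.
  open=$(ls "/proc/$pid/fd" | wc -l)
  # In /proc/PID/limits, the 4th field of "Max open files" is the soft limit.
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  echo "open=$open soft_limit=$limit"
}
```

For example, running `fd_usage <hamgr-process-id>` shows how close the process is to its limit, and `count_fd_errors` with no argument reports how many matching lines the hamgr log currently contains.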
Resolution
Restarting the HA Manager service resolves the issue:
# systemctl restart pf9-hamgr-server.service
Post-restart:
- The service resumed listening on its expected port
- Open file descriptors dropped to normal operating levels
- Error logs stopped appearing
Validation
From the Management Plane:
After the restart:
- Verify that pf9-hamgr-server is listening on its expected port:
$ netstat -tulpn | grep <hamgr-process-id>
tcp 0 0 0.0.0.0:[PORT-NUMBER] 0.0.0.0:* LISTEN [HAMGR-PROCESS-ID]/python
- Check the number of open files:
$ lsof -p <hamgr-process-id> | wc -l
110
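Because a restart clears the open descriptors but may not prevent the count from growing again, it can help to turn the check above into a pass/fail test that can be run periodically. This is a sketch under assumptions: the function name fd_below_threshold and the default threshold of 80 percent are illustrative choices, not values from Platform9, and the check relies on Linux /proc and enough privileges to read the process's entries.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Platform9 tooling.

# fd_below_threshold PID [PERCENT]
# Succeeds (exit 0) when the open descriptor count is below PERCENT
# of the soft "Max open files" limit. PERCENT defaults to 80 (assumed value).
fd_below_threshold() {
  pid="$1"
  pct="${2:-80}"
  open=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  # Nothing to compare against if the limit is reported as unlimited.
  [ "$limit" = "unlimited" ] && return 0
  [ "$open" -lt $(( limit * pct / 100 )) ]
}
```

A possible use is `fd_below_threshold <hamgr-process-id> || echo "hamgr is approaching its file descriptor limit"`, run from a scheduled job on the control plane.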
From the UI:
In the UI, the user can verify the resolution by navigating to the Infrastructure