pf9-consul Service Logs Consuming Excessive Disk Space and Impacting Workloads
Problem
- Disk space on the hypervisors is exhausted by pf9-consul logs, which are flooded with errors such as the following:
2024/07/18 02:13:37 [ERR] consul.rpc: failed to accept RPC conn: accept tcp 16.228.38.14:8300: accept4: too many open files
2024/07/18 02:13:37 [ERR] memberlist: Error accepting TCP connection: accept tcp 16.228.38.14:8301: accept4: too many open files
Environment
- Platform9 Managed OpenStack - v5.8.2 and higher.
Answer
These errors occur because the file descriptor limit for the consul user is too low for Consul to perform its operations. The workaround is to increase the file descriptor limit.
Workaround
The current soft limit is 1024 and the hard limit is 4096, which is the maximum allowed. Consul has exhausted the available file descriptors, so it continuously writes errors to its logs.
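The limits that apply to a running process are listed in /proc/<pid>/limits on Linux. The following is a minimal sketch, not a Platform9 tool, that extracts the soft and hard "Max open files" values from that file's text. The function name and the sample text are illustrative assumptions; on a host, the text would come from the limits file of the Consul process.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Each value is an int, or None when the kernel reports 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns after the three-word name: soft limit, hard limit, units.
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


# Illustrative sample in the kernel's row format. On a host, read the real
# file instead: open(f"/proc/{pid}/limits").read(), with pid being the
# Consul process ID.
LIMITS_SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)

print(parse_max_open_files(LIMITS_SAMPLE))  # (1024, 4096)
```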
The Consul file descriptor limit should be set to at least twice the expected number of clients in the cluster.
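As a worked example of this sizing rule, the sketch below computes the minimum limit for a given cluster size and checks a configured limit against it. The function names and the cluster size of 3000 clients are hypothetical and used only for illustration.

```python
def min_fd_limit(expected_clients, factor=2):
    """Smallest file descriptor limit that satisfies the 'twice the clients' rule."""
    return factor * expected_clients


def limit_is_sufficient(configured_limit, expected_clients):
    """True when the configured limit meets the sizing rule."""
    return configured_limit >= min_fd_limit(expected_clients)


# Hypothetical cluster of 3000 clients.
print(min_fd_limit(3000))                # 6000
print(limit_is_sufficient(4096, 3000))   # False
print(limit_is_sufficient(65536, 3000))  # True
```

Under this rule, a limit of 65536 would cover clusters of up to 32768 clients.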
To fix this, increase the default per-user file descriptor limit using the following steps.
- Edit /etc/security/limits.conf and add the following lines to set the file descriptor limits for all users.
* soft nofile 65536
* hard nofile 65536
If /etc/security/limits.conf is managed by Chef and local changes are reverted, make the same change in the Chef configuration so that it persists.
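To confirm that the entries are still present, for example after a Chef run, the file can be checked with a short script. This is a minimal sketch under the assumption that entries use the standard four-field layout (domain, type, item, value). The function name and sample text are illustrative; on a host, the text would be read from /etc/security/limits.conf.

```python
def nofile_entries(conf_text):
    """Map (domain, type) -> value for every 'nofile' line in limits.conf text."""
    entries = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        fields = line.split()
        if len(fields) == 4 and fields[2] == "nofile":
            domain, limit_type, _item, value = fields
            entries[(domain, limit_type)] = value  # kept as text
    return entries


# Illustrative sample matching the lines added in the workaround.
CONF_SAMPLE = "# file descriptor limits\n* soft nofile 65536\n* hard nofile 65536\n"

entries = nofile_entries(CONF_SAMPLE)
print(entries.get(("*", "soft")), entries.get(("*", "hard")))  # 65536 65536
```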
Additional Information
- The Platform9 team has opened Jira IAAS-10787 to track this issue. The changes described above will be included in the PMO-5.10.X release, with an ETA of the end of September 2024.
- For more details, refer to the official documentation: https://developer.hashicorp.com/consul/docs/architecture/scale