Platform9 Services In Failed State Due To "Too many open files" Error
Problem
Platform9 services are in a failed state with exit code 203/EXEC, and the logs show "Too many open files in system" OSErrors. The pf9-hostagent service failure scenario is shown in the snippets below:
[root@ttm10cm20-worker-20 ~]# systemctl status pf9-hostagent.service
* pf9-hostagent.service - Platform9 Host Agent Service
   Loaded: loaded (/usr/lib/systemd/system/pf9-hostagent.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2023-06-01 00:35:57 PDT; 2 weeks 0 days ago
  Process: 280608 ExecStart=/bin/bash -c /opt/pf9/hostagent/bin/pf9-hostd >> /var/log/pf9/hostagent-daemon.log 2>&1 (code=exited, status=203/EXEC)
  Process: 280600 ExecStartPre=/opt/pf9/hostagent/pf9-hostagent-prestart.sh (code=exited, status=0/SUCCESS)
 Main PID: 280608 (code=exited, status=203/EXEC)

No open files are associated with the pf9-hostagent service processes:
[root@node ~]# lsof -p 280608
[root@node ~]# lsof -p 280600
[root@node ~]#

The hostagent logs report that the desired configuration cannot be loaded because the system has too many open files:
2023-06-01 09:26:53,364 - session.py ERROR - Failed to load desired configuration: [Errno 23] Too many open files in system: '/var/opt/pf9/hostagent/cb59738d-9293-46be-af40-41024c49c10e/desired_apps.json'
2023-06-01 09:26:53,364 - slave.py ERROR - IOError: [Errno 23] Too many open files in system
---
OSError: [Errno 23] Too many open files in system.
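Errno 23 in these messages is ENFILE, the kernel-wide open file limit, as opposed to errno 24 (EMFILE), the per-process limit. As a quick sanity check (a minimal sketch, not part of the original logs), the error number can be decoded on the affected node:

# Decode errno 23 - on Linux this is ENFILE, the system-wide open file limit
# (errno 24 / EMFILE would point to the per-process limit instead).
python3 -c 'import errno, os; print(errno.errorcode[23], "-", os.strerror(23))'
# Expected output: ENFILE - Too many open files in system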
Environment
- Platform9 Managed Kubernetes - Any version.
- Platform9 Managed OpenStack - Any version.
- Platform9 Edge Cloud - Any version.
Cause
One or more services/processes are consuming most of the system-wide open file limit, leaving too few file handles for other processes and causing them, including the Platform9 services, to fail.
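The wording "in system" points to the kernel-wide file handle limit (fs.file-max) rather than a per-process ulimit. As a quick confirmation, the number of allocated handles can be compared against that maximum (a minimal sketch; the figures in the comment are illustrative only):

# Allocated file handles, unused-but-allocated handles, and the system-wide maximum
cat /proc/sys/fs/file-nr
# e.g. "1544600  0  1544618" - allocated handles almost at the maximum (illustrative values)
# The same maximum can also be read via sysctl:
sysctl fs.file-max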
Resolution
It is important to identify the specific processes holding the largest number of open files. If those open file counts are legitimate, it is recommended to increase the limits after consulting with the OS support team; otherwise, optimise the offending process to operate within the allotted file counts. The lsof command below lists the top 15 processes by number of open files:
# lsof | awk '{ print $1 " " $2; }' | sort -rn | uniq -c | sort -rn | head -15

If the lsof command does not work on the affected node, try the commands below, which list the open file count for every process ID on the node:
# ls -l /proc/ | awk '{print $9}' | grep -v "^[A-Za-z]" > files

root@ip-10-0-1-154:~# cat files | wc -l
175     // Process count [all processes on the node]

# for i in $(cat files); do echo -e "============================= $i" ; ls /proc/$i/fdinfo/ | wc -l ; done | less

Sample lsof output [columns: open file count, process name, process ID]:
root@ip-10-0-1-154:~# lsof | awk '{ print $1 " " $2; }' | grep -v "pwd" | sort -rn | uniq -c | sort -rn | head -15
   2664 kube-apis 24378
   1800 container 21494
   1515 etcd 22292
    867 kubelet 28012
    825 pf9-comms 4610
    742 container 24282
    435 container 27527
    406 container 34678
    406 container 32929
    395 udisksd 521
    377 container 950147
    363 pf9-sidek 4619
    341 forward_b 1701840
    292 pf9-hostd 4654
    245 multipath 3767107
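As an alternative to the loop above, the same information (open file descriptor count, process name, and process ID) can be gathered directly from /proc and sorted without lsof. This is a minimal sketch, not part of the original procedure, and should be run as root so that every process's fd directory is readable:

# Print the top 15 processes by open file descriptor count: "count name pid"
for pid in /proc/[0-9]*; do
    count=$(ls "$pid/fd" 2>/dev/null | wc -l)    # open file descriptors for this PID
    name=$(cat "$pid/comm" 2>/dev/null)          # process name
    echo "$count $name ${pid##*/}"
done | sort -rn | head -15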
Additional Information
Since the ulimit counts are critical Operating System level values, it is recommended to make changes to them only after consulting with the respective OS support team.
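For reference, the limits involved are usually adjusted in the places shown below. The values, the file name, and the use of pf9-hostagent.service as the example unit are assumptions for illustration only; choose actual figures together with the OS support team:

# Kernel-wide open file handle limit (the limit behind "Too many open files in system").
# 2097152 is a placeholder value, not a recommendation.
echo "fs.file-max = 2097152" > /etc/sysctl.d/99-file-max.conf
sysctl -p /etc/sysctl.d/99-file-max.conf

# Per-process descriptor limit for a systemd service, via a drop-in override.
# The unit name and LimitNOFILE value are examples only.
systemctl edit pf9-hostagent.service      # add: [Service] LimitNOFILE=65536
systemctl daemon-reload
systemctl restart pf9-hostagent.service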