Platform9 Services In Failed State Due To "Too many open files" Error
Problem
Platform9 services are in a failed state with the 203/EXEC error code, and the logs show "Too many open files in system" OSError entries. The pf9-hostagent service failure scenario is shown in the snippets below:
[root@ttm10cm20-worker-20 ~]# systemctl status pf9-hostagent.service
* pf9-hostagent.service - Platform9 Host Agent Service
Loaded: loaded (/usr/lib/systemd/system/pf9-hostagent.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2023-06-01 00:35:57 PDT; 2 weeks 0 days ago
Process: 280608 ExecStart=/bin/bash -c /opt/pf9/hostagent/bin/pf9-hostd >> /var/log/pf9/hostagent-daemon.log 2>&1 (code=exited, status=203/EXEC)
Process: 280600 ExecStartPre=/opt/pf9/hostagent/pf9-hostagent-prestart.sh (code=exited, status=0/SUCCESS)
Main PID: 280608 (code=exited, status=203/EXEC)
No open files are associated with the pf9-hostagent service processes:
[root@node ~]# lsof -p 280608
[root@node ~]# lsof -p 280600
[root@node ~]#
The hostagent logs report that the desired configuration cannot be loaded because there are too many open files in the system:
2023-06-01 09:26:53,364 - session.py ERROR - Failed to load desired configuration: [Errno 23] Too many open files in system: '/var/opt/pf9/hostagent/cb59738d-9293-46be-af40-41024c49c10e/desired_apps.json'
2023-06-01 09:26:53,364 - slave.py ERROR - IOError: [Errno 23] Too many open files in system
---
OSError: [Errno 23] Too many open files in system.
Environment
- Platform9 Managed Kubernetes - Any version.
- Platform9 Managed OpenStack - Any version.
- Platform9 Edge Cloud - Any version.
Cause
One or more services/processes are consuming most of the system-wide open file limit, leaving the remaining processes unable to open files and causing the service failure.
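As a quick check (assuming a standard Linux kernel), the system-wide file-handle usage and limit can be read from /proc: the first field of file-nr is the number of handles currently allocated across the system, and the last field is the fs.file-max limit.
# cat /proc/sys/fs/file-nr
# sysctl fs.file-max
If the allocated count is at or near fs.file-max, the node as a whole has run out of file handles, which matches the "Too many open files in system" error above.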
Resolution
It is important to identify the specific process that is holding the largest number of open files. If that open file count is legitimate, it is recommended to increase the limit after consulting with the OS support team; otherwise, optimise the process so that it operates within the allotted file count. The lsof command below lists the top 15 processes holding the most open files:
# lsof | awk '{ print $1 " " $2; }' | sort -rn | uniq -c | sort -rn | head -15
If the lsof command is still not working on the affected node, try the script below, which lists the open file count against each process ID on the node:
# ls -l /proc/ | awk '{print $9}' | grep -v "^[A-Za-z]" > files
# cat files | wc -l
175   // Total process count on the node
# for i in $(cat files); do echo -e "============================= $i" ; ls /proc/$i/fdinfo/ | wc -l ; done | less
Sample lsof output (columns: open file count, process name, process ID):
root@ip-10-0-1-154:~# lsof | awk '{ print $1 " " $2; }' | grep -v "pwd" | sort -rn | uniq -c | sort -rn | head -15
2664 kube-apis 24378
1800 container 21494
1515 etcd 22292
867 kubelet 28012
825 pf9-comms 4610
742 container 24282
435 container 27527
406 container 34678
406 container 32929
395 udisksd 521
377 container 950147
363 pf9-sidek 4619
341 forward_b 1701840
292 pf9-hostd 4654
245 multipath 3767107
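If neither lsof nor the loop above is convenient, a sorted top-15 view can also be built directly from /proc. The one-liner below is a minimal sketch (not a Platform9-provided tool) that assumes a standard /proc layout and root access; it counts the entries under each /proc/<pid>/fd directory:
# for i in /proc/[0-9]*; do echo "$(ls $i/fd 2>/dev/null | wc -l) $(cat $i/comm 2>/dev/null) ${i#/proc/}"; done | sort -rn | head -15
The columns match the lsof sample above: open file count, process name and process ID.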
Additional Information
Since the ulimit and kernel file-handle limits are critical Operating System level settings, it is recommended to change these values only after consulting with the respective users' OS support team.
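For reference, the system-wide limit reported in the error above is controlled by the fs.file-max kernel parameter. The commands below are a minimal sketch only; the value and the drop-in file name are placeholders and should be agreed with the OS support team first:
# sysctl -w fs.file-max=<approved-value>
The change above takes effect immediately but is lost on reboot. To persist it, add the setting to a sysctl drop-in file and reload it:
# echo "fs.file-max = <approved-value>" > /etc/sysctl.d/99-fs-file-max.conf
# sysctl -p /etc/sysctl.d/99-fs-file-max.conf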