Platform9 Services In Failed State Due To "Too many open files" Error

Problem

Platform9 services are in a failed state with the 203/EXEC error code, and the logs show "Too many open files in system" OSErrors. The snippets below show this failure for the pf9-hostagent service:

Service failure

No open files are associated with the pf9-hostagent service:

Open files

The hostagent logs report that the configuration cannot be loaded because there are too many open files in the system:

Hostagent logs

Environment

  • Platform9 Managed Kubernetes - Any version.
  • Platform9 Managed OpenStack - Any version.
  • Platform9 Edge Cloud - Any version.

Cause

One or more services or processes are consuming most of the available open-file limit, so other processes run out of file handles and their services fail.

Resolution

It is important to identify the specific process that holds the largest number of open files. If the open file count is legitimate, increase the ulimit value after consulting the OS support team. Otherwise, optimise the process so that it operates within its allotted file count. In the snippet below, the lsof output lists the top 15 processes by number of open files:

Top 15 processes with the most open files
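One way to produce such a ranking is sketched below. It is an assumption about the approach, not necessarily the exact command used in the original snippet. The hypothetical helper top_lsof_consumers reads lsof output on stdin, skips the header row, counts lines per COMMAND and PID (the first two lsof columns), and prints the largest consumers as count, process name, and process ID.

```shell
# top_lsof_consumers [N]: aggregate lsof output (stdin) per command and PID
# and print the N largest entries (default 15) as "count command pid".
# Hypothetical helper name, for illustration only.
top_lsof_consumers() {
  awk 'NR > 1 { count[$1 " " $2]++ }          # NR > 1 skips the lsof header row
       END    { for (k in count) print count[k], k }' |
    sort -rn |
    head -n "${1:-15}"
}
```

It could be run as root with `lsof -n 2>/dev/null | top_lsof_consumers 15`. Some lsof versions also list per-thread entries, which can inflate the counts, so the result is best read as a relative ranking rather than an exact descriptor count.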

If the lsof command does not work on the affected nodes, try the script below, which lists the open file count, process name, and process ID of the top 15 processes:

Script
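A minimal sketch of an lsof-free approach follows, assuming a Linux node with /proc mounted. It counts the entries under each /proc/<pid>/fd directory and reads the process name from /proc/<pid>/comm. The function name count_open_fds and its two optional arguments (proc root and number of rows) are illustrative and may differ from the original script.

```shell
# count_open_fds [PROC_ROOT] [N]: print "count name pid" for the N processes
# (default 15) with the most entries in PROC_ROOT/<pid>/fd (default /proc).
# Hypothetical helper name, for illustration only.
count_open_fds() {
  proc_root="${1:-/proc}"
  for dir in "$proc_root"/[0-9]*; do
    [ -d "$dir/fd" ] || continue
    # Reading another user's fd directory requires root; unreadable ones count as 0.
    n=$(ls -1 "$dir/fd" 2>/dev/null | wc -l | tr -d ' ')
    name=$(cat "$dir/comm" 2>/dev/null)
    echo "$n ${name:-unknown} ${dir##*/}"
  done | sort -rn | head -n "${2:-15}"
}
```

Run as root, `count_open_fds` with no arguments scans /proc and prints the top 15 entries in the order open file count, process name, process ID.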

Sample output (in the order: open file count, process name, process ID):

Sample

Additional Information

Because ulimit values are critical operating-system-level settings, change them only after consulting the OS support team responsible for the affected hosts.
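Before raising any limit, it can help to record the values currently in effect for the process in question. The read-only sketch below assumes a Linux node, where each process exposes its limits in /proc/<pid>/limits; the helper name max_open_files is hypothetical. It extracts the soft and hard "Max open files" values from that file's format and changes nothing.

```shell
# max_open_files: read /proc/<pid>/limits content from stdin and print the
# soft and hard limits for open files. Hypothetical helper name.
max_open_files() {
  # The row looks like: "Max open files   <soft>   <hard>   files"
  awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }'
}
```

For a process ID held in a shell variable pid, it could be used as `max_open_files < "/proc/$pid/limits"`. The limit of the current shell is available through `ulimit -n`.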
