Volume Services Flapping Due to EADDRINUSE in pf9-comms Service
Problem
Volume provisioning failed intermittently because the storage services on one or more hosts repeatedly transitioned between the up and down states, leaving the Persistent Storage Service in a degraded status.
Environment
- Private Cloud Director Virtualization - v2025.4 and Higher
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
- Persistent Storage
Cause
The issue was caused by communication socket exhaustion on the affected node, leading to port binding failures. Specifically, the pf9-comms service was repeatedly logging the following error in /var/log/pf9/comms/comms.log:
EADDRINUSE encountered during SNI client creation ... retrying in 20000 milliseconds
This indicates that the required port was already in use or in TIME_WAIT state, preventing successful socket binding and disrupting communication between services (including volume driver interactions with the backend).
Diagnostics
The following symptoms were observed:
- The openstack volume service list command intermittently showed cinder-volume in the down state.
- The /var/log/pf9/comms/comms.log file contained repeated EADDRINUSE errors.
- The /var/log/pf9/cindervolume-base logs on the Cinder host showed API communication failures to the Tintri backend, including:
reached maximum retry attempts: 2, reason: Volume driver reported an error: (source: CinderDriver, typeId: com.tintri.api.rest.v310.dto.domain.beans.TintriError, code: ERR-API-8001)
Network and Tintri API endpoints were reachable, and credentials were valid, ruling out direct storage backend issues.
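As a rough sketch of the log check described above, the helper below counts the lines in a comms log that mention EADDRINUSE. The function name count_eaddrinuse is hypothetical and used only for illustration; the default path is the one named in this article.

```shell
# Hypothetical helper: count log lines that mention EADDRINUSE.
# Defaults to the comms.log path given in this article.
count_eaddrinuse() {
  log_file="${1:-/var/log/pf9/comms/comms.log}"
  # grep -c prints the number of matching lines (0 if none match).
  grep -c 'EADDRINUSE' "$log_file"
}
```

Running count_eaddrinuse a few minutes apart on the affected host shows whether the count is still growing. Note that grep -c returns a non-zero exit status when nothing matches, even though it prints 0.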
Resolution
- Restarted the pf9-comms service on the affected Cinder volume host:
$ sudo systemctl restart pf9-comms
- Post-restart, the EADDRINUSE messages no longer appeared in comms.log.
- The cinder-volume services returned to the up state:
$ openstack volume service list -c Binary -c Host -c Status -c State
+------------------+-----------------------------------+---------+-------+
| Binary           | Host                              | Status  | State |
+------------------+-----------------------------------+---------+-------+
| cinder-scheduler | cinder-scheduler-[ID]             | enabled | up    |
| cinder-volume    | [CINDER_HOST_ID]@[CINDER_BACKEND] | enabled | up    |
+------------------+-----------------------------------+---------+-------+
- Verified volume provisioning functionality and ensured that backend API communication was stable.
Validation
- Monitored the cinder-volume service status using openstack volume service list for 30 minutes and confirmed that the services remained in the up state.
- No further EADDRINUSE errors were observed in comms.log.
- Volume creation and attachment operations succeeded.
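The 30-minute check above could be scripted along the following lines. This is a minimal sketch, not the procedure that was actually used: the function names count_down_services and watch_volume_services are hypothetical, the one-minute polling interval is an assumption, and the openstack client is assumed to be configured on the machine where the script runs.

```shell
# Hypothetical helper: read `openstack volume service list -f value` output
# on stdin and print how many rows have "down" in the last column.
count_down_services() {
  awk '$NF == "down" { n++ } END { print n + 0 }'
}

# Hypothetical watcher: sample once a minute for 30 minutes (assumed
# interval) and report every sample in which a service was down.
watch_volume_services() {
  i=0
  while [ "$i" -lt 30 ]; do
    # State is requested last so that it is the final column of each row.
    down=$(openstack volume service list -f value -c Binary -c Host -c State \
      | count_down_services)
    if [ "$down" -gt 0 ]; then
      echo "$(date): $down service(s) down"
    fi
    i=$((i + 1))
    sleep 60
  done
}
```

If watch_volume_services prints nothing over the full run, no sampled state was down. Flapping that occurs between two samples would not be caught, so a shorter interval may be preferable.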
Additional Information
EADDRINUSE indicates that a socket binding operation failed because the port was already in use. This is often caused by:
- A large number of short-lived TCP connections
- Delays in socket cleanup
- Insufficient ephemeral port ranges or low timeout settings
Restarting the communication layer (pf9-comms) forces the cleanup of stale sockets and refreshes the SNI client sessions.
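To see whether sockets are accumulating in a particular state, the output of ss -tan can be tallied per state. The sketch below is illustrative only: tally_tcp_states is a hypothetical name, and it assumes the usual ss layout with one header line and the state in the first column.

```shell
# Hypothetical helper: read `ss -tan` output on stdin and print
# one "STATE COUNT" line per TCP state, sorted by state name.
tally_tcp_states() {
  # NR > 1 skips the header line; $1 is the state column.
  awk 'NR > 1 { c[$1]++ } END { for (s in c) print s, c[s] }' | sort
}
```

On the affected host, it could be used like this:
$ ss -tan | tally_tcp_states
$ sysctl net.ipv4.ip_local_port_range
A TIME-WAIT count that is large relative to the ephemeral port range reported by sysctl would be consistent with the causes listed above, though it does not by itself confirm them.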