Volume Services Flapping Due to EADDRINUSE in pf9-comms Service
Problem
Volume provisioning failed intermittently because the storage services on one or more hosts repeatedly transitioned between up and down states, leading to a degraded Persistent Storage Service status.
Environment
- Private Cloud Director Virtualization - v2025.4 and Higher
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
- Persistent Storage
Cause
The issue was caused by communication socket exhaustion on the affected node, leading to port binding failures. Specifically, the pf9-comms service was repeatedly logging the following error in /var/log/pf9/comms/comms.log:
EADDRINUSE encountered during SNI client creation ... retrying in 20000 milliseconds
This indicates that the required port was already in use or in TIME_WAIT state, preventing successful socket binding and disrupting communication between services (including volume driver interactions with the backend).
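To see whether many sockets are sitting in TIME_WAIT on the affected node, the output of `ss -tan` can be summarized by state. The following is a minimal sketch; the function name `tcp_state_summary` is hypothetical and not part of any Platform9 tooling, and it assumes the first column of the `ss` output is the socket state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize TCP sockets by state.
# Reads `ss -tan` output on stdin, skips the header line, and prints
# one "STATE COUNT" pair per line, sorted by state name.
tcp_state_summary() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage on the affected host:
#   ss -tan | tcp_state_summary
```

A large TIME-WAIT count relative to the other states would be consistent with the cause described above, although it does not by itself prove that pf9-comms is the process affected.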
Diagnostics
The following symptoms were observed:
- The openstack volume service list command showed cinder-volume in the down state intermittently.
- The /var/log/pf9/comms/comms.log file contained repeated EADDRINUSE errors.
- The /var/log/pf9/cindervolume-base logs on the Cinder host showed API communication failures to the Tintri backend, including:
reached maximum retry attempts: 2, reason: Volume driver reported an error: (source: CinderDriver, typeId: com.tintri.api.rest.v310.dto.domain.beans.TintriError, code: ERR-API-8001)
Network and Tintri API endpoints were reachable, and credentials were valid, ruling out direct storage backend issues.
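The first symptom can be checked in a script by filtering the service list for entries reported as down. This is a sketch under assumptions: the function name `list_down_volume_services` is hypothetical, and it expects the space-separated output produced by the OpenStack client's `-f value` formatter with the Binary, Host, and State columns in that order.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "BINARY HOST" for every service whose
# State column is "down". Expects three whitespace-separated columns
# on stdin: Binary, Host, State.
list_down_volume_services() {
    awk '$3 == "down" { print $1, $2 }'
}

# Usage (requires OpenStack credentials in the environment):
#   openstack volume service list -f value -c Binary -c Host -c State \
#       | list_down_volume_services
```

Because the flapping is intermittent, a single empty result does not rule the problem out; running the check several times gives a more reliable picture.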
Resolution
- Restarted the pf9-comms service on the affected Cinder volume host:
$ sudo systemctl restart pf9-comms
- Post-restart, the EADDRINUSE messages no longer appeared in comms.log.
- The cinder-volume services returned to up state:
$ openstack volume service list -c Binary -c Host -c Status -c State
+------------------+-----------------------------------+---------+-------+
| Binary           | Host                              | Status  | State |
+------------------+-----------------------------------+---------+-------+
| cinder-scheduler | cinder-scheduler-[ID]             | enabled | up    |
| cinder-volume    | [CINDER_HOST_ID]@[CINDER_BACKEND] | enabled | up    |
+------------------+-----------------------------------+---------+-------+
- Verified volume provisioning functionality and ensured that backend API communication was stable.
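To confirm that the error stopped after the restart rather than only scrolling out of view, one option is to count EADDRINUSE lines written after a known position in the log. The sketch below is illustrative: `eaddrinuse_after_line` is a hypothetical name, the 60-second settle time is an arbitrary choice (longer than the 20000 ms retry interval in the log message), and it assumes the log is not rotated between the two measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many lines containing EADDRINUSE
# appear in LOG_FILE after line number START.
eaddrinuse_after_line() {
    local log_file="$1" start="$2"
    # tail -n +N starts output at line N, so skip the first START lines.
    # grep -c exits non-zero when the count is 0; "|| true" masks that.
    tail -n "+$(( start + 1 ))" "$log_file" | grep -c 'EADDRINUSE' || true
}

# Usage on the affected host:
#   log=/var/log/pf9/comms/comms.log
#   before="$(wc -l < "$log")"
#   sudo systemctl restart pf9-comms
#   sleep 60    # arbitrary settle time (assumption)
#   eaddrinuse_after_line "$log" "$before"
```

A result of 0 matches the behavior described in the steps above; passing 0 as the start line counts matches across the whole file.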
Validation
- Monitored the cinder-volume service status using the openstack volume service list command for 30 minutes and confirmed that the services remained in the up state.
- No further EADDRINUSE errors were observed in comms.log.
- Volume creation and attachment operations succeeded.
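The 30-minute observation can be automated with a simple polling loop. This is a sketch, not a supported tool: `watch_volume_services` is a hypothetical name, the 1800-second default mirrors the 30 minutes mentioned above, and the 60-second polling interval is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll cinder-volume state for a fixed duration.
# Usage: watch_volume_services [DURATION_SECONDS] [INTERVAL_SECONDS]
# Returns 0 if every poll shows all cinder-volume services up,
#         1 as soon as one is not up,
#         2 if the openstack CLI call itself fails.
watch_volume_services() {
    local duration="${1:-1800}" interval="${2:-60}"
    local deadline out down
    deadline=$(( $(date +%s) + duration ))
    while :; do
        if ! out="$(openstack volume service list -f value -c Binary -c Host -c State)"; then
            echo "openstack CLI call failed" >&2
            return 2
        fi
        # Collect hosts whose cinder-volume service is in any state other than up.
        down="$(printf '%s\n' "$out" | awk '$1 == "cinder-volume" && $3 != "up" { print $2 }')"
        if [ -n "$down" ]; then
            echo "cinder-volume not up on: $down" >&2
            return 1
        fi
        # Stop once the observation window has elapsed.
        [ "$(date +%s)" -ge "$deadline" ] && break
        sleep "$interval"
    done
    return 0
}

# Usage (requires OpenStack credentials in the environment):
#   watch_volume_services 1800 60 && echo "services stayed up"
```

The loop always performs at least one check, and it exits early on the first failure so that a flap is reported when it happens rather than at the end of the window.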
Additional Information
EADDRINUSE indicates that a socket binding operation failed due to the port already being in use. This is often caused by:
- A large number of short-lived TCP connections
- Delays in socket cleanup
- Insufficient ephemeral port ranges or low timeout settings
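To judge whether the ephemeral port range is a plausible factor, the size of the configured range can be computed from the standard Linux sysctl file. This is a minimal sketch; `ephemeral_port_count` is a hypothetical name, and this article does not state which ports pf9-comms binds, so the result is only a rough upper bound for comparison against the number of open sockets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of ports in the ephemeral range.
# The range file holds two whitespace-separated integers: low and high.
ephemeral_port_count() {
    local range_file="${1:-/proc/sys/net/ipv4/ip_local_port_range}"
    local low high
    read -r low high < "$range_file"
    echo $(( high - low + 1 ))
}

# Usage on the affected host:
#   ephemeral_port_count
#   ss -tan | tail -n +2 | wc -l    # open TCP sockets, for comparison
```

If the socket count approaches the size of the range, the causes listed above become more likely; a small count points elsewhere.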
Restarting the communication layer (pf9-comms) forces the cleanup of stale sockets and refreshes the SNI client sessions.