Volume Services Flapping Due to EADDRINUSE in pf9-comms Service

Problem

Volume provisioning failed intermittently because the storage services on one or more hosts repeatedly transitioned between the up and down states, which degraded the Persistent Storage Service status.

Environment

Private Cloud Director Virtualization - v2025.4 and Higher
Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
Persistent Storage

Cause

The issue was caused by communication socket exhaustion on the affected node, leading to port binding failures. Specifically, the pf9-comms service was repeatedly logging the following error in /var/log/pf9/comms/comms.log:

Cinder Host

    EADDRINUSE encountered during SNI client creation ... retrying in 20000 milliseconds

This indicates that the required port was already in use or in TIME_WAIT state, preventing successful socket binding and disrupting communication between services (including volume driver interactions with the backend).
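
To gauge how often the error occurs, the matching log lines can be counted. The following is a minimal sketch, not part of the original procedure: the function name count_eaddrinuse is hypothetical, and it assumes only that each failure writes a line containing the literal string EADDRINUSE to the log path given above.

```shell
# Count lines containing EADDRINUSE in a pf9-comms log.
# $1 = log file (defaults to the path named in this article).
# Hypothetical helper; prints 0 when no line matches.
count_eaddrinuse() {
    log_file="${1:-/var/log/pf9/comms/comms.log}"
    # grep exits 1 when nothing matches; treat that as success,
    # but keep a real failure (exit 2, e.g. unreadable file) as an error.
    grep -c 'EADDRINUSE' "$log_file" || [ $? -eq 1 ]
}

# Example usage on the affected host:
#   count_eaddrinuse
```

A count that keeps growing between two runs suggests that the retries are still failing.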

Diagnostics

The following symptoms were observed:
- The openstack volume service list command intermittently showed cinder-volume in the down state.
- The /var/log/pf9/comms/comms.log contained repeated EADDRINUSE errors.
- The /var/log/pf9/cindervolume-base logs on the Cinder host showed API communication failures to the Tintri backend, including:

Cinder Host

    reached maximum retry attempts: 2, reason: Volume driver reported an error:  (source: CinderDriver, typeId: com.tintri.api.rest.v310.dto.domain.beans.TintriError, code: ERR-API-8001)

Network and Tintri API endpoints were reachable, and credentials were valid, ruling out direct storage backend issues.
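
The first symptom can be checked without reading the full table. This is a sketch under assumptions: the helper name down_volume_hosts is hypothetical, and it expects the two-column "host state" lines that the openstack client prints with the value formatter.

```shell
# Print the hosts whose volume service is not "up".
# Reads "<host> <state>" lines on stdin, as produced by:
#   openstack volume service list --service cinder-volume -f value -c Host -c State
# Hypothetical helper for illustration.
down_volume_hosts() {
    awk '$2 != "up" { print $1 }'
}

# Example usage:
#   openstack volume service list --service cinder-volume \
#       -f value -c Host -c State | down_volume_hosts
```

Because the service was flapping, a single empty result does not prove the host is healthy; repeating the check a few times is more informative.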

Resolution

Restarted the pf9-comms service on the affected Cinder volume host:

Cinder Host

    $ sudo systemctl restart pf9-comms

Post-restart, the EADDRINUSE messages no longer appeared in comms.log.
The cinder-volume services returned to up state:

Command

    $ openstack volume service list -c Binary -c Host -c Status -c State
    +------------------+-----------------------------------+---------+-------+
    | Binary           | Host                              | Status  | State |
    +------------------+-----------------------------------+---------+-------+
    | cinder-scheduler | cinder-scheduler-[ID]             | enabled | up    |
    | cinder-volume    | [CINDER_HOST_ID]@[CINDER_BACKEND] | enabled | up    |
    +------------------+-----------------------------------+---------+-------+

Verified volume provisioning functionality and ensured that backend API communication was stable.
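
To confirm that the errors stopped because of the restart, only the log lines written after it should be considered. One way to sketch this, assuming the log is not rotated during the check, is to record the line count before the restart and search only past that point. The function name eaddrinuse_after_line and the 60-second wait are illustrative choices, not values from the original procedure.

```shell
# Count EADDRINUSE lines that appear after the first $2 lines of file $1.
# Hypothetical helper; assumes the log was not rotated in between.
eaddrinuse_after_line() {
    # tail -n +K starts output at line K, so skip the first $2 lines.
    tail -n "+$(( $2 + 1 ))" "$1" | grep -c 'EADDRINUSE' || [ $? -eq 1 ]
}

# Example usage on the affected Cinder host:
#   before=$(wc -l < /var/log/pf9/comms/comms.log)
#   sudo systemctl restart pf9-comms
#   sleep 60   # illustrative wait; the logged retry interval is 20 seconds
#   eaddrinuse_after_line /var/log/pf9/comms/comms.log "$before"
```

A result of 0 means no new EADDRINUSE line was logged since the restart.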

Validation

Monitored the cinder-volume service status using the openstack volume service list command for 30 minutes and confirmed that the services remained in the up state.
No further EADDRINUSE errors were observed in comms.log.
Volume creation and attachment operations succeeded.
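
The 30-minute observation can be automated with a small polling loop. The sketch below is an assumption-laden example: monitor_services is a hypothetical helper, the 60-second interval is arbitrary, and the status command is expected to print "host state" lines.

```shell
# Poll a status command until the duration elapses.
# $1 = total seconds, $2 = seconds between polls,
# remaining arguments = command that prints "<host> <state>" lines.
# Returns 1 as soon as any service is not "up" or no data is returned.
monitor_services() {
    duration=$1; interval=$2; shift 2
    deadline=$(( $(date +%s) + duration ))
    while :; do
        if ! "$@" | awk '$2 != "up" { bad = 1 } END { exit (bad || NR == 0) }'; then
            echo "service not up (or no data) at $(date)" >&2
            return 1
        fi
        [ "$(date +%s)" -ge "$deadline" ] && return 0
        sleep "$interval"
    done
}

# Example usage (30 minutes, one poll per minute):
#   monitor_services 1800 60 openstack volume service list \
#       --service cinder-volume -f value -c Host -c State
```

Treating an empty response as a failure avoids reporting success when the API itself cannot be reached.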

Additional Information

EADDRINUSE indicates that a socket binding operation failed due to the port already being in use.
This is often caused by:
- A large number of short-lived TCP connections
- Delays in socket cleanup
- Insufficient ephemeral port ranges or low timeout settings
Restarting the communication layer (pf9-comms) forces the cleanup of stale sockets and refreshes the SNI client sessions.
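
The conditions listed above can be inspected on a Linux host before restarting anything. This is a hedged sketch: it assumes the ss utility from iproute2 is installed, and the helper names count_tcp_states and port_range_size are hypothetical.

```shell
# Summarize TCP sockets by state. Reads `ss -tan` output on stdin
# (the first line is the header, the first column is the state).
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Size of the ephemeral port range. Reads "<low> <high>" on stdin,
# the format of /proc/sys/net/ipv4/ip_local_port_range on Linux.
port_range_size() {
    awk '{ print $2 - $1 + 1 }'
}

# Example usage on the affected host:
#   ss -tan | count_tcp_states
#   port_range_size < /proc/sys/net/ipv4/ip_local_port_range
```

A TIME-WAIT count that approaches the size of the ephemeral port range would be consistent with the port exhaustion described in the Cause section.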
