Volume Services Flapping Due to EADDRINUSE in pf9-comms Service
Problem
Volume provisioning failed intermittently because the storage services on one or more hosts repeatedly transitioned between the up and down states, leaving the Persistent Storage Service in a degraded status.
Environment
Private Cloud Director Virtualization - v2025.4 and Higher
Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
Component: Persistent Storage
Cause
The issue was caused by communication socket exhaustion on the affected node, leading to port binding failures. Specifically, the pf9-comms service was repeatedly logging the following error in /var/log/pf9/comms/comms.log:
EADDRINUSE encountered during SNI client creation ... retrying in 20000 milliseconds
This indicates that the required port was already in use or in the TIME_WAIT state, which prevented socket binding and disrupted communication between services (including volume driver interactions with the backend).
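One way to check a host for this condition is to count the matching entries in the comms log. The following is a minimal sketch: the helper name count_eaddrinuse is hypothetical, and the default path is the log file cited above.

```shell
# Hypothetical helper: count EADDRINUSE entries in a pf9-comms log file.
# Defaults to the log path cited in this article.
count_eaddrinuse() {
  log_file="${1:-/var/log/pf9/comms/comms.log}"
  # grep -c prints the number of matching lines. It exits non-zero when
  # nothing matches, so "|| true" keeps the function's exit status at 0.
  grep -c 'EADDRINUSE' "$log_file" || true
}
```

Running count_eaddrinuse twice a few minutes apart gives a rough signal: a count that keeps growing suggests the retry loop is still active.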
Diagnostics
The following symptoms were observed:
The openstack volume service list command showed cinder-volume in the down state intermittently.
The /var/log/pf9/comms/comms.log file contained repeated EADDRINUSE errors.
The /var/log/pf9/cindervolume-base logs on the cinder host showed API communication failures to the Tintri backend.
Network and Tintri API endpoints were reachable, and credentials were valid, ruling out direct storage backend issues.
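The first symptom can be checked from the command line. The sketch below assumes the OpenStack CLI is installed and credentials are loaded. The helper name list_down_volume_services is hypothetical. It filters value-formatted output, so it can also be fed saved output from an earlier run.

```shell
# Hypothetical helper: print "binary host" for every cinder-volume
# service whose State column is "down".
# Expects three whitespace-separated columns on stdin: Binary Host State.
list_down_volume_services() {
  awk '$1 == "cinder-volume" && $3 == "down" { print $1, $2 }'
}

# Usage (assumes the OpenStack CLI is configured):
#   openstack volume service list -f value -c Binary -c Host -c State \
#     | list_down_volume_services
```

Because the flapping is intermittent, a single empty result does not rule the problem out. Repeating the check over several minutes is more informative.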
Resolution
Restarted the pf9-comms service on the affected Cinder volume host.
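The exact restart command is not recorded here. A minimal sketch, assuming pf9-comms runs as a systemd unit of the same name on the host. The helper name and the DRY_RUN switch are hypothetical additions for rehearsing the step safely.

```shell
# Hypothetical helper: restart the pf9-comms service via systemd.
# Assumption: the unit is named "pf9-comms". Run as root or with sudo.
# Set DRY_RUN=1 to print the command instead of executing it.
restart_pf9_comms() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "systemctl restart pf9-comms"
  else
    systemctl restart pf9-comms
  fi
}
```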
Post-restart, the EADDRINUSE messages no longer appeared in comms.log.
The cinder-volume services returned to the up state.
Verified volume provisioning functionality and ensured that backend API communication was stable.
Validation
Monitored the cinder-volume service status using openstack volume service list for 30 minutes and confirmed that the services remained in the up state.
No further EADDRINUSE errors were observed in comms.log.
Volume creation and attachment operations succeeded.
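The 30-minute observation can be scripted. The sketch below is one possible approach, not the procedure used in this case. poll_stable and its arguments are hypothetical names, and the check command is supplied by the caller.

```shell
# Hypothetical helper: run a check command repeatedly for a fixed period.
# Usage: poll_stable DURATION_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 if every run of COMMAND succeeded, 1 on the first failure.
poll_stable() {
  duration=$1
  interval=$2
  shift 2
  end=$(( $(date +%s) + duration ))
  while [ "$(date +%s)" -lt "$end" ]; do
    if ! "$@"; then
      echo "check failed at $(date)" >&2
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

For a 30-minute window with one check per minute, the call would be poll_stable 1800 60 followed by a command that exits non-zero whenever any cinder-volume service is reported down.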
Additional Information
EADDRINUSE indicates that a socket binding operation failed because the port was already in use.
This is often caused by:
A large number of short-lived TCP connections
Delays in socket cleanup
Insufficient ephemeral port ranges or low timeout settings
Restarting the communication layer (pf9-comms) forces the cleanup of stale sockets and refreshes the SNI client sessions.
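To see whether a host is accumulating sockets, the TCP socket counts per state can be summarized. This is a sketch that assumes the ss utility from iproute2 is available on the host. summarize_tcp_states is a hypothetical helper that reads ss -tan output on stdin.

```shell
# Hypothetical helper: count TCP sockets per state.
# Reads "ss -tan" output on stdin; the first line is the column header
# and the first field of every other line is the socket state.
summarize_tcp_states() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (on a Linux host):
#   ss -tan | summarize_tcp_states
#   cat /proc/sys/net/ipv4/ip_local_port_range   # current ephemeral port range
```

A TIME-WAIT count that approaches the size of the ephemeral port range would be consistent with the causes listed above.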