Volume Services Flapping Due to EADDRINUSE in pf9-comms Service

Problem

Volume provisioning failed intermittently because the storage services on one or more hosts repeatedly transitioned between up and down states, leaving the Persistent Storage Service in a degraded state.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher
  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
  • Persistent Storage

Cause

The issue was caused by communication socket exhaustion on the affected node, which led to port binding failures. Specifically, the pf9-comms service on the Cinder host repeatedly logged EADDRINUSE errors in /var/log/pf9/comms/comms.log.

This indicates that the required port was already in use or in TIME_WAIT state, preventing successful socket binding and disrupting communication between services (including volume driver interactions with the backend).

Diagnostics

  1. The following symptoms were observed:

    • The openstack volume service list command intermittently showed cinder-volume in the down state.
    • The /var/log/pf9/comms/comms.log contained repeated EADDRINUSE errors.
  2. The /var/log/pf9/cindervolume-base logs on the Cinder host showed API communication failures to the Tintri backend.

Network and Tintri API endpoints were reachable, and credentials were valid, ruling out direct storage backend issues.
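The checks above can be sketched as two small shell helpers. This is a minimal sketch, not an official tool: the function names are hypothetical, and the log path default is the one given in this article. The second helper assumes the OpenStack CLI is invoked with -f value -c Binary -c Host -c State so that each row is "binary host state".

```shell
# Hypothetical helpers; the function names are illustrative only.

# Print the number of log lines that mention EADDRINUSE (0 if none).
# Defaults to the comms log path referenced in this article.
count_eaddrinuse() {
    grep -c 'EADDRINUSE' "${1:-/var/log/pf9/comms/comms.log}" || true
}

# Read "Binary Host State" rows on stdin and print the hosts whose
# cinder-volume service is in any state other than "up".
down_cinder_volume_hosts() {
    awk '$1 == "cinder-volume" && $3 != "up" { print $2 }'
}
```

For example, count_eaddrinuse can be run on the Cinder host without arguments, and the second helper can be fed with: openstack volume service list -f value -c Binary -c Host -c State | down_cinder_volume_hosts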

Resolution

  1. Restarted the pf9-comms service on the affected Cinder volume host.
  2. Post-restart, the EADDRINUSE messages no longer appeared in comms.log.
  3. The cinder-volume services returned to the up state, as reported by the openstack volume service list command.
  4. Verified volume provisioning functionality and ensured that backend API communication was stable.
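The restart step might look like the following sketch. It assumes that pf9-comms is managed by systemd under a unit of the same name, which this article does not state explicitly. The function name and the SYSTEMCTL override are hypothetical conveniences for a dry run.

```shell
# Assumption: pf9-comms is a systemd unit named "pf9-comms".
# Set SYSTEMCTL=echo to print the commands instead of running them.
restart_pf9_comms() {
    # Restart the service, then report whether it came back as active.
    ${SYSTEMCTL:-sudo systemctl} restart pf9-comms &&
        ${SYSTEMCTL:-sudo systemctl} is-active pf9-comms
}
```

Running restart_pf9_comms on the affected host restarts the service and prints its state. A dry run with SYSTEMCTL=echo restart_pf9_comms only echoes the two subcommands.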

Validation

  • Monitored the cinder-volume service status using the openstack volume service list command for 30 minutes and confirmed that the services remained in the up state.
  • No further EADDRINUSE errors were observed in comms.log.
  • Volume creation and attachment operations succeeded.
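The monitoring step above could be automated with a polling loop along these lines. This is a sketch under assumptions: the function name is hypothetical, the 60-second polling interval is an arbitrary choice, and OpenStack credentials are expected to be loaded in the shell already.

```shell
# Poll cinder-volume state and stop at the first service that is not "up".
# Usage: watch_cinder_volume [duration_seconds] [interval_seconds]
watch_cinder_volume() {
    duration="${1:-1800}"   # 30 minutes, matching the validation window
    interval="${2:-60}"     # assumed polling interval
    elapsed=0
    while [ "$elapsed" -lt "$duration" ]; do
        # Each output row is "<host> <state>" for a cinder-volume service.
        states=$(openstack volume service list --service cinder-volume \
                     -f value -c Host -c State) || return 2
        if [ -z "$states" ]; then
            echo "no cinder-volume services listed"
            return 2
        fi
        not_up=$(printf '%s\n' "$states" | awk '$2 != "up" { print $1 }')
        if [ -n "$not_up" ]; then
            echo "cinder-volume not up on: $not_up"
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "cinder-volume stayed up for ${duration}s"
}
```

Called without arguments, the function polls once a minute for 30 minutes. It returns 1 as soon as a service leaves the up state and 2 if the service list cannot be retrieved.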

Additional Information

  • EADDRINUSE indicates that a socket binding operation failed due to the port already being in use.

  • This is often caused by:

    • A large number of short-lived TCP connections
    • Delays in socket cleanup
    • Insufficient ephemeral port ranges or low timeout settings
  • Restarting the communication layer (pf9-comms) forces the cleanup of stale sockets and refreshes the SNI client sessions.
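To check whether a host is accumulating sockets in the way described above, the TCP socket states can be summarized from ss output. The helper below is a hypothetical sketch that reads ss -tan output on stdin and counts sockets per state. It assumes the usual ss layout, with one header line and the state in the first column.

```shell
# Read `ss -tan` output on stdin and print "<state> <count>" per TCP state.
socket_state_summary() {
    # Skip the header line, tally the first column, and sort for stable output.
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}
```

Typical usage is ss -tan | socket_state_summary. A large TIME-WAIT count relative to the ephemeral port range (shown by sysctl net.ipv4.ip_local_port_range) would be consistent with the causes listed above, although it does not by itself prove that pf9-comms is the source.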
