Volume Creation or Clone Requests Delayed Due to Storage Service Flapping

Problem

The Block Storage service on some hosts exhibits "flapping" behavior: the scheduler sees the service switch repeatedly between up and down states. Because the service takes too long to report its status, volume tasks fail or get stuck, and a task that should finish in seconds can take more than 15 minutes.

Environment

  • Private Cloud Director Virtualization - v2025.4 and higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and higher

  • Component - Block Storage

Cause

The default Linux kernel value for net.ipv4.tcp_retries2 is 15. If temporary network latency or a hiccup breaks the connection between the Block Storage service and the RabbitMQ server, a value of 15 makes the TCP stack wait between 13 and 30 minutes before it times out the broken connection. While it waits, the Block Storage service is blocked and stops sending heartbeats, so the Scheduler assumes the host is dead. This produces the flapping (switching between UP and DOWN).
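
To make the effect of this setting concrete, the following rough sketch estimates the minimum time the kernel keeps retransmitting on a dead connection. It assumes the usual Linux defaults of a 200 ms initial retransmission timeout (RTO) and a 120 s RTO cap, with the RTO doubling after each retry. Neither value comes from this article, and the real RTO depends on the measured round-trip time, so treat the result as a lower bound rather than an exact figure.

```shell
# Estimate the lower bound, in milliseconds, on how long TCP keeps
# retransmitting before it gives up on an unacknowledged segment.
# Assumed (not from this article): initial RTO 200 ms, RTO cap 120 s.
rto_timeout_ms() {
  retries=$1
  rto=200
  rto_max=120000
  total=0
  attempt=0
  # The timeout model sums retries + 1 exponentially growing intervals.
  while [ "$attempt" -le "$retries" ]; do
    total=$((total + rto))
    rto=$((rto * 2))
    if [ "$rto" -gt "$rto_max" ]; then
      rto=$rto_max
    fi
    attempt=$((attempt + 1))
  done
  echo "$total"
}

rto_timeout_ms 15   # prints 924600 (about 15.4 minutes)
rto_timeout_ms 7    # prints 51000 (51 seconds)
```

Under these assumptions, lowering the value from 15 to 7 shortens the wait on a dead connection from at least a quarter of an hour to under a minute.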

Diagnostics

Review the following logs to confirm this issue:

In SaaS environments, Step 1 verification is handled by Platform9 on the backend. Please contact Platform9 Support for verification details if needed.

Step 1: Cinder Scheduler Logs (cinder-scheduler log)

Look for warnings indicating the volume service is down for specific host IDs:

WARNING cinder.scheduler.host_manager [None [req-UUID] [USER_ID] [PROJECT_ID] - - - -] volume service is down. (host: [STORAGE_UUID]@[BACKEND_STORAGE_HOST])
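
When many hosts are involved, it can help to count these warnings per host. The helper below is a minimal sketch, not an official tool: count_down_warnings is a hypothetical name, and the scheduler log path is passed as an argument because this article does not give its location.

```shell
# Count "volume service is down" warnings per host in a scheduler log.
# Usage: count_down_warnings <path-to-cinder-scheduler-log>
count_down_warnings() {
  grep 'volume service is down' "$1" |
    sed -n 's/.*(host: \([^)]*\)).*/\1/p' |
    sort | uniq -c | sort -rn
}
```

Hosts that appear repeatedly in the output are the likely candidates for the tuning described under Resolution.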

Step 2: Block Storage Logs (/var/log/pf9/cindervolume-base.log)

Look for report_state outlasting its interval:
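
One way to pull these entries out of the log is sketched below. The function name show_report_state_overruns is hypothetical, and the "outlasted interval" wording is an assumption about how the periodic-task warning is phrased; adjust the pattern if your logs read differently.

```shell
# Show the most recent warnings where the periodic report_state call
# overran its interval. The "outlasted interval" phrase is an assumed
# log wording, not something confirmed by this article.
show_report_state_overruns() {
  log_file=${1:-/var/log/pf9/cindervolume-base.log}
  grep 'report_state' "$log_file" | grep 'outlasted interval' | tail -n 20
}
```

Call it with no argument to read the default log, or pass a rotated log file as the first argument.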

Step 3: Database/Messaging Errors (/var/log/pf9/cindervolume-base.log)

Identify intermittent DB connection losses or RabbitMQ heartbeat misses:
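
A broad search such as the following sketch can surface these entries. The function name find_connection_errors is hypothetical, and every search pattern is an assumption about typical database and RabbitMQ client messages rather than text taken from this article, so widen or narrow the patterns to match what your logs contain.

```shell
# List recent lines that hint at lost database connections or missed
# RabbitMQ heartbeats. All patterns below are assumptions.
find_connection_errors() {
  log_file=${1:-/var/log/pf9/cindervolume-base.log}
  grep -iE 'DBConnectionError|Lost connection to MySQL|AMQP server .* is unreachable|heartbeat' \
    "$log_file" | tail -n 20
}
```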

Resolution

To resolve this, lower the kernel parameter net.ipv4.tcp_retries2 from 15 to 7 so that the service gives up on dead connections sooner and reconnects faster.

Step 1: Create the persistence configuration file
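
A minimal sketch of this step, assuming the file name referenced under Validation (/etc/sysctl.d/99-cinder-tuning.conf). The function name write_tuning_conf is hypothetical, and the optional path argument exists only so the function can be tried against a scratch file first.

```shell
# Write the persistent setting so that it survives a reboot.
# Run as root when writing to the default path under /etc/sysctl.d.
write_tuning_conf() {
  conf_file=${1:-/etc/sysctl.d/99-cinder-tuning.conf}
  printf 'net.ipv4.tcp_retries2 = 7\n' > "$conf_file"
}
```

Running write_tuning_conf with no argument as root creates the file at the default path.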

Step 2: Apply the parameter to the live kernel
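
The running kernel can be updated with sysctl -w, as in this sketch. The wrapper name apply_tcp_retries2 is hypothetical; the command needs root privileges.

```shell
# Set net.ipv4.tcp_retries2 in the running kernel (requires root).
# The value defaults to 7, the setting recommended in this article.
apply_tcp_retries2() {
  sysctl -w "net.ipv4.tcp_retries2=${1:-7}"
}
```

As an alternative, sysctl -p followed by the path of the persistence file loads the value from that file instead of passing it on the command line.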

Step 3: Restart the affected service
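
A sketch of the restart, assuming the service runs under systemd. The unit name pf9-cindervolume-base is an assumption inferred from the log file name used in this article, and restart_block_storage is a hypothetical wrapper; confirm the actual unit name on the host (for example with systemctl list-units) before running it.

```shell
# Restart the Block Storage service so that it opens fresh connections
# under the new kernel setting (requires root).
# The default unit name is an assumption, not confirmed by this article.
restart_block_storage() {
  systemctl restart "${1:-pf9-cindervolume-base}"
}
```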

Step 4: Verify the change
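
The live value can be read with sysctl net.ipv4.tcp_retries2 or directly from /proc. The check below is a small sketch that compares the live value with the expected one; check_tcp_retries2 is a hypothetical name, and the second argument is only there so the check can be pointed at a different file.

```shell
# Compare the live kernel value with the expected one (default 7).
check_tcp_retries2() {
  expected=${1:-7}
  proc_file=${2:-/proc/sys/net/ipv4/tcp_retries2}
  actual=$(cat "$proc_file")
  if [ "$actual" = "$expected" ]; then
    echo "OK: net.ipv4.tcp_retries2 = $actual"
  else
    echo "MISMATCH: expected $expected, found $actual"
    return 1
  fi
}
```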

Validation

  1. Verify Kernel Value: Run sysctl net.ipv4.tcp_retries2 and confirm that the reported value is 7.

  2. Check Persistence: Confirm /etc/sysctl.d/99-cinder-tuning.conf contains the correct value.

  3. Monitor Logs: Ensure /var/log/pf9/cindervolume-base.log no longer shows report_state warnings exceeding 60 seconds.
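
The log check in item 3 can be scripted roughly as follows. This sketch assumes the warning contains the phrase "outlasted interval by N sec" and reads the 60-second limit as applying to that N; both are assumptions about the log format, and overruns_over_threshold is a hypothetical name.

```shell
# Print report_state warnings whose reported overrun exceeds a threshold
# (default 60 seconds). No output means nothing crossed the threshold.
overruns_over_threshold() {
  log_file=${1:-/var/log/pf9/cindervolume-base.log}
  threshold=${2:-60}
  grep 'report_state' "$log_file" |
    awk -v limit="$threshold" '
      match($0, /outlasted interval by [0-9.]+ sec/) {
        # Field 4 of the matched phrase is the numeric delay.
        split(substr($0, RSTART, RLENGTH), parts, " ")
        if (parts[4] + 0 > limit) print
      }'
}
```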

Additional Information

This issue is currently being worked on and is tracked under enhancement PCD-5500. Until an official recommendation is available, this manual tuning procedure must be followed for all affected Block Storage hosts.
