Compute Nodes Showing as Offline in PCD UI

Problem

Compute nodes appear as offline in the PCD UI. Restarting the hostagent service takes unusually long to stop and start. In addition, storage-related commands such as lvs and multipath hang or fail with errors.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher
  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
  • Component: Compute Service
  • Storage: External SAN/NAS accessed via multipath

Cause

The issue is caused by underlying storage connectivity problems leading to hung I/O operations. As a result, commands like lvs and multipath become unresponsive, impacting the hostagent service’s ability to report the node status correctly to the control plane.

Diagnostics

  1. Check Hostagent Restart Behavior
    1. If restarting the hostagent service hangs or takes several minutes, the agent is likely waiting on disk I/O operations or device responses from the storage layer.
    2. Delays here typically point to I/O threads blocked by storage latency.
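For example (a sketch assuming the hostagent runs as the pf9-hostagent systemd unit; verify the exact unit name in your deployment):

```shell
# Time the restart; several minutes here suggests the agent is blocked on storage I/O.
time sudo systemctl restart pf9-hostagent

# Review the unit state after the restart.
sudo systemctl status pf9-hostagent --no-pager
```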
  2. Verify Multipath Devices
    1. Run sudo multipath -ll and check for faulty or failed paths.
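On an affected node, failed paths look along these lines (illustrative output only; the WWID, vendor string, and device names will differ in your environment):

```shell
sudo multipath -ll
# 360002ac000000000000001230001e7d1 dm-3 VENDOR,LUN
# size=500G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
# `-+- policy='service-time 0' prio=0 status=enabled
#   |- 3:0:0:1 sdc 8:32 failed faulty running
#   `- 4:0:0:1 sde 8:64 failed faulty running
#
# Healthy paths instead report "active ready running".
```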
  3. Check Logical Volumes
    1. If sudo lvs hangs or responds slowly, LVM metadata access is blocked, which depends on I/O responsiveness from the underlying storage.
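For example, wrapping the command in timeout avoids wedging your shell if the storage is hung:

```shell
# List logical volumes; give up after 10 s instead of hanging the shell.
# A non-zero exit with no output points to blocked storage I/O.
sudo timeout 10 lvs
```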
  4. Measure Storage I/O Performance: Use the following commands to verify and record evidence of slow or unresponsive storage.
    a. iostat – Live Extended Device Statistics
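Matching the flag descriptions below, the invocation is:

```shell
# Extended per-device statistics in MB/s, sampled every 2 seconds, 5 times.
iostat -xmd 2 5
```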

-x : Display extended statistics (includes await, svctm, %util).

-m : Display statistics in megabytes per second.

-d : Display only device utilization reports.

2 5 : Report every 2 seconds, for 5 iterations.

Look for:

await : Average time (ms) for I/O completion (high = slow disk).

%util : Device utilization (close to 100% = fully busy).

b. sar – Historical Device Utilization

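Matching the flag descriptions below, the invocation is:

```shell
# Block-device statistics every 2 seconds, 5 iterations (sar is part of sysstat).
sar -d 2 5
```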

-d : Report block device statistics.

2 5 : Report every 2 seconds, 5 times.

c. dd – Simple Sequential Read/Write Speed Test
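A minimal sketch, using /tmp/dd-speedtest as a placeholder scratch path (pick a path on the storage you actually want to measure, and note that the test writes real data):

```shell
# Sequential write: 100 MiB, flushed to disk (conv=fdatasync) so the reported rate is honest.
dd if=/dev/zero of=/tmp/dd-speedtest bs=1M count=100 conv=fdatasync
# Sequential read of the same file.
dd if=/tmp/dd-speedtest of=/dev/null bs=1M
# Remove the scratch file.
rm -f /tmp/dd-speedtest
```

dd reports throughput on stderr; single-digit MB/s rates or multi-minute stalls on SAN-backed storage corroborate the hung-I/O diagnosis.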

d. fio – Detailed Random I/O Performance Test
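If fio is installed, a random-read test can be sketched as follows (the file path, size, and runtime are placeholder values; the libaio engine assumes Linux):

```shell
# 4 KiB random reads, queue depth 16, for 30 s; check the clat latency percentiles in the output.
fio --name=randread-test --filename=/tmp/fio-test --size=100M \
    --rw=randread --bs=4k --iodepth=16 --ioengine=libaio \
    --direct=1 --runtime=30 --time_based --group_reporting
# Remove the test file afterwards.
rm -f /tmp/fio-test
```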

Resolution

As an immediate recovery workaround:

  • Reboot the affected host after reviewing the VMs currently running on it.

If active VMs are present on the host, schedule downtime for the reboot so that all virtual machines can be shut down safely, avoiding data loss or interruptions.

  • After the reboot, if the host does not automatically come online, restart the relevant Platform9 services on the node, starting with the hostagent.
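Assuming the hostagent runs as the pf9-hostagent systemd unit (verify the exact pf9-* unit names on your node), for example:

```shell
# List the Platform9 units present on the node, then restart the hostagent.
systemctl list-units 'pf9-*' --no-pager
sudo systemctl restart pf9-hostagent
```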

Validation

  • Confirm that the compute node appears online in the PCD UI.
  • Verify the hostagent service status:
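For example (again assuming the pf9-hostagent unit name):

```shell
# Expect "active"; any other state warrants a look at the unit's journal.
systemctl is-active pf9-hostagent
sudo journalctl -u pf9-hostagent --since "10 min ago" --no-pager
```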
  • Run sudo multipath -ll and ensure all paths are in active/ready state (no failed paths).

The customer should work with their storage team to investigate the root cause of the faulty or failed paths reported by multipath. The storage team should:

  • Verify SAN/LUN accessibility from the compute nodes.
  • Check path redundancy and multipath configurations.
  • Review storage-side logs for any errors or path instability.
  • Ensure consistent and stable connectivity between the compute hosts and the storage array to prevent recurrence.

Additional Information

  • This issue is typically linked to transient or persistent storage-side connectivity problems affecting multipath devices.
  • Proactive monitoring of storage path health and regular validation of multipath configurations can help avoid future occurrences.