Compute Nodes Showing as Offline in PCD UI

Problem

Compute nodes appear as offline in the PCD UI. Restarting the hostagent service takes unusually long to stop and start. In addition, storage-related commands such as lvs and multipath hang or fail with errors.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher
  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
  • Component: Compute Service
  • Storage: External SAN/NAS accessed via multipath

Cause

The issue is caused by underlying storage connectivity problems leading to hung I/O operations. As a result, commands like lvs and multipath become unresponsive, impacting the hostagent service’s ability to report the node status correctly to the control plane.

Diagnostics

  1. Check Hostagent Restart Behavior
    1. If restarting the hostagent service hangs or takes several minutes, the agent is likely waiting on disk I/O operations or device responses from the storage layer.
    2. Delays here typically point to I/O threads blocked by storage latency.
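For example (a sketch assuming the hostagent runs as the pf9-hostagent systemd unit; verify the exact unit name in your deployment):

```shell
# Time the restart; several minutes here suggests the agent is blocked on storage I/O.
time sudo systemctl restart pf9-hostagent

# Review the unit state after the restart.
sudo systemctl status pf9-hostagent --no-pager
```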
  2. Verify Multipath Devices
    1. Run sudo multipath -ll and check for faulty or failed paths.
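On an affected node, failed paths look along these lines (illustrative output only; the WWID, vendor string, and device names will differ in your environment):

```shell
sudo multipath -ll
# 360002ac000000000000001230001e7d1 dm-3 VENDOR,LUN
# size=500G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
# `-+- policy='service-time 0' prio=0 status=enabled
#   |- 3:0:0:1 sdc 8:32 failed faulty running
#   `- 4:0:0:1 sde 8:64 failed faulty running
#
# Healthy paths instead report "active ready running".
```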
  3. Check Logical Volumes
    1. If sudo lvs hangs or responds slowly, LVM metadata access is blocked, which depends on I/O responsiveness from the underlying storage.
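For example, wrapping the command in timeout avoids wedging your shell if the storage is hung:

```shell
# List logical volumes; give up after 10 s instead of hanging the shell.
# A non-zero exit with no output points to blocked storage I/O.
sudo timeout 10 lvs
```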
  4. Measure Storage I/O Performance: Use the following commands to verify and record evidence of slow or unresponsive storage.
    a. iostat – Live Extended Device Statistics
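Matching the flag descriptions below, the invocation is:

```shell
# Extended per-device statistics in MB/s, sampled every 2 seconds, 5 times.
iostat -xmd 2 5
```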

-x : Display extended statistics (includes await, svctm, %util).

-m : Display statistics in megabytes per second.

-d : Display only device utilization reports.

2 5 : Report every 2 seconds, for 5 iterations.

Look for:

await : Average time (ms) for I/O completion (high = slow disk).

%util : Device utilization (close to 100% = fully busy).

b. sar – Historical Device Utilization

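Matching the flag descriptions below, the invocation is:

```shell
# Block-device statistics every 2 seconds, 5 iterations (sar is part of sysstat).
sar -d 2 5
```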

-d : Report block device statistics.

2 5 : Report every 2 seconds, 5 times.

c. dd – Simple Sequential Read/Write Speed Test
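A minimal sketch, using /tmp/dd-speedtest as a placeholder scratch path (pick a path on the storage you actually want to measure, and note that the test writes real data):

```shell
# Sequential write: 100 MiB, flushed to disk (conv=fdatasync) so the reported rate is honest.
dd if=/dev/zero of=/tmp/dd-speedtest bs=1M count=100 conv=fdatasync
# Sequential read of the same file.
dd if=/tmp/dd-speedtest of=/dev/null bs=1M
# Remove the scratch file.
rm -f /tmp/dd-speedtest
```

dd reports throughput on stderr; single-digit MB/s rates or multi-minute stalls on SAN-backed storage corroborate the hung-I/O diagnosis.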

d. fio – Detailed Random I/O Performance Test
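If fio is installed, a random-read test can be sketched as follows (the file path, size, and runtime are placeholder values; the libaio engine assumes Linux):

```shell
# 4 KiB random reads, queue depth 16, for 30 s; check the clat latency percentiles in the output.
fio --name=randread-test --filename=/tmp/fio-test --size=100M \
    --rw=randread --bs=4k --iodepth=16 --ioengine=libaio \
    --direct=1 --runtime=30 --time_based --group_reporting
# Remove the test file afterwards.
rm -f /tmp/fio-test
```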

Resolution

As an immediate recovery workaround:

  • Reboot the affected host after reviewing the VMs currently running on it.

If active VMs are present on the host, schedule downtime for the reboot so that all virtual machines can be shut down safely, avoiding data loss or interruptions.

  • After the reboot, if the host does not automatically come online, restart the relevant Platform9 services on the node, starting with the hostagent.
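Assuming the hostagent runs as the pf9-hostagent systemd unit (verify the exact pf9-* unit names on your node), for example:

```shell
# List the Platform9 units present on the node, then restart the hostagent.
systemctl list-units 'pf9-*' --no-pager
sudo systemctl restart pf9-hostagent
```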

Validation

  • Confirm that the compute node appears online in the PCD UI.
  • Verify the hostagent service status:
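For example (again assuming the pf9-hostagent unit name):

```shell
# Expect "active"; any other state warrants a look at the unit's journal.
systemctl is-active pf9-hostagent
sudo journalctl -u pf9-hostagent --since "10 min ago" --no-pager
```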
  • Run sudo multipath -ll and ensure all paths are in active/ready state (no failed paths).

The customer should work with their storage team to investigate the root cause of the faulty or failed paths reported by multipath. The storage team should:

  • Verify SAN/LUN accessibility from the compute nodes.
  • Check path redundancy and multipath configurations.
  • Review storage-side logs for any errors or path instability.
  • Ensure consistent and stable connectivity between the compute hosts and the storage array to prevent recurrence.

Additional Information

  • This issue is typically linked to transient or persistent storage-side connectivity problems affecting multipath devices.
  • Proactive monitoring of storage path health and regular validation of multipath configurations can help avoid future occurrences.