Compute Nodes Showing as Offline in PCD UI
Problem
Compute nodes appear as offline in the PCD UI. Restarting the hostagent service shows significant delays: it takes an unusually long time to stop and start. Storage-related commands such as lvs and multipath also hang or fail with errors.
Environment
- Private Cloud Director Virtualization - v2025.4 and Higher
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
- Component: Compute Service
- Storage: External SAN/NAS accessed via multipath
Cause
The issue is caused by underlying storage connectivity problems leading to hung I/O operations. As a result, commands like lvs and multipath become unresponsive, impacting the hostagent service’s ability to report the node status correctly to the control plane.
Diagnostics
- Check Hostagent Restart Behavior
- If the command below hangs or takes several minutes, the agent is likely waiting on disk I/O operations or device responses from the storage layer.
- Delays here typically point to blocked I/O threads caused by storage latency.
$ sudo systemctl restart pf9-hostagent
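To quantify the delay and see what the agent was doing during it, the restart can be timed and the recent service logs reviewed (standard systemd and shell tooling, not PCD-specific; the 10-minute window is an arbitrary choice):
$ time sudo systemctl restart pf9-hostagent
$ sudo journalctl -u pf9-hostagent --since "10 min ago" --no-pager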
- Verify Multipath Devices
- Running sudo multipath -ll displays faulty and failed paths similar to the following example:
$ sudo multipath -ll
[DEVICE_WWN] dm-[X] [STORAGE_VENDOR],[MODEL]
size=[SIZE] features='...' hwhandler='...' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- [CONTROLLER]:[CHANNEL]:[TARGET]:[LUN] [DEVICE_NAME] [MAJOR:MINOR] failed faulty running
`-+- policy='service-time 0' prio=0 status=enabled
  `- [CONTROLLER]:[CHANNEL]:[TARGET]:[LUN] [DEVICE_NAME] [MAJOR:MINOR] failed faulty running
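On hosts with many devices, the failing paths can be surfaced quickly by filtering the same output (plain grep over the listing above; the pattern matches the path-state columns shown in the example):
$ sudo multipath -ll | grep -B 2 "failed faulty"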
- Check Logical Volumes
- Hangs or delays while executing the command below indicate blocked LVM metadata access, which depends on I/O responsiveness from the underlying storage.
$ sudo lvs
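When a hang is suspected, running the check under a timeout keeps the shell usable (coreutils timeout; the 10-second limit is an arbitrary choice). An exit status of 124 means the command was killed at the limit, which corroborates blocked storage I/O:
$ sudo timeout 10 lvs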
- Measure Storage I/O Performance: Use the following commands to verify and record evidence of slow or unresponsive storage.
a. iostat – Current Device I/O Statistics
$ iostat -xmd 2 5
-x : Display extended statistics (includes await, svctm, %util).
-m : Display statistics in megabytes per second.
-d : Display only device utilization reports.
2 5 : Report every 2 seconds, for 5 iterations.
Look for:
await : Average time (ms) for I/O completion (high = slow disk).
%util : Device utilization (close to 100% = fully busy).
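To preserve this evidence for a support case, the same run can be captured to a file while still printing to the terminal (plain shell tooling; the file path is arbitrary):
$ iostat -xmd 2 5 | tee /tmp/iostat-evidence.txt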
b. sar – Historical Device Utilization
$ sar -d 2 5
-d : Report block device statistics.
2 5 : Report every 2 seconds, 5 times.
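If the sysstat collector is enabled, sar can also replay statistics recorded around the time of the incident (the log path varies by distribution: /var/log/sysstat/saDD on Debian/Ubuntu, /var/log/sa/saDD on RHEL-family; sa15 below is a hypothetical day-of-month file):
$ sar -d -f /var/log/sysstat/sa15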
c. dd – Simple Sequential Read/Write Speed Test
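A minimal sketch of such a test (the file path, block size, and count are arbitrary choices; the write creates a 1 GiB file, so confirm free space and remove it afterwards; oflag=direct/iflag=direct bypass the page cache so the results reflect the storage itself):
$ dd if=/dev/zero of=/var/tmp/dd-test.img bs=1M count=1024 oflag=direct status=progress
$ dd if=/var/tmp/dd-test.img of=/dev/null bs=1M iflag=direct status=progress
$ rm /var/tmp/dd-test.img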
d. fio – Detailed Random I/O Performance Test
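A hedged example of a small 4k random-read test (fio must be installed; the file name, size, queue depth, and runtime are arbitrary choices, and the test file should be removed afterwards):
$ fio --name=randread --filename=/var/tmp/fio-test.img --size=1G --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --runtime=30 --time_based --group_reporting
$ rm /var/tmp/fio-test.img
Very low IOPS or high completion latencies here corroborate the storage-side problem seen in the earlier checks.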
Resolution
As an immediate recovery workaround:
- Reboot the affected host after checking for VMs currently running on it.
If active VMs are present on the host, schedule downtime for the reboot so that all virtual machines can be safely shut down, avoiding potential data loss or interruptions.
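To check for running VMs before the reboot (assuming the KVM/libvirt virtualization stack used on PCD compute hosts):
$ sudo virsh list --all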
- After reboot, if the host does not automatically come online, restart the following services:
$ sudo systemctl restart pf9-hostagent
$ sudo systemctl restart pf9-comms
$ sudo systemctl restart pf9-sidekick
Validation
- Confirm that the compute node appears online in the PCD UI.
- Verify the hostagent service status:
$ systemctl status pf9-hostagent
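All three Platform9 services can be checked in one pass (plain systemctl; each line of output should read active):
$ systemctl is-active pf9-hostagent pf9-comms pf9-sidekick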
- Run sudo multipath -ll and ensure all paths are in the active/ready state (no failed paths).
The customer should work with their storage team to investigate the root cause of the faulty or failed paths reported by multipath.
The storage team should:
- Verify SAN/LUN accessibility from the compute nodes.
- Check path redundancy and multipath configurations (the host-side view can be captured as shown after this list).
- Review storage-side logs for any errors or path instability.
- Ensure consistent and stable connectivity between the compute hosts and the storage array to prevent recurrence.
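From the compute-host side, the effective multipath configuration and per-path states can be captured and shared with the storage team (standard multipath-tools commands):
$ sudo multipathd show config
$ sudo multipathd show paths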
Additional Information
- This issue is typically linked to transient or persistent storage-side connectivity problems affecting multipath devices.
- Proactive monitoring of storage path health and regular validation of multipath configurations can help avoid future occurrences.