vGPU-Mapped Instances Fail to Boot With Error "Node device not found"

Problem

  • After a vGPU-capable host is rebooted, either as part of an upgrade or a normal reboot, the vGPU-enabled VMs fail to start.

  • As a result, the pf9-ostackhost service does not converge, new VMs cannot be spawned, and existing VMs remain stuck in the powering-up state with the error below:

 INFO nova.virt.libvirt.host [req-ID None None] Secure Boot support detected
 ERROR oslo_service.service [req-ID None None] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev/[UUID_OF_MEDIATED_DEVICES]'
 TRACE oslo_service.service Traceback (most recent call last):
 TRACE oslo_service.service File "/opt/pf9/venv/lib/python3.9/site-packages/oslo/service/service.py", line 810, in run_service
 TRACE oslo_service.service service.start()
[..]
 TRACE oslo_service.service raise libvirtError('virNodeDeviceLookupByName() failed')
 TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev/[UUID_OF_MEDIATED_DEVICES]'
 TRACE oslo_service.service

Environment

  • Private Cloud Director Virtualisation - up to v2025.10-115

  • Self-Hosted Private Cloud Director Virtualisation - up to v2025.10-115

  • Component: GPU [NVIDIA drivers v570 and v580]

Solution

  • This is a known issue tracked under PCD-2656 and is fixed in v2025.10-180 and later versions.

  • For remediation on affected versions, follow the steps in the Workaround section.

Root Cause

  • After the host reboots, the mdev devices are not associated with the consumed vGPUs, because the logic that links the two is not yet implemented in the affected versions.

  • As a result, VMs that try to become active fail with the "Node device not found" error.

Workaround

  1. Run $ mdevctl list . An empty output confirms this issue.

  2. Run $ lspci -nnn | grep -i nvidia and confirm that the output does not list any SR-IOV devices (VFs).

  3. To resolve this issue, run the GPU configuration script located in /opt/pf9/gpu :

    1. Move to the directory that contains the script: $ cd /opt/pf9/gpu/ .

    2. Run the script $ sudo ./pf9-gpu-configure.sh with option 3) vGPU SR-IOV configure.

    3. Run the script again with option 6) Validate vGPU to check that the GPU is configured.

    4. Re-run $ lspci -nnn | grep -i nvidia . The output should now list all the VFs for the given GPU.

    5. Run the script again with option 4) vGPU host configure.

    6. In the UI, the host should now be visible under the GPU host section.

    7. In the UI, select the host and the required GPU profile for the host, then save the form.

    8. Monitor the UI until the host completes the converge action.

  4. After completing step 3, use the command line to list the mdev UUIDs that the pf9-ostackhost logs report for the failing VM instances.

  5. Identify UUID errors: from those log entries, note the UUIDs that are causing the attachment errors.

  6. Map UUIDs to bus IDs: on the vGPU host, use the echo command to map each UUID to the appropriate bus.

  7. Restart NVIDIA vGPU Manager: restart the NVIDIA vGPU manager service and verify its status.

  8. Restart the pf9-ostackhost service and monitor the status of the stuck VMs from the UI.
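
Steps 4 to 6 can be sketched in Python as below. This is only an illustrative sketch, not the official procedure: the helper names are hypothetical, and the example bus ID and vGPU type are placeholders. The echo target uses the Linux kernel's generic sysfs interface for creating mediated devices (/sys/class/mdev_bus/<bus>/mdev_supported_types/<type>/create). Whether this is the exact command the article intends for step 6 is an assumption, so confirm it for your environment before running anything.

```python
import re

# Matches the device name in the libvirt "Node device not found" error.
# Assumption: the name may be rendered as mdev/<uuid> or mdev_<uuid>, with
# "-" or "_" inside the UUID, so both forms are accepted.
_MDEV_NAME = re.compile(
    r"matching name 'mdev[/_]"
    r"([0-9a-fA-F]{8}(?:[-_][0-9a-fA-F]{4}){3}[-_][0-9a-fA-F]{12})"
)


def extract_mdev_uuids(log_text):
    """Return the unique mdev UUIDs named in the errors, in order seen."""
    seen = []
    for match in _MDEV_NAME.finditer(log_text):
        uuid = match.group(1).replace("_", "-").lower()
        if uuid not in seen:
            seen.append(uuid)
    return seen


def mdev_create_command(uuid, bus_id, vgpu_type):
    """Build (but do not run) the echo command that creates one mdev."""
    path = f"/sys/class/mdev_bus/{bus_id}/mdev_supported_types/{vgpu_type}/create"
    return f"echo {uuid} > {path}"


if __name__ == "__main__":
    # Sample log text with a made-up UUID; paste real log lines here.
    sample = (
        "ERROR oslo_service.service libvirt.libvirtError: Node device not "
        "found: no node device with matching name "
        "'mdev/b2107403-110c-45b0-af87-32cc91597b8a'\n"
    )
    for uuid in extract_mdev_uuids(sample):
        # "0000:41:00.4" and "nvidia-558" are placeholders, not real values.
        print(mdev_create_command(uuid, "0000:41:00.4", "nvidia-558"))
```

The sketch only prints the commands so that each UUID, bus ID, and vGPU type can be reviewed before anything is written to sysfs.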

Validation

Confirm the following after the workaround is applied:

  1. The $ mdevctl list command gives non-empty output and lists the newly added mdev devices.

  2. The vGPU-enabled VMs are no longer stuck in the powering-on state.
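
The first check can also be scripted. The sketch below is illustrative only and uses a hypothetical helper name. It assumes each line of mdevctl list output starts with the device UUID, followed by the parent bus ID and the type. It reports any expected UUID that is missing from the pasted output.

```python
def missing_mdevs(mdevctl_output, expected_uuids):
    """Return the expected UUIDs that do not appear in `mdevctl list` output."""
    listed = set()
    for line in mdevctl_output.splitlines():
        fields = line.split()
        if fields:
            # Assumption: the first field on each line is the mdev UUID.
            listed.add(fields[0].lower())
    return [u for u in expected_uuids if u.lower() not in listed]


if __name__ == "__main__":
    # Made-up sample output; paste the real `mdevctl list` output instead.
    output = "b2107403-110c-45b0-af87-32cc91597b8a 0000:41:00.4 nvidia-558\n"
    wanted = ["b2107403-110c-45b0-af87-32cc91597b8a"]
    print(missing_mdevs(output, wanted) or "all expected mdev devices are listed")
```

An empty result means every UUID collected from the pf9-ostackhost logs now has a matching mdev device on the host.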
