vGPU Mapped Instances Fail to Boot With Error "Node device not found"
Problem
- After rebooting a vGPU-capable host (as part of an upgrade or a normal reboot), the vGPU-enabled VMs fail to start.
- This leads to the pf9-ostackhost service not converging properly; new VMs are not spawned, and existing VMs remain stuck in the powering-up state with the error below:
INFO nova.virt.libvirt.host [req-ID None None] Secure Boot support detected
ERROR oslo_service.service [req-ID None None] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID-of-mediated-devices]'
TRACE oslo_service.service Traceback (most recent call last):
TRACE oslo_service.service   File "/opt/pf9/venv/lib/python3.9/site-packages/oslo_service/service.py", line 810, in run_service
TRACE oslo_service.service     service.start() [..]
TRACE oslo_service.service     raise libvirtError('virNodeDeviceLookupByName() failed')
TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID-of-mediated-devices]'
Environment
- Private Cloud Director Virtualisation - v2025.6 and Higher
- Self-Hosted Private Cloud Director Virtualisation - v2025.6 and Higher
- Component: GPU [NVIDIA drivers v570 and v580]
Solution
- This is a known issue tracked as bug PCD-2656. To follow its progress, reach out to the Platform9 Support Team and mention the bug ID.
- For remediation, follow the details provided in the Workaround section.
Root Cause
- The identified cause is that the mdev devices do not get re-associated with the consumed vGPUs after the reboot, because the interconnecting logic has not yet been implemented.
- As a result, the VMs that were trying to go active fail with a "Node device not found" error (one way to observe this on the affected host is sketched below).
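One way to observe the missing association is to ask libvirt for its mediated node devices; empty output after the reboot corresponds to the virNodeDeviceLookupByName() failure seen in the trace. This is a minimal diagnostic sketch, not part of the product tooling:

# libvirt's view of mediated (mdev) node devices; empty output means the devices
# referenced by the VM definitions no longer exist on the host
$ virsh nodedev-list --cap mdev
# the mdev bus on the host; an empty directory tells the same story at the kernel level
$ ls /sys/bus/mdev/devices/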
Workaround
To confirm this issue, check the following:
- The output of $ mdevctl list is empty.
- The output of $ lspci -nnn | grep -i nvidia does not list the SR-IOV virtual functions (VFs).

To resolve this issue, run the GPU configuration script located at /opt/pf9/gpu. While executing the script:
- Move to the location of the script: $ cd /opt/pf9/gpu/
- Run the script $ sudo ./pf9-gpu-configure.sh with option 3) vGPU SR-IOV configure.
- Run the script pf9-gpu-configure.sh with option 6) Validate vGPU to check that the GPU is configured.
- Re-run $ lspci -nnn | grep -i nvidia, which should now list all the VFs for the given GPU (see the verification sketch after this list).
- Run pf9-gpu-configure.sh with option 4) vGPU host configure.
- From the UI, under the GPU host section, the host should now be visible.
- From the UI, select the host and the required GPU profile for the host, and save the form.
- Monitor the UI to confirm that the host completes the converge action.
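The same checks can be run before and after the script as a quick verification. This is a minimal sketch; the PCI address 0000:21:00.0 is an assumed physical-function address inferred from the VF addresses used in the examples below, so adjust it for your host:

# mediated devices known to the host; empty output confirms the issue
$ mdevctl list
# NVIDIA PCI functions; after SR-IOV configuration the VFs should be listed
$ lspci -nnn | grep -i nvidia
# number of VFs currently enabled on the physical function (0 means SR-IOV is not configured)
$ cat /sys/bus/pci/devices/0000:21:00.0/sriov_numvfs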
After Step 3, list out from the command line the UUIDs associated with the failing VM instances from the pf9-ostackhost logs.

Identify UUID Errors
- Check the ostackhost logs to identify the UUIDs causing attachment errors; the entries look like the line below (a grep sketch for collecting the UUIDs follows the example):
TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID_OF_MEDIATED_DEVICES]'
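The UUIDs can be collected in one pass by extracting the mdev names from the error lines. A minimal sketch, assuming the pf9-ostackhost log is written to /var/log/pf9/ostackhost/ostackhost.log (the log path is an assumption and may differ on your host):

# print the unique mdev names referenced by the failures; the UUID follows the mdev_ prefix
$ grep 'Node device not found' /var/log/pf9/ostackhost/ostackhost.log | grep -oE "mdev_[0-9a-fA-F_-]+" | sort -u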
Map UUIDs to Bus IDs
- Use the echo command to map each UUID to the appropriate bus on the vGPU host (a sketch for recreating several devices at once follows this step):
$ echo <UUID> > /sys/class/mdev_bus/<BUS_ID>/mdev_supported_types/nvidia-558/create
Example:
$ echo [UUID_OF_MEDIATED_DEVICES] > /sys/class/mdev_bus/0000:21:00.5/mdev_supported_types/nvidia-558/create
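When several VMs are affected, the create step can be repeated from a small loop. This is a minimal sketch, assuming each UUID is paired with the VF it should be created on and that nvidia-558 is the vGPU type in use (both mirror the example above); replace the placeholder pairs with the values from your logs and lspci output:

# recreate one mdev per "UUID BUS_ID" pair listed below
while read -r uuid bdf; do
  echo "creating mdev ${uuid} on ${bdf}"
  echo "${uuid}" | sudo tee "/sys/class/mdev_bus/${bdf}/mdev_supported_types/nvidia-558/create" > /dev/null
done <<'EOF'
[UUID_OF_MEDIATED_DEVICES_1] 0000:21:00.5
[UUID_OF_MEDIATED_DEVICES_2] 0000:21:00.6
[UUID_OF_MEDIATED_DEVICES_3] 0000:21:00.7
EOF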
Restart NVIDIA vGPU Manager
- Restart the NVIDIA vGPU manager service and verify its status:
$ systemctl restart nvidia-vgpu-mgr
$ systemctl status nvidia-vgpu-mgr
- Restart the ostackhost service and monitor the status of the stuck VMs from the UI:
$ systemctl restart pf9-ostackhost

Validation
List the newly added mdev devices using the command below:
- The mdevctl list command now gives non-empty output (an additional host-level check is sketched after this list):
$ mdevctl list
Example:
$ mdevctl list
[UUID_OF_MEDIATED_DEVICES_1] 0000:21:00.5 nvidia-558
[UUID_OF_MEDIATED_DEVICES_2] 0000:21:00.6 nvidia-558
[UUID_OF_MEDIATED_DEVICES_3] 0000:21:00.7 nvidia-558
- The vGPU-enabled VMs are no longer stuck in the powering-on state.
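As an additional host-level check (a sketch, not product tooling), the recreated devices should be visible both to the kernel and to libvirt, the layer that originally raised the error:

# each recreated mdev appears as a UUID-named symlink pointing at its parent VF
$ ls -l /sys/bus/mdev/devices/
# libvirt should now list the mediated node devices it previously failed to look up
$ virsh nodedev-list --cap mdev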