vGPU Mapped Instances Fail to Boot With Error "Node device not found"
Problem
- After rebooting the vGPU-capable host as part of an upgrade or a normal reboot, the vGPU-enabled VMs fail to start.
- This leads to the pf9-ostackhost service not converging properly; new VMs are not spawned, and existing VMs remain stuck in the powering-up state with the error below:
INFO nova.virt.libvirt.host [req-ID None None] Secure Boot support detected
ERROR oslo_service.service [req-ID None None] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID-of-mediated-devices]'
TRACE oslo_service.service Traceback (most recent call last):
TRACE oslo_service.service File "/opt/pf9/venv/lib/python3.9/site-packages/oslo_service/service.py", line 810, in run_service
TRACE oslo_service.service service.start()
[..]
TRACE oslo_service.service raise libvirtError('virNodeDeviceLookupByName() failed')
TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID-of-mediated-devices]'
TRACE oslo_service.service
Environment
- Private Cloud Director Virtualisation - v2025.6 and Higher
- Self-Hosted Private Cloud Director Virtualisation - v2025.6 and Higher
- Component: GPU [NVIDIA drivers v570 and v580]
Solution
- This is a known issue tracked as bug PCD-2656. To follow its progress, reach out to the Platform9 Support Team and mention the bug ID.
- For remediation, follow the details provided in the Workaround section.
Root Cause
- The identified cause is that the mdev devices do not get re-associated with the consumed vGPUs after the reboot, because the interconnecting logic has not yet been implemented.
- As a result, the VMs trying to go active fail with the "Node device not found" error (illustrated below).
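For context, each vGPU-enabled instance's libvirt domain XML references its mediated device by UUID; when no node device named mdev_<UUID> exists on the host after the reboot, libvirt's lookup fails and the VM cannot start. A minimal way to inspect this mapping, assuming instance-0000000a stands in for the affected guest's libvirt domain name (the output shown is illustrative):
# Show the vGPU (mdev) assignment recorded in the guest definition; the UUID shown
# must correspond to an existing node device for the guest to start.
$ virsh dumpxml instance-0000000a | grep -A 4 "type='mdev'"
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci'>
      <source>
        <address uuid='[UUID_OF_MEDIATED_DEVICES]'/>
      </source>
    </hostdev>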
Workaround
To confirm this issue, check that the output of the following command is empty:
$ mdevctl list
Also verify that the output of the following command does not list the SR-IOV virtual functions (VFs):
$ lspci -nnn | grep -i nvidia
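As an optional extra confirmation, the VF count of the GPU's physical function can be read from sysfs; while SR-IOV is unconfigured it reads 0. A minimal sketch, assuming 0000:21:00.0 stands in for your GPU's physical function PCI address:
# With SR-IOV not (re)configured after the reboot, no virtual functions are enabled on the PF.
$ cat /sys/bus/pci/devices/0000:21:00.0/sriov_numvfs
0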
To resolve this issue, run the GPU Configuration Script located in /opt/pf9/gpu. While executing the script:
- Move to the location of the script:
$ cd /opt/pf9/gpu/
- Run the script with option 3) vGPU SR-IOV configure:
$ sudo ./pf9-gpu-configure.sh
- Run pf9-gpu-configure.sh with option 6) Validate vGPU to check that the GPU is configured.
- Re-run the following command, which should now list all the VFs for the given GPU (see the check after this list):
$ lspci -nnn | grep -i nvidia
- Run pf9-gpu-configure.sh with option 4) vGPU host configure.
- From the UI, under the GPU host section, the host should now be visible.
- From the UI, select the host and the required GPU profile for the host, and save the form.
- Monitor the UI to confirm that the host completes the converge action.
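If needed, the SR-IOV configure step can also be double-checked from sysfs (this is the check referenced in the lspci step above); the physical function address 0000:21:00.0 is a placeholder and must be replaced with your GPU's PCI address:
# After option 3) vGPU SR-IOV configure, the PF should report a non-zero VF count and the
# extra NVIDIA functions should appear in the lspci output.
$ cat /sys/bus/pci/devices/0000:21:00.0/sriov_numvfs
$ lspci -nnn | grep -ci nvidia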
After Step 3, from the command line, list the UUIDs associated with the failing VM instances using the pf9-ostackhost logs.
Identify UUID Errors
- Check the ostackhost logs to identify the UUIDs causing the attachment errors:
TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID_OF_MEDIATED_DEVICES]'
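The failing UUIDs can be pulled out of the logs with a one-liner such as the sketch below; the log path is an assumption and should be adjusted to wherever pf9-ostackhost writes its logs on your deployment:
# Extract the mediated-device UUIDs referenced by the failure messages (log path is an assumption).
$ grep -hoE "mdev_[0-9a-f-]{36}" /var/log/pf9/ostackhost/ostackhost*.log | sed 's/^mdev_//' | sort -u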
Map UUIDs to Bus IDs
- Use the echo command to map each UUID to the appropriate bus on the vGPU host:
$ echo <UUID> > /sys/class/mdev_bus/<BUS_ID>/mdev_supported_types/nvidia-558/create
Example:
$ echo [UUID_OF_MEDIATED_DEVICES] > /sys/class/mdev_bus/0000:21:00.5/mdev_supported_types/nvidia-558/create
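If the available bus IDs or supported type names are not known up front, they can be read from sysfs before running the create command above; the VF address below is a placeholder, and the listed type names vary by GPU model and driver version:
# List the VFs that can host mediated devices, then the vGPU types a given VF supports.
$ ls /sys/class/mdev_bus/
$ ls /sys/class/mdev_bus/0000:21:00.5/mdev_supported_types/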
Restart NVIDIA vGPU Manager
- Restart the NVIDIA vGPU manager service and verify its status:
$ systemctl restart nvidia-vgpu-mgr
$ systemctl status nvidia-vgpu-mgr
- Restart the pf9-ostackhost service and monitor the status of the stuck VMs from the UI:
$ systemctl restart pf9-ostackhost
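To confirm that libvirt can now resolve the recreated mediated devices (the lookup that originally failed), a quick check from the host is:
# Each recreated mdev should now appear as a libvirt node device.
$ virsh nodedev-list | grep -i mdev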
Validation
List the newly added mdev devices using the command below:
- The mdevctl list command now gives non-empty output.
$ mdevctl list
Example:
$ mdevctl list
[UUID_OF_MEDIATED_DEVICES_1] 0000:21:00.5 nvidia-558d
[UUID_OF_MEDIATED_DEVICES_2] 0000:21:00.6 nvidia-558d
[UUID_OF_MEDIATED_DEVICES_3] 0000:21:00.7 nvidia-558d
- The vGPU-enabled VMs will no longer be stuck in the powering-on state.
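As a final sanity check from the hypervisor CLI (optional, and assuming direct host access), the guest domains can also be listed with libvirt; the previously stuck guests should be able to start once the services converge:
# Previously stuck guests should move to the running state after the services converge.
$ virsh list --all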