# vGPU Mapped Instances Fail to Boot With Error "Node device not found"

## Problem

* After rebooting a vGPU-capable host (as part of an upgrade or a normal reboot), the vGPU-enabled VMs fail to start.
* This prevents the `pf9-ostackhost` service from converging properly: new VMs are not spawned, and existing VMs remain stuck in the powering-up state with the error below:

```bash
 INFO nova.virt.libvirt.host [req-ID None None] Secure Boot support detected
 ERROR oslo_service.service [req-ID None None] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev/[UUID_OF_MEDIATED_DEVICES]'
 TRACE oslo_service.service Traceback (most recent call last):
 TRACE oslo_service.service File "/opt/pf9/venv/lib/python3.9/site-packages/oslo/service/service.py", line 810, in run_service
 TRACE oslo_service.service service.start()
[..]
 TRACE oslo_service.service raise libvirtError('virNodeDeviceLookupByName() failed')
 TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev/[UUID_OF_MEDIATED_DEVICES]'
 TRACE oslo_service.service
```

## Environment

* Private Cloud Director Virtualisation - up to v2025.10-115
* Self-Hosted Private Cloud Director Virtualisation - up to v2025.10-115
* Component: GPU \[NVIDIA drivers `v570` and `v580`]

## Solution

* This is a known issue tracked under ***PCD-2656*** and is fixed in **v2025.10-180 and above.**
* For remediation, follow the steps in the Workaround section below.

## Root Cause

* The mdev devices are not re-associated with the vGPUs consumed by the instances after the host reboot, because the logic that re-creates this association has not yet been implemented.
* As a result, VMs attempting to become active fail with the "Node device not found" error.
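
A quick way to confirm the mismatch is to compare the mdev UUID referenced in the instance's libvirt definition with the node devices libvirt currently knows about. A minimal sketch, assuming a hypothetical domain name `instance-0000000a` (substitute the affected instance's libvirt name):

```bash
# Mediated devices libvirt currently knows about
# (empty output after the reboot indicates this issue).
virsh nodedev-list --cap mdev

# The mdev UUID the instance expects to attach
# (instance-0000000a is a placeholder domain name).
virsh dumpxml instance-0000000a | grep -A3 "type='mdev'"
```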

## Workaround

1. Confirm the issue by checking that the output of `$ mdevctl list` is empty.
2. Also confirm that the output of `$ lspci -nnn | grep -i nvidia` does not list the SR-IOV virtual functions (VFs).
3. To resolve this issue, run the GPU configuration script located in `/opt/pf9/gpu` (a consolidated command sketch follows this list). While executing the script:
   1. Move to the directory containing the script: `$ cd /opt/pf9/gpu/`.
   2. Run the script `$ sudo ./pf9-gpu-configure.sh` with option `3) vGPU SR-IOV configure`.
   3. Run the script `pf9-gpu-configure.sh` with option `6) Validate vGPU` to check whether the GPU is configured.
   4. Re-run `$ lspci -nnn | grep -i nvidia`, which should now list all the VFs for the given GPU.
   5. Run `pf9-gpu-configure.sh` with option `4) vGPU host configure`.
   6. In the UI, the host should now be visible under the `GPU host` section.
   7. In the UI, select the host and the required GPU profile for the host, then save the form.
   8. Monitor the UI until the host completes the converge action.
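
The same sequence, collected into a single sketch (the script is interactive, so the menu options are chosen at its prompts rather than passed as flags):

```bash
# Move to the directory that ships the GPU configuration script
cd /opt/pf9/gpu/

# The script is interactive; when prompted, choose in order:
#   3) vGPU SR-IOV configure
#   6) Validate vGPU
#   4) vGPU host configure
sudo ./pf9-gpu-configure.sh

# Confirm the SR-IOV virtual functions are now visible
lspci -nnn | grep -i nvidia
```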
4. After Step 3, from the command line, list the UUIDs associated with the failing VM instances from the `pf9-ostackhost` logs.
5. Identify UUID errors - check the `ostackhost` logs for the UUIDs causing attachment errors:

```bash
TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID_OF_MEDIATED_DEVICES]'
```
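
One way to collect these UUIDs is to extract them from the error lines in the log file. A minimal sketch, assuming the service writes to `/var/log/pf9/ostackhost/ostackhost.log` (the path may differ on your deployment):

```bash
# Log path is an assumption; adjust to where pf9-ostackhost logs on your host.
LOG=/var/log/pf9/ostackhost/ostackhost.log

# Pull the UUIDs out of the "Node device not found" errors, de-duplicated.
grep "Node device not found" "$LOG" \
  | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
  | sort -u
```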

6. Map UUIDs to bus IDs - use `echo` and `tee` to create each mediated device under the appropriate VF bus address on the vGPU host:

```bash
$ echo <UUID> | sudo tee /sys/bus/pci/devices/<BUS_ID>/mdev_supported_types/nvidia-558/create

Example:
$ echo [UUID_OF_MEDIATED_DEVICES] | sudo tee /sys/bus/pci/devices/0000:21:00.4/mdev_supported_types/nvidia-558/create
```
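
When several instances are affected, the same command is repeated once per UUID and VF address. A minimal sketch, using hypothetical UUID-to-VF pairs (replace them with the UUIDs gathered from the logs and the VF addresses shown by `lspci`):

```bash
# Hypothetical UUID-to-VF pairs; substitute the values from your logs
# and the VF addresses reported by `lspci -nnn | grep -i nvidia`.
declare -A MDEVS=(
  ["aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"]="0000:21:00.4"
  ["ffffffff-1111-2222-3333-444444444444"]="0000:21:00.5"
)

for uuid in "${!MDEVS[@]}"; do
  bdf="${MDEVS[$uuid]}"
  echo "$uuid" | sudo tee "/sys/bus/pci/devices/${bdf}/mdev_supported_types/nvidia-558/create"
done
```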

7. Restart the NVIDIA vGPU manager - restart the `nvidia-vgpu-mgr` service and verify its status:

```bash
$ systemctl restart nvidia-vgpu-mgr
$ systemctl status nvidia-vgpu-mgr
```

8. Restart the `pf9-ostackhost` service and monitor the status of the stuck VMs from the UI.

```bash
$ systemctl restart pf9-ostackhost
```
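
To confirm the service came back cleanly, check its status and watch for any repeat of the "Node device not found" error. A minimal sketch using standard systemd tooling (depending on how logging is configured, the service may also write files under `/var/log/pf9`):

```bash
# Confirm the service is active and not crash-looping
systemctl status pf9-ostackhost

# Follow the unit's journal and watch for further errors
journalctl -u pf9-ostackhost -f
```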

## Validation

List the newly added mdev devices using the command below:

1. The `mdevctl list` command now gives non-empty output.

```bash
$ mdevctl list

Example:
$ mdevctl list
[UUID_OF_MEDIATED_DEVICES_1] 0000:21:00.5 nvidia-558d
[UUID_OF_MEDIATED_DEVICES_2] 0000:21:00.6 nvidia-558d
[UUID_OF_MEDIATED_DEVICES_3] 0000:21:00.7 nvidia-558d
```

2. The vGPU-enabled VMs are no longer stuck in the `powering-on` state.
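
As an additional check on the host itself, the libvirt domains for the previously stuck instances should now be running. A minimal sketch using standard libvirt tooling:

```bash
# The affected instances should now show as "running"
sudo virsh list --all
```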

