vGPU-Mapped Instances Fail to Boot With Error "Node device not found"

Problem

  • After rebooting a vGPU-capable host (as part of an upgrade or a normal reboot), the vGPU-enabled VMs fail to start.
  • As a result, the pf9-ostackhost service does not converge properly, no new VMs are spawned, and existing VMs remain stuck in the powering-on state with a "Node device not found" error in ostackhost.log.

Environment

  • Private Cloud Director Virtualisation - v2025.6 and Higher
  • Self-Hosted Private Cloud Director Virtualisation - v2025.6 and Higher
  • Component: GPU [NVIDIA drivers v570 and v580]

Solution

  • This is a known issue, tracked as PCD-2656. To follow its progress, reach out to the Platform9 Support Team and mention the bug ID.
  • For remediation, follow the details provided in the Workaround section below.

Root Cause

  • The identified cause is that, after a reboot, the mdev devices are not re-associated with the consumed vGPU, because the logic that interconnects them has not yet been implemented.
  • As a result, VMs attempting to go active fail with a "Node device not found" error.

Workaround

  1. Confirm the issue by checking that the output of $ mdevctl list is empty.

  2. Similarly, the output of $ lspci -nnn | grep -i nvidia does not list the SR-IOV virtual functions (VFs).
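The two checks above can be run together. The following is a minimal sketch: on an affected host, `mdevctl list` prints nothing and `lspci` shows no NVIDIA VFs; both checks are guarded so the script is harmless on machines without these tools.

```shell
#!/bin/sh
# Diagnostic sketch for the symptom described above.
if command -v mdevctl >/dev/null 2>&1; then
  # On an affected host this output is empty.
  [ -z "$(mdevctl list)" ] && echo "mdevctl list is empty: symptom present"
else
  echo "mdevctl not installed here"
fi

if command -v lspci >/dev/null 2>&1; then
  # On an affected host no SR-IOV virtual functions are listed.
  lspci -nnn | grep -i nvidia || echo "no NVIDIA devices or VFs listed"
else
  echo "lspci not installed here"
fi
diagnostics=done
```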

  3. To resolve this issue, run the GPU configuration script located in /opt/pf9/gpu. While executing the script:

    1. Move to the directory containing the script: $ cd /opt/pf9/gpu/ .
    2. Run the script $ sudo ./pf9-gpu-configure.sh with option 3) vGPU SR-IOV configure.
    3. Run the script pf9-gpu-configure.sh with option 6) Validate vGPU to check that the GPU is configured.
    4. Re-run $ lspci -nnn | grep -i nvidia , which should now list all the VFs for the given GPU.
    5. Run pf9-gpu-configure.sh with option 4) vGPU host configure.
    6. From the UI, under the GPU host section, the host should now be visible.
    7. From the UI, select the host and the required GPU profile for the host, then save the form.
    8. Monitor the UI until the host completes the converge action.
  4. After Step 3, from the command line, list the UUIDs associated with the failing VM instances from the pf9-ostackhost logs.

  5. Identify UUID errors: check the ostackhost logs to identify the UUIDs causing attachment errors.
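One way to pull the UUIDs out of the log is a regex match. The log line below is a hypothetical example (the real message format, and the log path suggested in the comment, may differ on your host):

```shell
#!/bin/sh
# Hypothetical ostackhost log line; the real message format may differ.
line='ERROR nova.virt.libvirt.driver mdev_11111111-2222-3333-4444-555555555555: Node device not found'

# Extract every UUID-shaped token from the line. Against the real log,
# the same regex can be run over the whole file and de-duplicated, e.g.:
#   grep -oE '<regex>' /var/log/pf9/ostackhost.log | sort -u   # path is an assumption
uuids=$(echo "$line" | grep -oE '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')
echo "$uuids"
```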

  6. Map UUIDs to bus IDs: use the echo command to map the UUIDs to the appropriate bus on the vGPU host.
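The article's exact echo command is not reproduced here; as a sketch, the kernel's standard mdev sysfs interface re-creates a mediated device by echoing its UUID into the create node under the VF's bus address. The UUID, VF bus address (BDF), and vGPU type below are placeholders; substitute values from your own host.

```shell
#!/bin/sh
# Sketch of the standard mdev sysfs create interface; all three
# values below are placeholders, not values from this article.
UUID='11111111-2222-3333-4444-555555555555'
BDF='0000:3b:00.4'    # a VF address from `lspci -nnn | grep -i nvidia`
TYPE='nvidia-256'     # a profile directory under mdev_supported_types/

CREATE="/sys/class/mdev_bus/$BDF/mdev_supported_types/$TYPE/create"
echo "would run: echo $UUID > $CREATE"

# Only write when the sysfs node actually exists on this host.
if [ -w "$CREATE" ]; then
  echo "$UUID" > "$CREATE"
fi
```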
  7. Restart NVIDIA vGPU Manager: restart the NVIDIA vGPU manager service and verify its status.
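A sketch of the restart, assuming the usual systemd unit name nvidia-vgpu-mgr.service (this can vary by driver packaging); the guard skips hosts where the service is absent:

```shell
#!/bin/sh
# Restart the NVIDIA vGPU manager and verify it, skipping hosts
# where systemd or the unit is not present.
SVC=nvidia-vgpu-mgr.service
if command -v systemctl >/dev/null 2>&1 \
   && systemctl list-unit-files "$SVC" 2>/dev/null | grep -q "$SVC"; then
  sudo systemctl restart "$SVC"
  systemctl status "$SVC" --no-pager
else
  echo "skipping: $SVC not present on this machine"
fi
restart_step=done
```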
  8. Restart the pf9-ostackhost service and monitor the status of the stuck VMs from the UI.

Validation

List the newly added mdev devices to validate the fix:

  1. mdevctl list now returns non-empty output.
  2. The vGPU-enabled VMs are no longer stuck in the powering-on state.