Troubleshooting GPU Support

When you configure GPU support or deploy GPU-enabled VMs, you might encounter configuration errors or issues with GPU resource availability. The following troubleshooting information addresses specific problems that have been identified during GPU setup and operations.

Before troubleshooting GPU issues, verify that your GPU model is supported and that you have completed all infrastructure setup steps. Most GPU errors result from incomplete configuration or mismatched settings between hosts and flavors.

vGPU script fails with SR-IOV unbindLock error

When running the vGPU configuration script, SR-IOV configuration may fail with an unbindLock error that identifies the affected PCI device (for example, 0000:c1:00.0).

This error occurs when NVIDIA services are holding a lock on the PCI device, preventing SR-IOV configuration.

Prerequisites for this resolution

  • NVIDIA license and drivers are installed successfully
  • NVIDIA license server is created and configured

Resolution steps:

  1. Stop all NVIDIA-related services that might be holding the device lock:
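The exact commands depend on your NVIDIA vGPU software version; a typical sequence, assuming the standard nvidia-vgpu-mgr and nvidia-vgpud service names, is:

```bash
# Stop the NVIDIA vGPU services that can hold a lock on the PCI device
sudo systemctl stop nvidia-vgpu-mgr.service
sudo systemctl stop nvidia-vgpud.service
```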
  2. Remove and rescan the PCI device to reset its state:
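For example, using the standard Linux PCI sysfs interface (the device ID shown is the one from the example error; substitute your own as noted below):

```bash
# Remove the GPU from the PCI bus, then trigger a rescan to re-enumerate it
echo 1 | sudo tee /sys/bus/pci/devices/0000:c1:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan
```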

Replace 0000:c1:00.0 with your actual PCI device ID from the error message.

  3. Manually enable SR-IOV for the GPU device:
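On drivers that ship NVIDIA's sriov-manage utility, this can be done as shown below (the utility path can vary by driver package):

```bash
# Enable SR-IOV virtual functions on the physical GPU
sudo /usr/lib/nvidia/sriov-manage -e 0000:c1:00.0
```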
  4. Restart the NVIDIA vGPU manager:
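Assuming the same service names as in step 1:

```bash
# Start the vGPU manager services again
sudo systemctl start nvidia-vgpu-mgr.service
sudo systemctl start nvidia-vgpud.service
```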
  5. Verify SR-IOV configuration by checking for virtual functions:
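One way to check, using the same PCI device ID as above:

```bash
# Virtual functions appear as virtfn* symlinks under the physical GPU's PCI device
ls -l /sys/bus/pci/devices/0000:c1:00.0/ | grep virtfn
```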

After a successful resolution, the command lists one virtfn symlink per virtual function created on the GPU.

vGPU profiles missing during host configuration

When configuring vGPU on a GPU host, certain vGPU profiles may be missing from the available options, even though your GPU model supports them.

This occurs when existing mediated devices (stale or active) are present in /sys/bus/mdev/devices/. The system does not list vGPU profiles whose slice count is less than or equal to the number of existing devices.

For example, if 2 devices are present at /sys/bus/mdev/devices/, any vGPU profile that allows 2 or fewer GPU slices will not be listed under GPU hosts for configuration.

Resolution steps:

  1. Check for existing mediated devices:
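A quick check of the sysfs directory mentioned above:

```bash
# List any mediated devices currently defined on the host
ls /sys/bus/mdev/devices/
```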
  2. Count the number of devices present:
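For example:

```bash
# Count the mediated devices (0 means none are present)
ls /sys/bus/mdev/devices/ | wc -l
```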
  3. Stop NVIDIA vGPU services:
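Assuming the standard NVIDIA vGPU service names (these may differ with your driver version):

```bash
# Stop the vGPU services before removing mediated devices
sudo systemctl stop nvidia-vgpu-mgr.service
sudo systemctl stop nvidia-vgpud.service
```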
  4. Remove existing mediated devices:
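One way to do this, using the standard mdev sysfs remove attribute:

```bash
# Remove each mediated device by writing to its sysfs remove attribute
for dev in /sys/bus/mdev/devices/*; do
  [ -e "$dev" ] || continue   # skip if no devices are present
  echo 1 | sudo tee "$dev/remove"
done
```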
  5. Restart NVIDIA vGPU services:
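Assuming the same service names as in step 3:

```bash
# Start the vGPU services again
sudo systemctl start nvidia-vgpu-mgr.service
sudo systemctl start nvidia-vgpud.service
```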
  6. Verify the devices are cleared:
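For example:

```bash
# The directory should now be empty
ls /sys/bus/mdev/devices/
```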
  7. Return to the GPU host configuration in PCD to see the full list of available vGPU profiles.

VM creation fails with GPU model validation error

If you select a GPU model in the VM flavor that doesn't match the GPU configured on the host, the system reports a validation error during VM creation. This check ensures that the GPU selected in the flavor aligns with the underlying GPU hardware.

To resolve this issue, verify that the GPU model specified in your flavor matches the GPU model configured on the host.

GPU host authorization and visibility issues

GPU host does not appear as a listed GPU host

After running the GPU configuration script and authorizing the host, it should appear in the GPU host list with its compatibility (passthrough or vGPU), GPU model, and device ID.

If your GPU host does not appear:

  • Verify the GPU configuration script ran successfully on the host.
  • Confirm you rebooted the host after running the script.
  • Check that host authorization completed.
  • Ensure both host configuration and cluster have GPU enabled.

Script execution requirements

The GPU configuration script requires administrator privileges to run. Only administrators with access to the host and script should execute it. End users or developers requesting GPU resources do not need to run the script themselves.
