GPU Host Convergence Failure Due to Invalid mdev_config Format

Problem

GPU-enabled compute hosts fail during convergence or upgrade while applying the pf9-ostackhost configuration.

The failure is observed during the set_config stage with errors similar to:

pf9_app.py INFO - Setting config for pf9-ostackhost.2026.1.2-7668
2026-04-23 22:08:11,217 - pf9_app.py ERROR - pf9-ostackhost:set_config failed:
err: Expecting value: line 1 column 1 (char 0)
command:/opt/pf9/hostagent/bin/python /opt/pf9/pf9-ostackhost/config --set-config '{...}'

Additional convergence failures may follow:

session.py ERROR - Exception during apps processing: <class 'pf9app.exceptions.ConfigOperationError'>
session.py INFO - Converge failed

Environment

  • Private Cloud Director Virtualization - v2025.4 and higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and higher

  • GPU hosts

Cause

The issue occurs because the persisted role configuration for GPU hosts contains an invalid value for mdev_config.

The newer pf9-ostackhost configuration parser expects mdev_config to be a valid JSON array. However, some existing GPU hosts contain the value as an empty string:

"mdev_config": ""

During the converge or upgrade process, the parser attempts to decode this value as JSON. An empty string is not valid JSON, so decoding fails with the error "Expecting value: line 1 column 1 (char 0)".

This causes the pf9-ostackhost:set_config step to fail and prevents successful converge completion.
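
The error text can be reproduced outside the product with Python's standard json module. This is only an illustrative sketch: parse_mdev_config is a hypothetical stand-in, not the actual pf9-ostackhost parser, and it assumes the parser decodes the stored string with a standard JSON decoder.

```python
import json


def parse_mdev_config(raw):
    """Decode a string-encoded mdev_config value (hypothetical stand-in)."""
    return json.loads(raw)


# An empty string is not a JSON document, so decoding raises an error.
try:
    parse_mdev_config("")
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

print(error_message)

# A string holding an empty JSON array decodes to an empty list.
empty_list = parse_mdev_config("[]")
print(empty_list)
```

The message captured for the empty string matches the err: line in the hostagent log.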

Diagnostics

  1. Check the hostagent logs for the configuration parsing error shown in the Problem section.

  2. Verify the converge failure messages.

  3. Source credentials before generating the API token.

  4. Generate the Keystone token required for the API calls.

  5. Verify that the token was generated successfully.

  6. Fetch the current role configuration from Resmgr.

  7. Inspect the returned payload for the invalid field "mdev_config": "".

  8. Confirm that the host is GPU-enabled and contains mdev/vGPU configuration.
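
Once the role configuration has been fetched and saved as JSON, the inspection for the invalid field can be scripted. The sketch below is a hypothetical helper, not part of any Platform9 tooling. It makes no assumption about where mdev_config sits in the payload, so it searches recursively, and it treats both a real list and a string that decodes to a list as valid.

```python
import json


def _is_json_array(value):
    """True if value is a list, or a string that decodes to a JSON array."""
    if isinstance(value, list):
        return True
    if isinstance(value, str):
        try:
            return isinstance(json.loads(value), list)
        except json.JSONDecodeError:
            return False
    return False


def find_invalid_mdev_config(node, path=""):
    """Return the path of every mdev_config entry that is not a JSON array."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "mdev_config":
                if not _is_json_array(value):
                    hits.append(child)
            else:
                hits.extend(find_invalid_mdev_config(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_invalid_mdev_config(item, f"{path}[{index}]"))
    return hits


# Example payload shaped like the invalid entry described in this article.
payload = json.loads('{"mdev_config": ""}')
print(find_invalid_mdev_config(payload))
```

An empty result means no invalid mdev_config entry was found in the payload that was checked.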

Resolution

Option 1 — Update using Resmgr API

This option is valid for both SaaS and Self-Hosted setups.

  1. Source the admin OpenStack credentials.

  2. Generate the Keystone token.

  3. Fetch the existing role configuration.

  4. Locate the invalid entry "mdev_config": "" in the payload.

  5. Replace the empty string with a valid empty JSON array ([]).

  6. Push the corrected configuration back to Resmgr using the PUT API.

  7. Retry the host converge or upgrade operation.
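
The payload correction itself can be sketched as a small Python function, which avoids hand-editing a large JSON document. fix_mdev_config is a hypothetical helper, and fetching and pushing the payload through the Resmgr API is deliberately left out. It assumes the corrected value should be a real JSON array ([]); if a healthy GPU host in your environment stores the field as the string "[]" instead, adjust the replacement accordingly.

```python
import copy
import json


def _fix_in_place(node):
    """Walk dicts and lists, replacing empty-string mdev_config with []."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "mdev_config" and value == "":
                node[key] = []  # assumption: a real JSON array is expected
            else:
                _fix_in_place(value)
    elif isinstance(node, list):
        for item in node:
            _fix_in_place(item)


def fix_mdev_config(payload):
    """Return a corrected copy of payload; the input is left unmodified."""
    fixed = copy.deepcopy(payload)
    _fix_in_place(fixed)
    return fixed


# Example: correct a payload fetched earlier and print the body to send back.
original = {"mdev_config": ""}
corrected = fix_mdev_config(original)
print(json.dumps(corrected))
```

Working on a copy keeps the fetched payload available for comparison before the corrected version is pushed.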

Option 2 — Update directly in the database

This option is valid only for Self-Hosted setups where users have access to the Resmgr database.

  1. Take a backup of the Resmgr database before making any modifications.

  2. Connect to the Resmgr database.

  3. Verify the current host role settings.

  4. Locate the invalid entry "mdev_config": "".

  5. Update the value to an empty JSON list ([]).

  6. Update the host role settings in the database with the corrected JSON payload.

  7. Retry the host converge or upgrade operation.
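
Because a direct database edit bypasses API validation, it can help to confirm that the corrected JSON text differs from the stored text only in mdev_config before writing it back. The following is a minimal sketch of such a check. only_mdev_config_changed is a hypothetical helper that works purely on the two JSON strings; it assumes mdev_config is a top-level key and makes no assumption about Resmgr table or column names.

```python
import json


def only_mdev_config_changed(original_text, corrected_text):
    """Return True if the only difference is mdev_config going from "" to [].

    Raises json.JSONDecodeError if either text is not valid JSON.
    """
    original = json.loads(original_text)
    corrected = json.loads(corrected_text)
    if not (isinstance(original, dict) and isinstance(corrected, dict)):
        return False
    if original.get("mdev_config") != "" or corrected.get("mdev_config") != []:
        return False
    # Compare everything else after removing the one field that may differ.
    original.pop("mdev_config")
    corrected.pop("mdev_config")
    return original == corrected


# example_setting is a made-up field used only for illustration.
before = '{"mdev_config": "", "example_setting": "x"}'
after = '{"mdev_config": [], "example_setting": "x"}'
print(only_mdev_config_changed(before, after))
```

If the check returns False, review the corrected text for accidental edits before updating the database.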

Validation

  1. Verify that the converge operation completes successfully.

  2. Confirm that the error "Expecting value: line 1 column 1 (char 0)" is no longer present in the hostagent logs.

  3. Confirm in the logs that the pf9-ostackhost configuration was applied successfully.

  4. Verify that the host returns to the Healthy/Ready state in the PCD UI.
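
The log check can be scripted by scanning the hostagent log text for the failure signatures quoted in the Problem section. This is a rough sketch with a hypothetical helper name. It only matches substrings, and because older failures remain in the log, it should be run against lines written after the retry.

```python
# Substrings taken from the error messages quoted in the Problem section.
ERROR_SIGNATURES = (
    "pf9-ostackhost:set_config failed",
    "Expecting value: line 1 column 1 (char 0)",
    "Converge failed",
)


def find_failure_lines(log_text):
    """Return every log line that contains one of the known failure signatures."""
    return [
        line
        for line in log_text.splitlines()
        if any(signature in line for signature in ERROR_SIGNATURES)
    ]


# Example with log lines like those shown in the Problem section.
sample_log = "\n".join(
    [
        "pf9_app.py ERROR - pf9-ostackhost:set_config failed:",
        "err: Expecting value: line 1 column 1 (char 0)",
        "session.py INFO - Converge failed",
    ]
)
for line in find_failure_lines(sample_log):
    print(line)
```

An empty result for the post-retry portion of the log is consistent with a successful fix, but the host state in the PCD UI remains the final check.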

Additional Information

  • This issue only affects GPU-enabled hosts using mediated device (mdev) configuration.

  • Non-GPU compute hosts are not impacted.

  • The issue originates from an invalid pre-existing persisted configuration and is not caused by the upgrade framework itself.

  • Setting mdev_config to a valid JSON list ([]) safely resolves the parsing issue while preserving normal GPU/vGPU functionality.
