Host Decommission: Incomplete Host Cleanup

This article explains how to safely decommission (remove) a host from your Platform9 Private Cloud Director environment using the pcdctl decommission-node command.

Problem

After running pcdctl decommission-node --force on a host, the following Platform9 service units were not cleanly removed and are still present on the host:

$ systemctl list-units --all | grep pf9
 pf9-libvirt-exporter.service                 not-found   inactive   dead      pf9-libvirt-exporter.service
 pf9-neutron-ovn-metadata-agent.service       not-found   active     running   pf9-neutron-ovn-metadata-agent.service
 pf9-node-exporter.service                    not-found   inactive   dead      pf9-node-exporter.service
 pf9-prometheus.service                       not-found   inactive   dead      pf9-prometheus.service
 pf9-remote-write.service                     not-found   inactive   dead      pf9-remote-write.service
 pf9-sidekick.service                         not-found   inactive   dead      pf9-sidekick.service
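To reduce output like the above to the unit names and their states, the listing can be filtered. This is a minimal sketch and not part of pcdctl; the function name pf9_leftover_units is hypothetical. It reads systemctl list-units output on stdin and prints the unit name, load state, and active state for each pf9 service.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pcdctl): summarize leftover pf9 units.
# Reads `systemctl list-units --all` output on stdin.
pf9_leftover_units() {
  awk '{
    # The unit name is not always the first field: systemd may prefix
    # not-found or failed units with a marker character.
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^pf9-.*\.service$/) {
        # Fields after the unit name are LOAD and ACTIVE.
        print $i, $(i + 1), $(i + 2)
        break
      }
    }
  }'
}

# Usage on the affected host:
#   systemctl list-units --all | pf9_leftover_units
```

A unit reported as not-found but active, such as pf9-neutron-ovn-metadata-agent.service in the listing above, is still running even though its unit file is gone.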

Environment

  • Component - pcdctl

Answer

This is a known issue tracked in PCD-3504. Use the workaround until the fix is integrated into the product.

Workaround

Manually clean up the stale packages on the impacted host using the following steps:

  1. Disable the affected services on the impacted host.

  2. Stop any of the services that can still be stopped.

  3. Remove unit files and drop-ins (common locations).

  4. Remove drop-in directories, if present.

  5. Reload systemd state.

  6. Purge leftover packages.

  7. Remove residual binaries/configs.

  8. Verify that the stale packages are cleaned up by re-running systemctl list-units --all | grep pf9.
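The steps above can be sketched as a small script. This is a hedged illustration, not the official procedure. The helper names (run, cleanup_pf9_unit) and the DRY_RUN switch are hypothetical. The unit directories are the standard systemd locations and may not all exist on your host. The residual paths in step 7 are the ones this article lists for decommissioning. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the manual cleanup for ONE stale unit. Prints the commands by
# default; set DRY_RUN=0 to execute them (requires root).
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

cleanup_pf9_unit() {
  local unit="$1" dir
  # Safety guard: only ever touch pf9-*.service units.
  case "$unit" in
    pf9-*.service) ;;
    *) echo "refusing to clean non-pf9 unit: $unit" >&2; return 1 ;;
  esac
  run systemctl disable "$unit"            # step 1
  run systemctl stop "$unit"               # step 2
  for dir in /etc/systemd/system /lib/systemd/system /usr/lib/systemd/system; do
    run rm -f "$dir/$unit"                 # step 3: unit files
    run rm -rf "$dir/$unit.d"              # step 4: drop-in directories
  done
  run systemctl daemon-reload              # step 5
  run systemctl reset-failed "$unit"       # clear any remembered failed state
}

# Steps 6-8 depend on the host and are left as comments (assumptions):
#   dpkg -l | grep pf9                     # Debian/Ubuntu: list leftover packages
#   apt-get purge <package>                # purge each package reported
#   rm -rf /etc/pf9 /opt/pf9 /var/opt/pf9  # step 7: residual configs and data
#   systemctl list-units --all | grep pf9  # step 8: expected to print nothing
```

Calling cleanup_pf9_unit pf9-sidekick.service prints the ten commands for that unit. Review them before re-running with DRY_RUN=0.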

Overview

Decommissioning a host with the pcdctl decommission-node command removes all Platform9 software, configurations, and data from it, returning the host to a clean state.


Prerequisites

Before decommissioning a host, ensure:

  1. All roles have been removed from the host (see Removing Roles)

  2. You have pcdctl installed on the host you want to decommission

  3. You have sudo/root access to the host

  4. Any workloads/VMs have been migrated off the host


Step 1: Remove Roles from the Host

Before decommissioning, you must remove all roles from the host by running pcdctl deauthorize-node.

What this does:

  • Removes all roles (hypervisor, image-library, etc.) from the host

  • Waits for role removal to complete (up to timeout)

  • Updates the database to reflect the changes


Step 2: Decommission the Host

Once all roles are removed, decommission the host by running pcdctl decommission-node.

What this does:

  • Verifies no roles are assigned to the host

  • Removes all Platform9 packages (pf9-hostagent, pf9-comms, etc.)

  • Deletes Platform9 configuration files from /etc/pf9

  • Removes Platform9 data from /opt/pf9 and /var/opt/pf9

  • Cleans up OVS bridges and network configurations

  • Resets the system to a clean state
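Steps 1 and 2 can be chained so that decommissioning never starts if role removal fails. The following is a minimal sketch under stated assumptions. It uses only the bare pcdctl deauthorize-node and pcdctl decommission-node commands named in this article, so any additional options your pcdctl version requires are not shown. The decommission_host function and the PCDCTL override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the standard two-step workflow (run with root/sudo access).
# PCDCTL is a hypothetical override so the flow can be rehearsed, for
# example with PCDCTL="echo pcdctl".
PCDCTL="${PCDCTL:-pcdctl}"

decommission_host() {
  # Step 1: remove roles. Stop here on failure so that decommissioning
  # never runs against a host that still has roles assigned.
  $PCDCTL deauthorize-node || {
    echo "deauthorize-node failed; not decommissioning" >&2
    return 1
  }
  # Step 2: remove Platform9 software from the host.
  $PCDCTL decommission-node
}
```

Running PCDCTL="echo pcdctl" decommission_host prints the two commands in order without changing the host.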


Forceful decommission

Warning: Do not use the --force option by default. It is intended only for emergency scenarios. Do not skip role removal, as this can create orphaned data on the management plane. We do not recommend using --force unless it is absolutely necessary.

❌ DON'T

  1. Don't use --force by default - it's for emergencies only

  2. Don't skip role removal - it creates orphaned data

  3. Don't interrupt decommission - let it complete

  4. Don't use --force without understanding the consequences

  5. Don't decommission production nodes without approval

  6. Don't decommission without documenting the reason (maintenance logs)

  7. Don't reuse the node immediately - verify it's clean first
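One way to verify that a node is clean before reuse is to check for the directories this article says decommissioning removes. The sketch below is hypothetical and not part of pcdctl. The pf9_residue function name and its optional root-prefix argument are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical check: report Platform9 directories that still exist.
# The optional first argument is a path prefix (default: none), useful for
# checking a mounted disk image instead of the live root filesystem.
pf9_residue() {
  local root="${1:-}" found=0 path
  for path in /etc/pf9 /opt/pf9 /var/opt/pf9; do
    if [ -e "$root$path" ]; then
      echo "leftover: $root$path"
      found=1
    fi
  done
  # Exit status 0 means nothing was found; 1 means residue remains.
  return "$found"
}
```

Pair this with systemctl list-units --all | grep pf9 to also catch stale service units.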

Case: Emergency or Management Plane Unreachable (Use --force)

Scenario: You need to decommission a host in one of these situations:

  • Host is corrupted or broken

  • Previous decommission failed

  • Management plane is unreachable and you need to proceed anyway

  • Emergency cleanup required

Solution: Run pcdctl decommission-node --force.


When to Use --force

SAFE to use when:

  1. Management plane is unreachable AND you already removed roles

    • Network issues prevent management plane access

    • You verified roles were removed before management plane went down

  2. Node is broken/corrupted

    • Previous decommission attempt failed

    • Packages are corrupted

    • Services won't stop cleanly

  3. Emergency cleanup

    • Host is being immediately retired

    • Security incident requiring immediate cleanup

    • Host crashed and needs forced cleanup

DO NOT use when:

  1. Normal operations - Use standard workflow (deauthorize-node → decommission-node)

  2. As a shortcut - Don't skip deauthorize-node out of convenience

  3. Unsure about role status - If you don't know if roles exist, check first

  4. Management plane is accessible - Use normal decommission to ensure clean removal


Decision Matrix

Your Situation | Command to Use | What Happens
Normal: Roles removed, management plane accessible | pcdctl decommission-node | ✅ Safe verification, stops on errors
Emergency: Node broken | pcdctl decommission-node --force | ⚠️ Continues through errors, aggressive cleanup
Management plane down: Roles already removed | pcdctl decommission-node --force | ⚠️ Asks "PROCEED WITHOUT VERIFICATION"
Management plane accessible: Roles still assigned | pcdctl deauthorize-node first! | ✅ Remove roles properly, then decommission
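The matrix can also be expressed as a small lookup. This is an illustrative sketch only. The choose_command function and its yes/no arguments are hypothetical. Giving the "node broken" row precedence over the others is an assumption. The combination the matrix does not cover (management plane down, roles not removed) is reported as unresolved rather than guessed.

```shell
#!/usr/bin/env bash
# Usage: choose_command ROLES_REMOVED MGMT_REACHABLE NODE_BROKEN
# Each argument is "yes" or "no". Prints the command from the matrix.
choose_command() {
  local roles_removed="$1" mgmt_reachable="$2" node_broken="$3"
  if [ "$node_broken" = "yes" ]; then
    # Emergency: node broken (assumed to take precedence).
    echo "pcdctl decommission-node --force"
  elif [ "$mgmt_reachable" = "yes" ] && [ "$roles_removed" = "no" ]; then
    # Roles still assigned: remove them first, then decommission.
    echo "pcdctl deauthorize-node"
  elif [ "$mgmt_reachable" = "yes" ]; then
    # Normal case.
    echo "pcdctl decommission-node"
  elif [ "$roles_removed" = "yes" ]; then
    # Management plane down, roles already removed.
    echo "pcdctl decommission-node --force"
  else
    # Not covered by the matrix.
    echo "unresolved: management plane down and roles not removed" >&2
    return 1
  fi
}
```

For example, choose_command yes no no prints pcdctl decommission-node --force, matching the "Management plane down" row.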


Troubleshooting

Problem: "Cannot proceed: Roles are still assigned"

Error message:

Solution: Remove roles first by running pcdctl deauthorize-node.

If deauthorize-node is stuck or failing, run pcdctl decommission-node --force to proceed anyway. This requires manual cleanup on the management plane if roles are still present.


Problem: Decommission fails with errors

Error message:

Cause:

  • OVS packages already removed

  • Corrupted installation

  • Missing dependencies

Solution: First try to resolve the error manually. If that does not work, run pcdctl decommission-node --force to continue through errors.

Force mode will:

  • Log the error as a warning

  • Continue with remaining cleanup

  • Show summary of all issues at end


Problem: "Role check timed out" or "Could not verify roles"

Symptoms:

Causes:

  • Management plane is down or unreachable

  • Network connectivity issues

  • Firewall blocking access

  • Management plane is very slow to respond

Solution:

If you already removed roles: run pcdctl decommission-node --force and expect the "PROCEED WITHOUT VERIFICATION" prompt.

If you haven't removed roles yet:


Problem: Deauthorize is stuck at "Waiting for roles to be deleted"

Symptoms:

Causes:

  • BBMaster service down on the management plane

  • RabbitMQ connectivity issues

  • pf9-hostagent service down on host

  • Role stuck in CONVERGING/DELETING state

Solutions:

Option 1: Increase timeout

Option 2: Check services
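On the host side, the pf9-hostagent service named in the causes above can be checked with systemctl. This is a minimal sketch. The check_host_service function is hypothetical and covers only the host. Checking BBMaster and RabbitMQ on the management plane is outside its scope.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a host-side service is running.
# Defaults to pf9-hostagent, the host service named in this article.
check_host_service() {
  local unit="${1:-pf9-hostagent}"
  if systemctl is-active --quiet "$unit"; then
    echo "$unit: active"
  else
    echo "$unit: NOT active (inspect with: systemctl status $unit)"
    return 1
  fi
}

# Usage on the host:
#   check_host_service
```

If the service is not active, its state and recent log lines are shown by systemctl status pf9-hostagent.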

Option 3: Force decommission
